Transformers 101 – transforming data in a warehouse
Transformation is regarded as the major part of the ETL process, where all the extracted data is cleansed and transformed into the appropriate detail for the final loading into the data warehouse. Starting with cleansing the data, when loading data from a number of different transaction systems, it is important to go through the following steps to ensure that high quality data is being prepared for the data warehouse by;
- Standardise descriptions, attributes and currencies. For example, in one transaction system, ‘Customer Service’ could be referenced while in another system, it may be known as ‘Customer Relations’. Similarly if data from different countries is being amalgamated, ensure that a common currency and an agreed exchange rate is used. If a garden rake is considered part of the garden equipment group, then this group attribute should be used consistently in all references to the garden rake, irrespective of the system that the particular data is taken from.
- Correcting data and spelling. For example, ensure the correctness of data by using one standard, all references to RB Consulting should be loaded as ‘RB Consulting’ and so any references to RBConsulting or RBC should be converted. Spelling is important, as transaction data is taken from systems where there can be a high volume of manual data entry and errors can occur!
- Complete a de-duplication exercise by either selecting whatever record is deemed the best quality or merging similar records. The de-duplication can require judgment and input from staff familiar with the data, particularly if for example relevant data such as address or date of birth is missing from the transaction record. However database scripts can be used to isolate and group this suspect data for analysis.
- Consistent data types are important – sometimes when data is extracted, it may not be in a consistent data format. For example, dates of birth as characters rather than date format, order amounts as characters, rather than as numerical format. These inconsistencies can be resolved either in the Extract or Transformation stages.
Calculating and deriving data
Not all of the data that you require for your analysis will have been extracted directly from the transaction systems; additional data may be required by aggregating and/or completing calculations using the extracted data. For example, using data on total sales and total customer returns, you may calculate a ‘Sales net of returns’ figure.
The amount of aggregation that will happen during the Transformation process is be driven by the level of detail required for analysis in the data warehouse (i.e. the ‘fact grain’). If high level queries with a minimal need for drilling down into the individual details, then data can be aggregated during the transformation process.
Dimensional modelling – the final transformation
As was noted earlier, data in online transaction processing systems is designed in a normalised manner so as to structure the data tables for efficient transaction processing. A dimensional data model, consisting of facts and dimensions is required to enable efficient data analysis and querying in a data warehouse where;
- Facts consist of the data that the business is trying to measure, such as sales, salaries, stock levels or profit
- Dimensions are the attributes with which a business is analysing the data, such as time, geographical regions, or products
Another perspective of the difference between Facts and Dimensions is that Facts are usually numeric and Dimensions are hierarchical (e.g. Time: Year, Quarter, Month)
A star schema is one approach to creating your dimensional data model, where a fact table is in the centre of your star and each of the required dimension tables are at the points of the star.
With the transformation of the data into a dimensional model in the staging area, it is now ready to be Loaded into the data warehouse – the final step of the ETL process.
[...] all the necessary data has been extracted into a staging database, data transformation can begin. Transformation is regarded as the major part of the ETL process, where all the extracted data is cleansed and [...]