In this post I would like to describe some of the important tasks and considerations when developing a reliable extract – transform & load process as part of your data warehousing project. In a previous post, I described Data warehousing (DWH) “as the means with which data is delivered from different operational systems such as Purchasing, Stock or Sales systems for the business intelligence process”. In this post, we will focus on the E – extract step with details on the T & L steps to follow in the next posts.
Data warehousing involves bringing information together from disparate transaction systems, each of which are organised for online processing of small amounts of transactions by a large number of individual users. The ETL process enables this transaction data to be modified in terms of its structure and detail for analysis where large volumes of data are queried by a relatively small number of users.
Extracting the data
Before extracting any information, it is important to ensure that the correct data is being sourced in the appropriate ‘master’ system. For example, customer id and name may be available in three different systems, so let’s ensure that we draw the customer information from the system where it is most complete and up to date. By compiling a map of the source systems, the map helps to ensure that the right data is drawn from the correct ‘master’ system by containing; information on;
- Each of the ‘master’ systems
- The data that is provided by each system
- The format of the data
- Frequency of change
- The relevant database tables
When data is being extracted from a number of different transaction systems as part of the ETL process, this data is stored in a ‘Staging’ database.
Keep the data consistent
Consistency is another key requirement – when extracting data from different systems, it is important that that the data is valid and accurate. For example, if sales data is being extracted from a sales transaction system where the end of day update routines are still being run, taking in sales figures from different sales locations, then that sales data extract may not be accurate or fully up to date.
It is important to ensure that the data extraction happens after all transaction updates have been completed and that each transaction is date/time stamped to help with any date/time analysis that may be occurring later in the data warehouse.
There are a number of quality checks that can be applied to the extraction process;
- Using count functions to ensure that the same amount of extracted data exists in the Staging database as the relevant transaction systems
- The use of date/time data such as last business day or last transaction date can be used as a comparison with individual transaction date/time stamps to ensure that the extraction has been completed and no data from a different business period/day has been taken in error
A selection of tools to help you
While it is possible to develop a series of extraction scripts from scratch in Java or Php, there are a range of available tools for completing data extraction and most of these tools provide both a visual gui interface as well as a scripting capability for designing and completing the extraction functions, including;
Once all the necessary data has been extracted into a staging database, data transformation can begin. Transformation is regarded as the major part of the ETL process, where all the extracted data is cleansed and transformed into the appropriate detail for the final loading into the data warehouse.