Restricted access

October 20, 2010

ETL: “T” For “Transformation”. Part 2

Filed under: ETL — Katherine Vasilega @ 7:06 am

I keep on telling you about data transformation viewed as the crucial stage of the ETL process and today’s post will be devoted to the following ETL transformation types: character set conversion, unit of measurement conversion, date/time conversion, aggregation (summarization), key restructuring, and de-duplication.

Character set conversion.
It is a common task to convert character sets to an agreed standard character set. There are ISO character sets, Unicode character sets, DOS character sets, Apple character sets, Windows character sets, etc. When your source data is on different types of hardware and operating systems, you will have to deal with this type of conversions.

Unit of measurement conversion. Many companies today are located around the globe. If your company’s headquarters is in the USA, while the production offices are in Europe, you will have to convert lb. into kg, ml into km, and so on.

Date/Time conversion.
Date and time have to be represented in standard formats. For example, the American and the British date formats may be standardized to an international format. The date of October 20, 2010 is written as 10/20/2010 in the American format and as 20/10/2010 in the British format. This date may be standardized to be written as 20 OCT 2010.

Aggregation and summarization
. Summarization is the addition of corresponding values in a single business dimension, e.g., adding up revenue values by day to calculate weekly totals. Aggregation refers to a summarization across different business dimensions. Commonly, both summarization and aggregation can be deployed during data transformation in the ETL process.

Key restructuring. When choosing keys for your database tables, you have to avoid the ones with built-in meanings. If you use the product code as the primary key, you are sure to face this issue. If the product is moved to another warehouse, the warehouse part of the product key will have to be changed. Restructuring in the ETL is the transformation of such keys into generic keys produced by the system itself.

De-duplication. In a customer database some customers may be represented by several records for various reasons: incorrect data values because of data entry errors, incomplete information, change of address, etc. It makes sense to keep a single record for one customer and link all the duplicates to this single record. This process in ETL is called de-duplication of the customer file.

As you can see, the amount of manipulation needed for the transformation process in ETL depends on the data. Accurate data sources will require little transformation, while inaccurate and outdated sources may require various transformation techniques to meet your business and technical requirements.

No Comments »

No comments yet.

RSS feed for comments on this post. TrackBack URL

Leave a comment