October 20, 2010

ETL: “T” For “Transformation”. Part 2

Filed under: ETL — Katherine Vasilega @ 7:06 am

I keep telling you that data transformation is the crucial stage of the ETL process, and today's post is devoted to the following ETL transformation types: character set conversion, unit of measurement conversion, date/time conversion, aggregation and summarization, key restructuring, and de-duplication.

Character set conversion. It is a common task to convert character sets to an agreed standard character set. There are ISO, Unicode, DOS, Apple, and Windows character sets, among others. When your source data lives on different types of hardware and operating systems, you will have to deal with this type of conversion.
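As a minimal sketch of this idea, the snippet below normalizes bytes arriving from a legacy Windows source to UTF-8; the choice of Windows-1252 as the source encoding and UTF-8 as the warehouse standard is an assumption for illustration.

```python
def to_utf8(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Decode bytes from the source encoding and re-encode as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")

# A name containing a non-ASCII character, as a legacy system might store it
legacy = "Müller".encode("cp1252")   # b'M\xfcller'
print(to_utf8(legacy))               # b'M\xc3\xbcller'
```

In practice the source encoding must be known per feed; guessing it from the bytes is unreliable.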

Unit of measurement conversion. Many companies today operate around the globe. If your company's headquarters is in the USA, while the production offices are in Europe, you will have to convert pounds into kilograms, miles into kilometers, and so on.
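A hedged sketch of such a conversion step: a lookup table of factors keyed by source and target unit. The factor values are standard; the table layout and function name are assumptions for illustration.

```python
# Conversion factors to the assumed metric target units
FACTORS = {
    ("lb", "kg"): 0.45359237,
    ("mi", "km"): 1.609344,
}

def convert(value: float, src: str, dst: str) -> float:
    """Convert a value between units using the factor table."""
    return value * FACTORS[(src, dst)]

print(round(convert(10, "lb", "kg"), 3))  # 4.536
print(round(convert(5, "mi", "km"), 3))   # 8.047
```

A real pipeline would attach the source unit to each feed's metadata so the right factor is applied automatically.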

Date/Time conversion. Dates and times have to be represented in standard formats. For example, the American and the British date formats may be standardized to an international format. October 20, 2010 is written as 10/20/2010 in the American format and as 20/10/2010 in the British format. This date may be standardized as 20 OCT 2010.
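The example above can be sketched in a few lines: both source formats are parsed with an explicit format string and emitted in the single agreed target format. The source format must be known per feed, since "10/11/2010" alone is ambiguous.

```python
from datetime import datetime

def standardize(date_str: str, source_format: str) -> str:
    """Parse a date in a known source format and emit it as 'DD MON YYYY'."""
    parsed = datetime.strptime(date_str, source_format)
    return parsed.strftime("%d %b %Y").upper()

print(standardize("10/20/2010", "%m/%d/%Y"))  # American -> 20 OCT 2010
print(standardize("20/10/2010", "%d/%m/%Y"))  # British  -> 20 OCT 2010
```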

Aggregation and summarization. Summarization is the addition of corresponding values along a single business dimension, e.g., adding up daily revenue values to calculate weekly totals. Aggregation refers to a summarization across different business dimensions. Commonly, both summarization and aggregation are deployed during data transformation in the ETL process.
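A minimal sketch of summarization along the time dimension: daily revenue rows rolled up to ISO-week totals. The field layout is invented for illustration.

```python
from collections import defaultdict
from datetime import date

# Daily revenue facts (date, revenue) -- an assumed source layout
daily = [
    (date(2010, 10, 18), 100.0),
    (date(2010, 10, 19), 150.0),
    (date(2010, 10, 25), 200.0),
]

# Roll up to weekly totals keyed by ISO week number
weekly = defaultdict(float)
for day, revenue in daily:
    weekly[day.isocalendar()[1]] += revenue

print(dict(weekly))  # {42: 250.0, 43: 200.0}
```

Aggregation across several dimensions works the same way, with a composite key such as `(week, region, product)` instead of the week alone.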

Key restructuring. When choosing keys for your database tables, avoid keys with built-in meanings. If you use the product code as the primary key, you are sure to face this issue: when a product is moved to another warehouse, the warehouse part of the product key has to change. Key restructuring in ETL is the transformation of such keys into generic keys generated by the system itself.
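A sketch of that transformation: each incoming natural key (here an invented warehouse-plus-product code) is assigned a system-generated surrogate key, and a mapping table keeps the assignment stable across repeated loads.

```python
import itertools

_counter = itertools.count(1)   # the system's surrogate key generator
surrogate_of = {}               # natural key -> surrogate key

def surrogate_key(natural_key: str) -> int:
    """Return the surrogate key for a natural key, assigning one if new."""
    if natural_key not in surrogate_of:
        surrogate_of[natural_key] = next(_counter)
    return surrogate_of[natural_key]

print(surrogate_key("WH01-PROD42"))  # 1
print(surrogate_key("WH02-PROD42"))  # 2 -- new natural key after a move
print(surrogate_key("WH01-PROD42"))  # 1 -- stable on repeat loads
```

A real warehouse would also record that the two natural keys refer to the same product, but that matching logic is a separate concern.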

De-duplication. In a customer database, some customers may be represented by several records for various reasons: incorrect data values caused by data entry errors, incomplete information, a change of address, etc. It makes sense to keep a single record for each customer and link all the duplicates to this single record. This process in ETL is called de-duplication of the customer file.
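A simple sketch of that linking step: records are grouped by a normalized match key (here a naive lowercased-name-plus-zip key, a stand-in for real matching rules), the first record per key survives, and each duplicate is linked to its survivor. All data is invented.

```python
customers = [
    {"id": 1, "name": "John Smith", "zip": "10001"},
    {"id": 2, "name": "JOHN SMITH", "zip": "10001"},  # duplicate entry
    {"id": 3, "name": "Jane Brown", "zip": "94105"},
]

survivor_by_key = {}   # match key -> surviving record id
links = {}             # duplicate id -> surviving id

for record in customers:
    key = (record["name"].lower().replace(" ", ""), record["zip"])
    if key in survivor_by_key:
        links[record["id"]] = survivor_by_key[key]
    else:
        survivor_by_key[key] = record["id"]

print(links)  # {2: 1}
```

Production-grade matching uses fuzzier rules (phonetic codes, edit distance, address standardization), but the survivor-and-link structure stays the same.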

As you can see, the amount of manipulation needed for the transformation process in ETL depends on the data. Accurate data sources will require little transformation, while inaccurate and outdated sources may require various transformation techniques to meet your business and technical requirements.

October 19, 2010

ETL: “T” For “Transformation”

Filed under: ETL — Katherine Vasilega @ 7:07 am

One of the main functions of an Extract, Transform, and Load (ETL) tool is to transform data. The transformation step is the most vital stage of building a structured data warehouse. With this post, I’d like to help you get a better understanding of the major transformation types in ETL. Here they are:

    • Format revision
    • Decoding of fields
    • Calculated and derived values
    • Splitting of single fields
    • Merging of information
    • Character set conversion
    • Unit of measurement conversion
    • Date/Time conversion
    • Summarization
    • Key restructuring
    • De-duplication

Today I will tell you about the following ETL transformation types: format revision, decoding of fields, calculated and derived values, splitting of single fields, and merging of information.

Format revision. Fields may mix numeric and text data types. If they do, you need to standardize them, changing the data type to text so that values are correctly perceived by users. Field lengths can also differ across sources, and you can standardize those as well.
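A minimal sketch of format revision, assuming the target standard is a fixed-width text code: every value is cast to text, trimmed, and zero-padded. The six-character width is an invented convention.

```python
def revise_format(value, width: int = 6) -> str:
    """Cast a mixed numeric/text code to trimmed, zero-padded text."""
    return str(value).strip().zfill(width)

print(revise_format(42))        # '000042' -- numeric source
print(revise_format(" 7801 "))  # '007801' -- text source with stray spaces
```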

Decoding of fields. In multiple source systems, the same data item is often described by a variety of field values. Also, many legacy systems are notorious for using cryptic codes to represent business values. This ETL transformation type changes codes into values that make sense to end users.
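A sketch of such a decoding step: a lookup table translates the cryptic codes into readable labels, with a fallback for codes the table does not know. The codes and labels here are invented for illustration.

```python
# An assumed decode table for a legacy single-letter status code
MARITAL_STATUS = {"M": "Married", "S": "Single", "D": "Divorced"}

def decode(code: str, table: dict, default: str = "Unknown") -> str:
    """Translate a legacy code into a readable value, tolerating case/spaces."""
    return table.get(code.strip().upper(), default)

print(decode("m", MARITAL_STATUS))   # Married
print(decode(" S ", MARITAL_STATUS)) # Single
print(decode("X", MARITAL_STATUS))   # Unknown -- unmapped code
```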

Calculated and derived values. Sometimes you have to compute the total cost and the profit margin before data can be stored in the data warehouse; these are examples of calculated values. You may also want to store a customer's age, derived from the date of birth, as a separate field; that would be an example of a derived value.
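Both examples from the text can be sketched directly, assuming invented field names: a calculated profit margin from price and cost, and an age derived from the date of birth.

```python
from datetime import date

def profit_margin(price: float, cost: float) -> float:
    """Calculated value: margin as a fraction of the selling price."""
    return (price - cost) / price

def age(born: date, today: date) -> int:
    """Derived value: age in whole years as of a reference date."""
    had_birthday = (today.month, today.day) >= (born.month, born.day)
    return today.year - born.year - (0 if had_birthday else 1)

print(round(profit_margin(125.0, 100.0), 2))       # 0.2
print(age(date(1980, 6, 15), date(2010, 10, 20)))  # 30
```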

Splitting of single fields. In earlier legacy systems, the first name, middle name, and last name, along with some other values, were often stored as one long text string in a single field. You need to store the individual components of names and addresses in separate fields in your data repository to improve operating performance by indexing and analyzing the individual components.

Merging of information. This type of data transformation in ETL does not literally mean merging several fields into a single field. Here, merging of information stands for establishing the connection between different fields, such as product price, description, and package type, and viewing these fields as a single entity.
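A minimal sketch of that idea: fields held in separate sources are related by a shared product identifier and presented as one logical entity. The source layouts and field names are invented.

```python
# Two assumed sources that each know part of the product
pricing = {"P-100": {"price": 19.99}}
catalog = {"P-100": {"description": "Steel bolt", "package": "box of 50"}}

def product_entity(product_id: str) -> dict:
    """Relate the scattered fields and view them as a single entity."""
    entity = {"product_id": product_id}
    entity.update(pricing.get(product_id, {}))
    entity.update(catalog.get(product_id, {}))
    return entity

print(product_entity("P-100"))
```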

I will continue to overview ETL transformation types in my next post. See you soon!

October 13, 2010

ETL Alternatives: Data Federation Technology

Filed under: ETL — Katherine Vasilega @ 8:01 am

Today, I’ll continue talking about ETL alternatives that a company can use to satisfy its specific data integration needs. One of the relatively new approaches to data integration is data federation. This technology enables a single virtual view of enterprise information across multiple data repositories, without a need to move or copy data.

While ETL moves data into a single repository, the virtual database created by data federation technology doesn’t contain the data itself. Instead, it contains information about the actual data and its location. When a user wants to access the data, he or she sends a query specifying the needed information, and the federation server immediately delivers it as a virtual, integrated view.

This approach to data integration is especially useful if a company's data is stored by a third-party cloud service provider. It allows business analysts to aggregate and organize data quickly without having to synchronize or copy it. Data federation as an ETL alternative is also helpful when a user needs quick reporting. Additionally, data federation can save the cost of creating a permanent, physical relational database.

The disadvantage of data federation, however, is actually one of its strengths: it prevents data from being changed. This feature is great for retaining historical accuracy, but awkward for companies looking to continually update their data. As long as a company doesn’t need to constantly change information, data federation can be a good choice for the data integration strategy.

October 12, 2010

ETL Alternatives: ELT

Filed under: ETL — Katherine Vasilega @ 8:06 am

With plenty of data integration strategies to choose from, most organizations automatically look for an ETL (extract-transform-load) data integration tool. However, depending on the particular business needs, a company may require a different method for their data integration initiative, such as data federation, ELT, and real-time data integration. Today, I will talk about ELT (Extract, Load, and Transform) technology.

The ELT method is also known as in-database integration. With this process, most of the data transformations occur after the data has been loaded into the target database. This reordering of the transformation step is the main difference between ETL and ELT, and it leads to some other major differences:

• ETL functions at the integration server level, while ELT functions at the infrastructure level, using SQL and related procedural features of the relational database to optimize performance.
• Transforming the data after it has reached the target helps minimize costs, because software licensing and development costs in Data Warehousing initiatives can be much larger than the costs of infrastructure.
• The ELT method leverages the convenience of virtualization and cloud computing, which helps speed up processing.
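The pattern above can be sketched with an in-memory database: raw rows are loaded first, untransformed, and the cleaning and aggregation then run inside the database engine as SQL. Table and column names are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE staging_sales (region TEXT, amount TEXT)")

# Extract + Load: raw, untransformed values go straight into the target.
con.executemany(
    "INSERT INTO staging_sales VALUES (?, ?)",
    [("east", "100.5"), ("WEST", "200"), ("East", "50")],
)

# Transform: type casting, normalization, and aggregation happen in SQL,
# so the database engine does the heavy lifting.
rows = con.execute("""
    SELECT UPPER(region) AS region, SUM(CAST(amount AS REAL)) AS total
    FROM staging_sales
    GROUP BY UPPER(region)
    ORDER BY region
""").fetchall()

print(rows)  # [('EAST', 150.5), ('WEST', 200.0)]
```

In a real warehouse the transformed result would be written to a reporting table rather than fetched, but the load-then-transform order is the point.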

Choosing the right method is quite subjective, and people tend to stick to what they know. However, you can begin by carefully analyzing the requirements related to functionality, volumes, performance, and costs of both technologies.

October 7, 2010

Should ETL Support HTML?

Filed under: ETL — Katherine Vasilega @ 3:57 am

While ETL stands for Extract, Transform, and Load, the most important and useful function of the tool is to transform the data so it can be properly structured.

Traditional ETL vendors are most effective at extracting and loading data from sources that can be accessed in traditional ways: through SQL, XML, or program APIs. Their ETL products are not very capable when it comes to transforming data from Web-based applications. Meanwhile, when accessing data from the Web, the transformation capability is crucial, since the data there is typically unstructured. Remember: the more structure you can add to your data, the higher the value you will get.

This is one of the reasons you have to identify your technical criteria and business parameters before deploying the ETL tool. You have to consider the following:

• Do you need support for Web services?
• How scalable is the tool?
• Does it provide a trial version?
• Does it run the required transactions at the speed you need?
• What will the total cost of ownership be?
• Is there a community behind the ETL tool to fix bugs, develop new connectors, and create data maps?

Whether you are using ETL processes for data warehousing, business intelligence, or legacy systems integration, it's a good idea to choose a tool that allows users to extract and load data to and from all of your data sources, including the Web. This matters because the data we need for business analysis and decision making changes ever more rapidly, and access to up-to-date information lets you respond to constantly changing business needs.

October 1, 2010

Data Integration: 3 Most Common Mistakes

Filed under: Data Integration, Data Quality, ETL — Katherine Vasilega @ 4:51 am

Implementing a data integration solution is not an easy task. There are some common mistakes that companies tend to make in data integration, and they result in delayed projects, increased costs, and reduced data quality. Today, I'd like to focus on the three most common of them.

1. Lack of a comprehensive approach

Data integration is not only about gathering requirements, determining what data is needed, creating the target databases, and then moving data. You have to develop a comprehensive data integration approach that will provide for:

• Processing complex data, such as products and customers, in relation to facts, such as business transactions
• Filtering and aggregating data
• Handling data quality
• Capturing changes and avoiding gaps in historical data.

2. Missing data quality requirements

You may think that data quality problems are simply data errors or inconsistencies in the transactional systems that can be easily fixed. The truth is that you have to prevent quality problems at the initial stage of a data integration process. You have to plan how to set data quality requirements, incorporate data quality metrics into your system architecture, monitor those metrics in all your data integration processes, and report on data quality.

3. Using custom coding instead of ETL

While most businesses consider ETL the best practice, a lot of companies still use custom coding, creating countless data shadow systems. Keep in mind that custom code makes programs difficult to manage and maintain, does not offer centralized storage of programs, limits metadata capabilities, and has a longer development cycle. Besides, debugging is more difficult with custom code than with an ETL tool. Moreover, an ETL tool usually has a user-friendly interface, provides centralized storage of programs, and is relatively easy to customize.

By thinking ahead about all these issues before developing and implementing a data integration solution, you are sure to save time, money, and valuable data.

September 28, 2010

What Is The Difference Between ETL and EAI?

Filed under: EAI, ETL — Katherine Vasilega @ 3:48 pm

There has been some talk lately about a merger of EAI (Enterprise Application Integration) and ETL (Extract, Transform, Load) tools. I've decided to research how these tools differ and what they have in common.

Both kinds of tools deal with transformation and metadata, feature connectors, and follow service-oriented architecture. They provide new capabilities for solving business intelligence and reporting issues and are an essential part of Enterprise Information Management (EIM). The key distinction: ETL is used for data integration, while EAI is used for application integration.

Entities vs Processes

An ETL tool takes entities from one or more data sources and puts them into a target. These entities can be invoices, contacts, budgets, etc. It’s not the order of things that is important to a data model, but the relations existing between the entities. The ETL tool coordinates multiple entities and their relationships as data objects.

An EAI tool coordinates multiple entities and their relationships within a given process. EAI deals with transactions within the process, not with entities. In this case, the order in which things happen is vital, and the relationships between entities matter less than the relationships between transactions.

In other words, EAI is generally process-oriented, while ETL is generally entity-oriented.

Convergence vs Divergence

ETL is about moving and transforming large amounts of data from many sources into one place. This process can be referred to as convergence. EAI, in its turn, is often used to move small amounts of data, spreading these transactions across various systems, which is known as divergence.

Business Cases

ETL is utilized for data migration and data integration to allow better decision making. EAI is utilized for process optimization and workflows, and to make sure that data is entered only once and is properly used by systems and applications. The business case for ETL is business intelligence and decision making, while EAI serves IT, e-business, and better workflow.

From what I've learned, ETL and EAI tools are unlikely to merge completely at present. Though both are important for the Enterprise Information Management market, each tool is used for different purposes, and sometimes you have to combine them. Each concept is likely to remain and keep its own users.

September 17, 2010

Data Mapping for Data Integration

Filed under: Data Integration, Data Migration, Data Synchronization, Database Integration, ETL — Katherine Vasilega @ 6:43 am

Your data sources grow together with your business. You have ERP and CRM systems, mail clients, Web forms, and Excel documents, and it's getting harder to tell which data is accurate. Data integration can solve this issue, but how do you transfer data from multiple sources in a nice and easy way? You'd probably need to deploy a system that automates the process of data transfer.

What is data mapping?

Data mapping is used in data integration when you need to gather information from multiple sources. It involves matching fields between a source and a target, e.g., two databases that contain the same data elements but call them by different names. A simple example of data mapping is moving the value from a 'customer name' field in one database to a 'customer last name' field in another. To do so, your ETL tool needs to know to take the value from the source field 'customer name', cut off the first part (the first name), keep the second part (the last name), and move it to the target field 'customer last name'. The order of these operations also needs to be specified in the data integration process.
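A minimal sketch of a declarative field map, assuming invented field names: each target field is paired with a source field and an optional transform, and the mapper applies them in order.

```python
def last_name(value: str) -> str:
    """Keep only the last part of a 'customer name' value."""
    return value.split()[-1]

FIELD_MAP = {
    # target field:       (source field,    transform or None)
    "customer_last_name": ("customer name", last_name),
    "customer_city":      ("city",          None),
}

def map_record(source: dict) -> dict:
    """Apply the field map to one source record."""
    target = {}
    for tgt, (src, fn) in FIELD_MAP.items():
        value = source[src]
        target[tgt] = fn(value) if fn else value
    return target

print(map_record({"customer name": "Ann Lee", "city": "Boston"}))
# {'customer_last_name': 'Lee', 'customer_city': 'Boston'}
```

Graphical mapping tools generate essentially this kind of table from the lines you draw between fields.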

Data mapping tools

Modern ETL systems include the functionality for making data maps. Commonly, these are graphical mapping tools: they enable you to draw a line from one field to another, identifying the correct connection. It's relatively easy to do if you want to, let's say, move your contacts from the mail client to the CRM. But what if the task is more complicated, such as moving the information received through a Web contact form (first name, last name, address, phone, email, company name) to a CRM that has different fields for all these values? It will take much time to do it manually, and it will take some time to draw a data map in your ETL tool, unless you are an IT specialist.

Open source ETL

Remember, if you are using an open source ETL tool, there is always a community behind it. People from all over the world create data maps for various purposes and make them available for free download. You can find great maps for complicated data integration tasks and use them for free. There is no need to draw a data map of your own; just use what has already proved effective. That way you can execute your data integration with minimal effort and expense.

September 13, 2010

Data Transformation with ETL

Filed under: Data Integration, ETL — Katherine Vasilega @ 4:15 am

Finding the necessary information can turn into a tremendously time-consuming process when stored data is spread across different platforms and systems. ETL tools help solve this problem efficiently. Extract-transform-load software enables you to keep data of different types and from various sources in one place, so that it can later be accessed with a single click.

Today, I'd like to concentrate on data transformation in the ETL process. Data transformation applies a set of rules to remove useless data and produce relevant, complete, and standardized data. The data transformation process includes several stages:

  • Sort – the data is sorted by type to be structured.
  • Clean – the data that violates the given rules is changed to fit these rules in the ETL process.
  • Summarize – values are summarized and stored at multiple levels in fact tables.
  • Derive – new data is created from the source data. For example, customers’ age is calculated based on their date of birth and the current year.
  • Aggregate – data elements are collected from multiple source files and databases.
  • Integrate – as a result of this ETL process, each data element is given one standard name with one standard definition. Data integration brings together different names and values for the same data element.
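The stages above can be sketched as a chain of small functions, each taking and returning a list of records. The rules and field names shown are illustrative, not a prescribed implementation.

```python
def clean(rows):
    """Clean: drop records that violate the rule 'amount must be present'."""
    return [r for r in rows if r.get("amount") is not None]

def derive(rows):
    """Derive: create a new USD amount from the source amount and rate."""
    for r in rows:
        r["amount_usd"] = round(r["amount"] * r.get("rate", 1.0), 2)
    return rows

def sort_stage(rows):
    """Sort: order the records for structured loading."""
    return sorted(rows, key=lambda r: r["amount_usd"])

pipeline = [clean, derive, sort_stage]

rows = [{"amount": 10.0, "rate": 1.3}, {"amount": None}, {"amount": 5.0}]
for stage in pipeline:
    rows = stage(rows)

print([r["amount_usd"] for r in rows])  # [5.0, 13.0]
```

Real ETL tools express these stages declaratively, but conceptually each is just such a record-level function.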

One more thing to mention: data transformation is only a part of the whole data integration process, which is first of all aimed at satisfying your business needs. That's why data integration rules should be roughly 80% business rules and 20% technical process.

So, it’s a good idea to outline the business rules first, then consider the technical sides, and only then choose the suitable ETL tool.

September 10, 2010

How to Efficiently Perform Data Integration

Filed under: Data Integration, ETL — Katherine Vasilega @ 7:44 am

The data integration process is a lot about changing the way your company works. It happens quite often that company managers are not ready for innovation and think it will take too much time and effort to perform quality data integration. They can be really afraid of possible data loss, and, as my experience shows, even when they are eventually ready, they may still prefer doing it the old-fashioned way: with hand-coded integration.

Today, I'd like to tell you how to efficiently perform data integration and why it is more secure and productive to choose a data integration platform over hand-coded integration.

Do the planning. I mention it in most of my posts, and though it's a basic thing, it's still neglected in data integration projects. Data integration has never been a single-day process, and successful companies focus on it and find resources for thorough planning.

Don't make IT specialists do it by themselves. They can ruin the entire data integration process for at least two reasons. First, they have too many tools and approaches to perform the integration with. Second, they are technology-focused experts: what they will do is get the application working right, not keep your data right, safe, and accurate. As a result of such miscommunication, your company will have to pay for mistakes and redundant work.

Get business analysts involved. You have to figure out what your data integration needs are and what the best way to satisfy them is. Business analysts will help you define your priorities and spend your integration budget properly. Working as a team will ensure that everyone sees the data integration process the same way.

Select data integration software. Fortunately, there is plenty of software that supports data integration standards and will ensure your data quality. There are many tools on the market, ranging from expensive to free, that provide the performance and flexibility required to perform safe and accurate data integration with no need for hand-coding. Intuitive user interfaces will help your company's employees perform data integration without bugging the IT department.

As you can see, efficient data integration is not a single-day process. It requires thorough planning, engaging both business and IT specialists, and careful selection of integration software. Equipped with all this knowledge, you are sure to have control over the process and perform it successfully.
