Restricted access

November 30, 2010

ETL Challenges: Data Formats

Filed under: ETL — Katherine Vasilega @ 6:54 am

ETL tools should be able to find a standard way of handling a large variety of source and target data formats. This feature is needed to implement comprehensive business rules in your data integration solution. This is how you can optimize an ETL tool to avoid issues with multiple data formats:

    • Make data normalization rules descriptive, don’t hide them in procedural code blocks. Enable business people to specify rules in terms that make sense to them.
    • Set clear identification and reporting rules to detect the impact of data format changes, as well as maintenance rules, when a change occurs.
    • Rules should be expressed in terms of context-independent concepts that can easily be referred to by business people.
    • Do not express business rules in physical field names, as it will require normalization functions to be created for each new source data format.

Some ETL tools force developers into a variety of steps and complex procedures for accommodating the variation in source and target data formats. They lack the features to clearly express business rules, and therefore will hardly ever be leveraged by business users.

November 29, 2010

ETL Tools: Functionality and Ease of Use

Filed under: ETL — Katherine Vasilega @ 8:21 am

I continue with helping you evaluate ETL tools. On the dawn of data integration, ETL tools were supposed to be used by IT people. Today, they have to be managed by business users. At the same time, the increasing complexity of data sets requires ETL tools to have many sophisticated functions and features. So what is the right combination of functionality and ease of use?

Here are some core features that are a must for modern ETL tools:

    • The ability to deal with multiple data formats
    • Built-in analysis functionality allowing the examination of source data
    • Built-in data transformation functions
    • Support for data quality checking
    • Support for cleansing functionality
    • Metadata support that provides for creating business rules
    • Task scheduling functionality
    • Error tracking and logging

The following functionality is required for ETL tools to be leveraged by business users:

    • GUI interface that enables to drag and drop data elements from the source to the target
    • Intuitive interface for building data maps
    • Easy management of data maps
    • The ability to manipulate data either inside or outside the target database
    • Comprehensive user manuals and demos

In addition, it is a good idea to avoid purchasing ETL tools, which are difficult to troubleshoot and maintain. Remember, that you should define your master data requirements and data integration goals before selecting an ETL tool.

November 15, 2010

ETL tool: Out of the Box or Custom?

Filed under: ETL — Katherine Vasilega @ 7:08 am

As companies seek for software to fit their data integration needs, the first question they need to decide upon is whether to build a custom or to obtain an out of the box ETL tool. The best answer to this question is simple: it depends on your situation. In any case, both approaches have to be carefully examined. I offer you to consider some advantages and disadvantages of each method.

Out of the Box ETL Tool

The primary reason to obtain an out of the box ETL tool is that it helps companies minimize time and cost required for developing the proprietary code. Many companies are confident that vendor ETL tools are more effective at delivering projects on time and within budget. There are also other advantages, such as a wide range of rich built-in features; the ability to be reused for future projects; simplicity of ETL processes maintenance.

On the other hand, a lot of time spent on evaluation, selection and learning of the product can be a disadvantage of ETL vendor tools. Training costs and support fees, as well as not fitting 100% to your business needs are also among the drawbacks.

Custom ETL Tool

Some organizations believe that custom software saves time and money on support and maintenance of their data warehouse, because custom ETL tools are 100% geared to their business needs. More flexibility for complex data sources and transformation mappings are also among the advantages of custom ETL tools. Meanwhile, time needed for the development life cycle is a big disadvantage of custom ETL tools. Also, these tools don’t provide all the rich functionality of ETL software packages and can be hardly reused for different projects.

Each approach has its own particular features across industries and they vary a lot depending on your requirements. Just keep in mind that the general tendency of the efficient ETL tool is to integrate data quickly, practically, and cost-effectively.

November 12, 2010

ETL Tool Types Explained

Filed under: ETL — Katherine Vasilega @ 7:12 am

Today, I’d like to continue talking about how to choose the appropriate ETL tool for your data integration or data migration needs. To make the right decision, you need to understand the types of ETL tools available on the market. ETL tools can be classified according to their function, engine type, data sources, development environment, and metadata support.

Many ETL tools have either strong transformation or strong cleansing capabilities. If your data can potentially be corrupted, make sure your ETL tool has strong cleansing capabilities. Logically, if various data transformations are a higher priority, pick a tool that is strong in transformation.

The Engine Type:
ETL tools can be server engine or client engine. The server engine ETL tool allows execution of multiple concurrent tasks from several computers. The client engine ETL tool is simpler and assumes that the ETL routines are developed and executed on the same machine.

Development Environments: ETL tools can be GUI- based or code-based. Code-based tools are dependent of the language they represent. Aside from general programming languages, several tools on the market utilize a custom-scripting language developed for the optimization of ETL processes. GUI-based ETL tools enable to remove the coding layer and allow non-technical users generate ETL processes.

Data Source: As there are many types of data sources, for each organization, there will be a different set of them. Make sure that the ETL tool you select can connect directly to your source data.

Metadata Support: ETL tools play a key role in your metadata management, because they map the source data to the target. Data mapping rules by themselves are an important piece of the metadata. It is crucial to select an ETL tool that works with your general metadata strategy.

In general, ETL tools depend on your business needs and technical capabilities. The best way to decide, which ETL tool is going to work for your organization, is to conduct a deep analysis of your data integration or data migration needs with a help of qualified professionals.

November 9, 2010

Metadata in Data Integration

Filed under: Data Integration, ETL — Tags: — Katherine Vasilega @ 4:00 am

Metadata unfolds the definitions, meaning, origin and rules of the data used in a data warehouse. When talking about metadata, it is important to keep in mind that there are two metadata types involved in the data integration and warehousing process: business metadata and technical metadata. These two types illustrate both business and technical approaches to data integration. Let’s have a look at both.

Business metadata stores business definitions of the data. It contains high-level definitions of all the fields present in the data warehouse. Business metadata is mainly used by report authors, data managers, testers, and analysts.

Technical metadata is a representation of the ETL process. It stores data mapping and transformations from source systems to the target and is mostly used by data integration specialists and ETL developers.

Most specialists agree that metadata management projects require a certain amount of customization. Well-built ETL tools take that need into consideration, and offer some means to extend functionality and provide metadata integration. Still, the metadata management needs of individual enterprises are so widely varied that trying to create a one-size-fits-all metadata data integration solution is kind of impossible.

November 5, 2010

ETL Support for Heterogeneous Technologies

Filed under: ETL — Katherine Vasilega @ 6:13 am

In my previous posts I often mentioned the ‘transform’ function of ETL tools. Today, I’m going to concentrate on ETL’s extract and load processes. These operations can be homogeneous or heterogeneous. Homogeneous operations involve extracting data from a repository and loading into a repository of the same technology. If two or more technologies are involved in the ETL process, it is called a heterogeneous operation.

The challenge with data integration is that there are always numerous repository technologies to be addressed. For example, if data is stored in SharePoint, Excel, and Web logs sources and has to be consolidated in the CRM, then extract must be performed on three different technologies and load on one technology.

In order to work with heterogeneous data structures, the ETL tool must support the following features:

    • Work with a variety of formats and databases
    • Convert flat files, unstructured data
    • Define mappings and transformation rules through a drag-and-drop interface and store these rules independently from the actual implementations
    • Have reusable components

Constructing extract connections to each source repository technology is the most technically challenging part of data integration. Using pre-existing extract connections that some ETL tools provide reduces this risk greatly.

November 3, 2010

How to Evaluate an ETL Tool

Filed under: ETL — Katherine Vasilega @ 8:44 am

Before making a decision about purchasing commercial software or downloading an open source ETL tool, you need to research the market. What are the features you should look for in the ideal ETL software? Here is a list of steps to help you decide if a certain ETL tool suits your requirements.

    • Which operating systems the ETL tool supports
    • What amount of data within a certain period the tool is capable of processing
    • What types of data sources have connectivity with this ETL tool
    • Check the data format that you are going to use with this tool: Text, CSV, XML, Databases (Oracle, MySQL, SQL Suite, DB2, Sybase), EDI, HIPAA, ACORD AL3, any fixed-length file format, or any other format
    • Is the ETL tool desktop or Web-based?
    • Find out about the process of enforcing the data validation rules. Are they user-friendly?
    • Does it offer comprehensive data mapping rules?
    • What are the logging features?
    • You also have to check the loads and find out how the ETL tools handles error conditions

It is also helpful to obtain a customer reference from each ETL vendor, or search community forums (if it is an open source tool) to get the reviews written by a company that needs the same functionality as your organization. You have to focus on key technical and business needs to make sure that the customer reference is helpful for the ETL evaluation process.

October 25, 2010

Apatar Open Source Data Integration v1.12.13 Release

Filed under: Data Integration, ETL, Open Source — Tags: — Katherine Vasilega @ 12:47 am

New Apatar Open Source Data Integration v1.12.13 has been released recently. It features numerous improvements, bug fixes and new features. Here is a summary of the new release’s highlights.

Apatar users can now copy/paste data components. This functionality is designed to save time on project’s configuration. Now you can copy/paste both single and linked datamap components, selecting any part of the datamap (including connectors, operations, and functions).

Improvements in the connector for E-mail.
There are several improvements in the connector for E-mail: You can now retrieve all or only unread mail from an IMAP mail server; filter E-mails on the server level; set an encrypted connection.

Improvements in database connectors. The other improvements include: displaing views, not only tables in database connectors; the connector for Salesforce CRM is set to prevent from exceeding the number of allowed query characters when updating the tables; specified Long Text fields used at the “Join On” tab of the Join operation into Text fields are transformed automatically, and much more.

You can learn all the details about the New Apatar Open Source Data Integration v1.12.13 release here.

October 22, 2010

Data Integration in Four Steps

Filed under: Data Integration, ETL — Katherine Vasilega @ 1:29 am

With this post, I’d like to explore the process of designing a data integration solution. What is the sequence of the development stages? Which stage is the most challenging one? Let’s have a closer look at each stage.

The first step in data integration is identification of common elements
, and this is how you can do it:

    • Identify the common entities. Once the common entity is identified, its definition should be standardized. Do Customers include permanent customers only or those who made single purchases as well?
    • Identify the common attributes. What are the attributes common to customers: first name, last name, date of purchase, anything else? Each attribute should be defined.
    • Identify the common values. The same information can be represented in different forms across multiple source systems. For example, sex, can be represented as ‘M’ or ‘1′ or ‘male’ or something else by each source system. A common representation must be defined.

The second step of data integration is the appointment of Data Steward, who will own the responsibility for a particular set of data elements. The Data Steward ensures that each assigned data element:

    • Has clear and unambiguous definition
    • Does not conflict with other data elements
    • Has enumerated value definitions if it is of code type
    • Is still being used
    • Is being used consistently in various systems
    • Has suitable documentation on appropriate usage

The third step of data integration is to design an ETL process to integrate the data into the target. This is the most important stage of data integration, which I have discussed in the previous posts.

The final step of data integration is to establish a process of maintenance, reviewing and reporting of data elements.

The key to successful data integration is a clear vision coupled with a comprehensive plan that will cover each stage of the process. Data integration is not an easy task to perform. Still, when being performed properly, it will help your company to lower your costs, improve your decision making process and make it more flexible, and make your company more successful.

October 21, 2010

Data Map Modeling as an Important Part of ETL Process

Filed under: ETL — Katherine Vasilega @ 6:01 am

One of the interesting phases in ETL is a data mapping stage. By the time the data-mapping phase starts, the target model and the source model will have been defined. But sometimes users pull out a report and require it to be a target. Is this the right approach? Absolutely not! Will it work? Probably, yes. Why?

Before building an actual data map for ETL, you have to define a data model for target even if it is not an actual database. It will help you understand the entities which it belongs to. Then it will be easier to find the similar entity in the source and map it to the target element. This is called building a logical data map model and you have to do it before you actually start mapping the data. Data map modeling includes the following stages:

    1. Identifying the data sources. The data modeling session indicates only the major source systems. It depends on the ETL team if it dives deeper and finds every source that can be used.
    2. Collecting and documenting source systems. You have to create the source systems tracking report that shows information about who is responsible for each source.
    3. Identifying the system-of-record (SOR). SOR is an information storage system that is required because in most enterprises data is re-processed in order to fit a new business use. That is why most of legacy systems have redundant and duplicated data.
    4. Creating and analyzing the entity relations of SOR. You have to build diagrams to analyze how two or more entities are related to one another.

After you have built a data map model for ETL, you have to analyze the data content, complete business rules for ETL process, integrate data sources to a single source and only then build an actual data map. This way, it will be easier for you to re-use data map components when you build new data maps for similar purposes.

« Older Posts