October 13, 2010

ETL Alternatives: Data Federation Technology

Filed under: ETL — Katherine Vasilega @ 8:01 am

Today, I’ll continue talking about ETL alternatives that a company can use to satisfy its specific data integration needs. One of the relatively new approaches to data integration is data federation. This technology enables a single virtual view of enterprise information across multiple data repositories, without a need to move or copy data.

While ETL moves data into a single repository, the virtual database created by data federation technology doesn’t contain the data itself. Instead, it contains information about the actual data and its location. When a user wants to access the data, he or she sends a query specifying the needed information, and the federation server immediately delivers it as a virtual, integrated view.
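The idea can be sketched in a few lines of Python. This is a toy illustration, not a real federation product: all class and table names here are hypothetical. The point is that the "virtual database" stores only metadata about where each source lives, and rows are fetched from the live sources at query time rather than copied in advance.

```python
# Minimal sketch of the federation idea: the catalog holds only metadata
# (which callable reaches which source); data is fetched on demand.

class FederationServer:
    def __init__(self):
        # table name -> callable that fetches live rows from the real source
        self.catalog = {}

    def register(self, table, fetch):
        """Store metadata about a source; no data is copied at this point."""
        self.catalog[table] = fetch

    def query(self, table, predicate=lambda row: True):
        """Resolve the virtual table to its source and deliver a live view."""
        fetch = self.catalog[table]
        return [row for row in fetch() if predicate(row)]

# Two independent "repositories": in reality these could be a CRM database
# and a billing system; here they are just lists.
crm_rows = [{"customer": "Acme", "region": "EU"},
            {"customer": "Globex", "region": "US"}]
billing_rows = [{"customer": "Acme", "balance": 1200}]

server = FederationServer()
server.register("customers", lambda: crm_rows)
server.register("invoices", lambda: billing_rows)

# The integrated view is computed at query time from the live sources.
eu_customers = server.query("customers", lambda r: r["region"] == "EU")
print(eu_customers)  # [{'customer': 'Acme', 'region': 'EU'}]
```

Because `query` calls back into the source every time, the view always reflects the current state of the underlying repositories, which is exactly what distinguishes federation from a physical copy.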

This approach to data integration is especially useful if a company's data is stored by a third-party cloud service provider. It allows business analysts to aggregate and organize data quickly without having to synchronize or copy it. Data federation as an ETL alternative is also helpful when a user needs quick reporting. Additionally, data federation technology can save the cost of creating a permanent, physical relational database.

The disadvantage of data federation, however, is actually one of its strengths: it prevents data from being changed. This feature is great for retaining historical accuracy, but awkward for companies looking to continually update their data. As long as a company doesn't need to constantly change information, data federation can be a good choice for its data integration strategy.

October 12, 2010

ETL Alternatives: ELT

Filed under: ETL — Katherine Vasilega @ 8:06 am

With plenty of data integration strategies to choose from, most organizations automatically look for an ETL (extract-transform-load) data integration tool. However, depending on its particular business needs, a company may require a different method for its data integration initiative, such as data federation, ELT, or real-time data integration. Today, I will talk about ELT (Extract, Load, and Transform) technology.

The ELT method is also known as in-database integration. With this process, most of the data transformations occur after the data has been loaded into the target database. The order in which the data is transformed is the main difference between ETL and ELT, and it leads to some other major differences:

• ETL functions at the integration server level, while ELT functions at the infrastructure level, using SQL and related procedural features of the relational database to optimize performance.
• Transforming the data after it has reached the target helps minimize costs, because software licensing and development costs in Data Warehousing initiatives can be much larger than the costs of infrastructure.
• The ELT method leverages the convenience of virtualization and cloud computing, which helps speed up processes.
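The difference described above can be shown with a small sketch. This is a deliberately simplified example using Python's built-in SQLite engine; the table and column names are invented. Note how the data is loaded raw first, and the cleaning and aggregation are then pushed down into the database as set-based SQL, which is the defining trait of ELT:

```python
import sqlite3

# In-memory database stands in for the target data warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_sales (region TEXT, amount TEXT)")

# 1. Extract + Load: dump the source data into the target as-is,
#    inconsistent casing and string-typed numbers included.
raw = [("eu", "100"), ("EU", "250"), ("us", "75")]
conn.executemany("INSERT INTO raw_sales VALUES (?, ?)", raw)

# 2. Transform inside the database, letting the SQL engine do the work.
conn.execute("""
    CREATE TABLE sales AS
    SELECT UPPER(region) AS region, SUM(CAST(amount AS INTEGER)) AS total
    FROM raw_sales
    GROUP BY UPPER(region)
""")

totals = dict(conn.execute("SELECT region, total FROM sales ORDER BY region"))
print(totals)  # {'EU': 350, 'US': 75}
```

In a real deployment the same pattern applies, only the transformation step runs on the warehouse's own (usually far more powerful) SQL engine instead of on a separate integration server.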

Choosing the right method is quite subjective, and people tend to stick to what they know. However, you can begin by carefully analyzing the requirements related to functionality, volumes, performance, and costs of both technologies.

October 7, 2010

Should ETL Support HTML?

Filed under: ETL — Katherine Vasilega @ 3:57 am

While ETL stands for Extract, Transform, and Load, the most important and useful function of the tool is to transform the data so it can be properly structured.

Traditional ETL vendors are most effective at extracting and loading data from sources that can be accessed in traditional ways through SQL, XML, or program APIs. Their ETL products are not very functional when it comes to transforming data from Web-based applications. Yet when accessing data from the Web, the transformation capability is crucial, since the data there is typically unstructured. Remember: the more structure you can add to your data, the more value you will get from it.
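To make the point concrete, here is a minimal sketch of what "adding structure" to Web data means, using only Python's standard library. It turns an HTML table into rows and columns; a real ETL tool would of course handle far messier markup, but the principle is the same:

```python
from html.parser import HTMLParser

# Toy transformer: impose tabular structure on HTML, which arrives as
# markup rather than as neat SQL rows or XML records.

class TableExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], None, False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row.append(data.strip())

page = ("<table><tr><th>product</th><th>price</th></tr>"
        "<tr><td>Widget</td><td>9.99</td></tr></table>")
parser = TableExtractor()
parser.feed(page)
print(parser.rows)  # [['product', 'price'], ['Widget', '9.99']]
```

Once the data is in this row-and-column form, the usual loading step can treat it like any other structured source.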

This is one of the reasons you have to identify your technical criteria and business parameters before deploying an ETL tool. You have to consider the following:

• Do you need support for Web services?
• How scalable is the tool?
• Does it provide a trial version?
• Does it run the required transactions at the speed you need?
• What will the total cost of ownership be?
• Is there a community behind the ETL tool to fix bugs, develop new connectors, and create data maps?

Whether you are using ETL processes for data warehousing, business intelligence, or legacy systems integration initiatives, it's a good idea to choose a tool that allows users to extract and load data to and from all of your data sources, including the Web. This is important because the data we need for business analysis and decision making is changing ever more rapidly, and access to up-to-date information lets you respond to ever changing business needs.

October 1, 2010

Data Integration: 3 Most Common Mistakes

Filed under: Data Integration, Data Quality, ETL — Katherine Vasilega @ 4:51 am

Implementing a data integration solution is not an easy task, and there are some common mistakes that companies tend to make, resulting in delayed projects, increased costs, and reduced data quality. Today, I'd like to focus on the three most common ones.

1. Lack of a comprehensive approach

Data integration is not only about gathering requirements, determining what data is needed, creating the target databases, and then moving data. You have to develop a comprehensive data integration approach that provides for:

• Processing complex data, such as products and customers, in relation to facts, such as business transactions
• Filtering and aggregating data
• Handling data quality
• Capturing changes and avoiding gaps in historical data
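The last point, capturing changes, can be sketched very simply. The example below is hypothetical (keys, fields, and values are invented): it compares a new source snapshot against the previously loaded one, so the integration process records every insert, update, and delete instead of silently overwriting history:

```python
# Snapshot comparison: a basic way to capture changes between loads.

def detect_changes(previous, current):
    """Both arguments map a business key to its attribute values."""
    inserts = {k: v for k, v in current.items() if k not in previous}
    updates = {k: v for k, v in current.items()
               if k in previous and previous[k] != v}
    deletes = {k: previous[k] for k in previous if k not in current}
    return inserts, updates, deletes

yesterday = {"C1": {"name": "Acme", "city": "Berlin"},
             "C2": {"name": "Globex", "city": "Paris"}}
today = {"C1": {"name": "Acme", "city": "Munich"},   # customer moved
         "C3": {"name": "Initech", "city": "Oslo"}}  # new customer

inserts, updates, deletes = detect_changes(yesterday, today)
print(sorted(inserts), sorted(updates), sorted(deletes))
# ['C3'] ['C1'] ['C2']
```

Production change-data-capture tools read database logs instead of diffing snapshots, but the goal is the same: no change to the source should leave a gap in the warehouse's history.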

2. Missing data quality requirements

You may think that data quality problems are simply data errors or inconsistencies in the transactional systems that can be easily fixed. The truth is that you have to prevent quality problems at the initial stage of a data integration process. You have to plan how to set data quality requirements, incorporate data quality metrics into your system architecture, monitor those metrics in all your data integration processes, and report on data quality.
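As an illustration of what such metrics might look like, here is a small sketch that computes two common ones, completeness and key uniqueness, over a batch of records. The field names and the sample data are made up; the idea is that these numbers are computed during integration, so problems surface before the data reaches the target:

```python
# Two basic data quality metrics computed over a batch of records.

def quality_metrics(rows, key_field, required_fields):
    """Return per-field completeness and key uniqueness, both as ratios."""
    total = len(rows)
    completeness = {
        f: sum(1 for r in rows if r.get(f) not in (None, "")) / total
        for f in required_fields
    }
    keys = [r.get(key_field) for r in rows]
    uniqueness = len(set(keys)) / total
    return completeness, uniqueness

rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},               # missing email
    {"id": 2, "email": "c@example.com"},  # duplicate key
]
completeness, uniqueness = quality_metrics(rows, "id", ["email"])
print(round(completeness["email"], 2), round(uniqueness, 2))  # 0.67 0.67
```

In a real pipeline, values like these would be checked against thresholds on every run and reported, rather than printed once.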

3. Using custom coding instead of ETL

While most businesses consider ETL the best practice, there are still a lot of companies that use custom coding to create countless data shadow systems. Keep in mind that custom code makes programs difficult to manage and maintain, does not offer centralized storage of programs, limits metadata capabilities, and has a longer development cycle. Besides, debugging is more difficult with custom code than with an ETL tool. Moreover, an ETL tool usually has a user-friendly interface, provides centralized storage of programs, and is relatively easy to customize.

By thinking ahead about all these issues before developing and implementing a data integration solution, you are sure to save time, money, and valuable data.

December 9, 2009

Open Source ETL and BI Tools Deployments to Grow Says Gartner

Filed under: Data Integration, ETL, Open Source — Olga Belokurskaya @ 10:25 am

In one of its latest press releases, Gartner made some interesting predictions regarding open source ETL and other kinds of BI tools. Though it's too early to put open source data integration platforms on the same level as complex proprietary solutions, the deployment of open source tools shows solid growth.

Gartner has pointed out that open source ETL and BI tools are most frequently used by mid-sized businesses and by the government and public sectors. Many ISVs also use open source BI solutions to add functionality to their own applications.

What is even more interesting: large vendors of proprietary commercial tools are seriously looking for ways to address the challenge from open source data integration and BI offerings, as the latter become their lower-cost competitors.

Given all this, I think that open source data integration, ETL, and other kinds of business intelligence solutions have earned the confidence of the business side and are expected to become even more widely used in the near future.

April 25, 2009

Data Warehousing Pros and Cons

Filed under: Data Warehousing — Olga Belokurskaya @ 4:32 am

First, let's remember what a data warehouse is, and why it may be useful for a business.

In essence, it is an electronically stored repository of an organization's data, designed to facilitate reporting and analysis. The broader meaning of data warehouse focuses not only on data storage: the means to retrieve and analyze data, to extract, transform, and load data, and to manage the data dictionary are also considered essential components of a data warehousing system.

Nowadays, data warehousing is a popular management technique and is frequently used as a business model. However, not every system is applicable to every business setting, so when thinking about implementing the strategy, one should consider the pros and cons of data warehousing.

Among the major benefits of data warehousing are enhanced access to data and information and easy reporting and analysis. Besides:

  • Data retrieval is faster within data warehouses.
  • Prior to loading data into the data warehouse, inconsistencies are identified and resolved.
  • Data warehouses can work in conjunction with and, hence, enhance the value of operational business applications, such as, for example, CRM systems.

And here are some cons:

  • Preparation is very frequently time consuming, because effort is needed to create a cohesive, compatible system of data collection, storage, and retrieval. Moreover, because data must be extracted, transformed, and loaded into the warehouse, there is an element of latency in data warehouse data.
  • Compatibility with existing systems. The use of data warehousing technology may require a company to modify the database system already in place. Given the cost of the computer systems and software needed, this could be the foremost concern of businesses adopting the model.
  • Security flaws that data warehousing technology may contain. If the database contains sensitive information, its use may be restricted to a limited group of people, and precautions will be required to ensure that access is not compromised. Limited data access situations can also affect the overall utilization of the data strategy.
  • Over their life, data warehouses can have high costs. A data warehouse is usually not static: it gets outdated and needs regular maintenance, which may be quite costly.

So, before any implementation, one should make sure that data warehousing will be a good fit for the business and be prepared to commit to the level of work required to get the system in place. However, once the data warehouse starts working, most companies are glad to have their “corporate memory.”

March 6, 2008

ETL – what's in the abbreviation?

Filed under: ETL — Alena Semeshko @ 5:24 am

ETL (data Extraction, Transformation and Loading) is an integral part of data warehousing.

In brief, ETL processes are used to extract data from various sources, transform it by cleaning and integrating, and finally load it into data warehouses. The data you get in the end is clean, well-structured, systematic, and ready to use. It sounds quite easy, but in reality there’s much more to it. There’s a catch to each step.

During the first step, data extraction, the challenge you are dealing with is the bulk of data coming from different sources: different departments, databases, formats, reporting systems, etc. This scattered data needs to somehow be captured and moved into the staging database.
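A stripped-down sketch of that capture step might look like this. The formats and table names are invented for illustration: a CSV export and a list of in-application records stand in for the scattered sources, and an SQLite table stands in for the staging database. Note that the rows go in unchanged; transformation comes later:

```python
import csv
import io
import sqlite3

# Two "scattered" sources in different formats.
csv_export = "customer,amount\nAcme,100\nGlobex,200\n"
app_records = [("Initech", 50)]

# Staging database: everything is captured here as-is.
staging = sqlite3.connect(":memory:")
staging.execute("CREATE TABLE staging_orders (customer TEXT, amount INTEGER)")

for row in csv.DictReader(io.StringIO(csv_export)):
    staging.execute("INSERT INTO staging_orders VALUES (?, ?)",
                    (row["customer"], int(row["amount"])))
staging.executemany("INSERT INTO staging_orders VALUES (?, ?)", app_records)

count = staging.execute("SELECT COUNT(*) FROM staging_orders").fetchone()[0]
print(count)  # 3
```

The hard part in practice is not the inserts but reliably reaching every source system; still, the shape of the step is the same.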

Next — data transformation. This step is the most complicated. It can be broken into four separate steps under the umbrella of transformation:
- data verification – comparing the extracted data against the DW quality standards. If data doesn’t meet the outlined standards, it either gets rejected or is held to be reviewed by the administrator.
- data cleaning – the data left from the previous step is made more precise. (The techniques this stage includes are so many that they deserve a different post.)
- data mapping – merging data from different sources into a single interface, structuring it into columns and tying it together logically.
- data consolidation (or aggregation) – summarizing data from the previous step and performing overall calculations to provide the user with a more complete picture.
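The four sub-steps above can be sketched on toy data. Everything here (field names, the verification rule, the sample rows) is invented for illustration; the point is only to show how each sub-step's output feeds the next:

```python
raw = [
    {"source": "shop", "customer": "acme ", "amount": "100"},
    {"source": "crm",  "customer": "Acme",  "amount": "40"},
    {"source": "shop", "customer": "",      "amount": "15"},  # no customer
]

# 1. Verification: reject rows that fail the (made-up) quality standard.
verified = [r for r in raw if r["customer"].strip()]

# 2. Cleaning: make the surviving values more precise and consistent.
cleaned = [{"customer": r["customer"].strip().title(),
            "amount": int(r["amount"])} for r in verified]

# 3. Mapping: merge records from different sources under one logical key.
mapped = {}
for r in cleaned:
    mapped.setdefault(r["customer"], []).append(r["amount"])

# 4. Consolidation: summarize to give the user the overall picture.
consolidated = {customer: sum(amounts) for customer, amounts in mapped.items()}
print(consolidated)  # {'Acme': 140}
```

Real transformations are vastly richer at every one of these stages, but the pipeline shape (verify, clean, map, consolidate) is the same.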

Finally, loading. This step simply uploads the data organized during transformation into a warehouse.

During this whole process, the one thing to be careful of is losing data. The ETL process does not presuppose changing the initial data; it should only make it better: cleaner, more correct, and better organized.