December 13, 2010

Data Integration Best Practices: Using Taxonomies

Filed under: Data Integration, Data Quality — Katherine Vasilega @ 8:20 am

Data taxonomies are tree-structured classification systems, which provide increasing refinement of classes as you go deeper into the tree. Here are some tips for working with taxonomies when building a data integration solution.

    1. If the data is rich enough, you might not need taxonomies at all, as you may be able to find what you need using a keyword search. Taxonomies are only needed when there is no other data available to assist classification.

    2. Your taxonomy is never going to go away once you have it. Nodes are only going to be added to it, not removed. So keep it as small and simple as you can, and try to minimize the addition of new nodes.

    3. Understand what kind of taxonomy will be used in the data integration solution. Most taxonomies are designed with human browsing in mind. Others are built to reduce the search space for an item when the data set is large. There may also be a need to automatically classify a data item into the taxonomy. The features that make a taxonomy easy for business users to browse are not the same ones that make it easy for electronic systems to process.

    4. If you need a taxonomy for electronic systems, try to keep it small. This makes classifiers much easier to build.

    5. Have a precise data-labeling policy. Never label a data point with both a parent and a child class from the taxonomy.
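
The labeling rule in tip 5 can be checked automatically. Below is a minimal sketch in Python, assuming the taxonomy is stored as a child-to-parent mapping. The category names and the `ancestors` and `label_conflicts` helpers are made up for illustration and are not part of any particular tool. The check is also slightly broader than the tip, since it flags any ancestor, not only the direct parent.

```python
# Hypothetical taxonomy, stored as child -> parent (None marks the root).
PARENT = {
    "products": None,
    "electronics": "products",
    "laptops": "electronics",
    "phones": "electronics",
    "furniture": "products",
}


def ancestors(node, parent=PARENT):
    """Return the set of all ancestors of node, walking up to the root."""
    found = set()
    current = parent[node]
    while current is not None:
        found.add(current)
        current = parent[current]
    return found


def label_conflicts(labels, parent=PARENT):
    """Return sorted (ancestor, descendant) pairs that appear together in labels."""
    labels = set(labels)
    return sorted(
        (anc, label)
        for label in labels
        for anc in ancestors(label, parent) & labels
    )


# A record labeled with both a class and one of its ancestors breaks the policy.
print(label_conflicts({"laptops", "electronics"}))
# Sibling or unrelated classes are fine.
print(label_conflicts({"laptops", "furniture"}))
```

A check like this could run at ingestion time, so that conflicting labels are rejected or reduced to the most specific class before they reach the integrated data set.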

Keep in mind that sometimes you will need to ingest a new data source into the existing system. This data source will have its own classification, which will not be quite compatible with the existing one. This is why you should generally avoid deep and highly refined taxonomies in your data integration solution.
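
One way to picture the ingestion problem is a small mapping table from the new source's categories to nodes of the existing taxonomy. The sketch below is only an illustration under assumed names: the categories, the `SOURCE_TO_EXISTING` table, and the fallback-to-root behavior are all hypothetical choices, not a prescribed method. With a shallow target taxonomy, most source categories have an obvious home, and the table stays short.

```python
# Hypothetical existing taxonomy, kept deliberately shallow (child -> parent).
EXISTING_PARENT = {
    "products": None,
    "electronics": "products",
    "furniture": "products",
}

# Hypothetical hand-maintained mapping from the new source's categories
# to nodes of the existing taxonomy.
SOURCE_TO_EXISTING = {
    "Ultrabooks": "electronics",
    "Smartphones": "electronics",
    "Office Chairs": "furniture",
}


def map_category(source_category, mapping=SOURCE_TO_EXISTING, fallback="products"):
    """Map a source category to an existing node, or to the fallback node if unknown."""
    return mapping.get(source_category, fallback)


def unmapped(source_categories, mapping=SOURCE_TO_EXISTING):
    """List source categories that have no mapping yet, for manual review."""
    return sorted(set(source_categories) - set(mapping))


incoming = ["Ultrabooks", "Garden Gnomes", "Office Chairs"]
for category in incoming:
    print(category, "->", map_category(category))
print("needs review:", unmapped(incoming))
```

The deeper and more refined the existing taxonomy is, the more entries in such a table become judgment calls, which is the practical cost the paragraph above warns about.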