Harry Potter, The Elephant, The FBI and The Data Warehouse
In the ancient Indian parable of the elephant, six blind men touch an elephant and report six very different views of the same animal. Compare this scenario to a data warehouse that is getting data from six different sources. “Harry Potter and the Sorcerer’s Stone” as a field in a database can be written as “HP and the Sorcerer’s Stone” or as “Harry Potter I” or simply – “Sorcerer’s Stone”. In the data warehouse these are four separate movie titles. For a Harry Potter fan, they are the same movie. Now increase the number of movies to cover the entire Harry Potter series and further include fifty languages. You now have a set of titles which may perplex even a real Harry Potter aficionado.
What does this have to do with data analytics?
In our conversations with Information Management professionals, one common stumbling block to effective data analytics stands out: data quality. The root cause of this problem is that the same data field is described in different ways by different data sources. When we start collecting data from different data sources into a data warehouse, we often get multiple names for the same data field. In a typical Enterprise Data Warehouse (EDW) it is not uncommon to find the same customer referred to by five different names within the system. This introduces significant errors into data analytics and into the business decisions that depend on it. Enterprises that spend large amounts of money to create a business analytics solution are often tripped up by poor data quality, which undermines the investment in a business intelligence infrastructure.
Data mapping, in simple terms, means mapping data from different sources into a single format. Some of us have experienced this challenge when trying to merge the address books on our smartphones into mailing lists. After a few syncs, it is not uncommon to have multiple listings for the same person. In a business context, where the volume of data is much larger and multiple sources of information are used by business-critical applications, the problem becomes much more complex. This is often cited by our clients as the single largest contributor to the overall time and cost of running a business analytics operation. A common solution to this problem is to use Subject Matter Experts (SMEs) who manually map the data before loading it into a data warehouse. Needless to say, these are expensive resources, and this is not the best solution.
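To make the manual approach concrete, here is a minimal Python sketch of the kind of lookup table an SME might maintain by hand. The table contents, the `normalize` helper and the `map_title` function are illustrative assumptions, not part of any particular product.

```python
from collections import Counter

CANONICAL = "Harry Potter and the Sorcerer's Stone"

# Hypothetical hand-built mapping table: normalized variant -> canonical title.
TITLE_MAP = {
    "harry potter and the sorcerer's stone": CANONICAL,
    "hp and the sorcerer's stone": CANONICAL,
    "harry potter i": CANONICAL,
    "sorcerer's stone": CANONICAL,
}


def normalize(raw):
    """Lowercase, replace curly apostrophes and collapse whitespace."""
    return " ".join(raw.replace("\u2019", "'").lower().split())


def map_title(raw):
    """Return the canonical title, or None if nobody has mapped this variant yet."""
    return TITLE_MAP.get(normalize(raw))


# Four source systems, four spellings of the same movie.
incoming = [
    "Harry Potter and the Sorcerer's Stone",
    "HP and the Sorcerer's Stone",
    "Harry Potter I",
    "Sorcerer's Stone",
]

raw_counts = Counter(incoming)                            # four distinct titles
mapped_counts = Counter(map_title(t) for t in incoming)   # one title, counted four times
```

Every new variant needs a new row in the table, which is why this approach depends so heavily on expert time.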
There are several approaches to solving this problem. The most common approach is to manually create a map between two data sets. ETL (Extract, Transform, Load) tools offer graphical interfaces for building this map and automating the process. This method is a good solution for static data that undergoes only minor changes over time. It has limited applicability when the volume of data is very large and the data sets and data sources keep changing. The release of Harry Potter 8 in 50 languages can bring this process down. Another approach to data mapping is based on statistical analysis and heuristics and is called data-driven mapping. It works by establishing a pattern of transformation between two data sets. This is typically a self-learning system whose performance improves with time as these patterns are validated. Solutions such as Sypherlink Harvester incorporate this approach to create an automated data mapping process for large enterprises.
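The data-driven idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how Sypherlink Harvester or any other product works: the token-overlap score, the 0.75 threshold and the `TitleMapper` class are all hypothetical choices made for the example.

```python
def normalize(raw):
    """Lowercase, replace curly apostrophes and collapse whitespace."""
    return " ".join(raw.replace("\u2019", "'").lower().split())


def overlap_score(variant, candidate):
    """Fraction of the variant's words that also appear in the candidate."""
    v = set(normalize(variant).split())
    c = set(normalize(candidate).split())
    return len(v & c) / len(v) if v else 0.0


class TitleMapper:
    """Suggests a canonical title and remembers mappings a person has validated."""

    def __init__(self, canonical_titles, threshold=0.75):
        self.canonical_titles = list(canonical_titles)
        self.threshold = threshold  # assumed cut-off for automatic suggestions
        self.validated = {}         # normalized variant -> confirmed canonical title

    def suggest(self, raw):
        """Return (title, score); title is None when the match needs human review."""
        key = normalize(raw)
        if key in self.validated:
            return self.validated[key], 1.0
        best = max(self.canonical_titles, key=lambda c: overlap_score(raw, c))
        score = overlap_score(raw, best)
        if score >= self.threshold:
            return best, score
        return None, score

    def confirm(self, raw, canonical):
        """Record a validated mapping so later lookups no longer need review."""
        self.validated[normalize(raw)] = canonical


STONE = "Harry Potter and the Sorcerer's Stone"
CHAMBER = "Harry Potter and the Chamber of Secrets"
mapper = TitleMapper([STONE, CHAMBER])

auto = mapper.suggest("HP and the Sorcerer's Stone")  # confident match to STONE
unsure = mapper.suggest("Harry Potter I")             # ambiguous, flagged for review
mapper.confirm("Harry Potter I", STONE)               # a person validates the pattern
learned = mapper.suggest("Harry Potter I")            # now resolved without review
```

In this sketch "Harry Potter I" overlaps equally with both canonical titles, so the heuristic alone cannot decide; once a person confirms the mapping, the system stops asking. A production system would need a far more robust similarity measure, since a variant made only of common words such as "the" would match every title here.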
In 2005 the U.S. Department of Justice (DOJ) and the U.S. Department of Homeland Security (DHS) partnered to launch the National Information Exchange Model (NIEM) initiative to provide seamless information exchange. It was deployed by law enforcement organizations in state and local governments in Georgia to track criminals by mapping their data profiles across multiple databases. The National Data Exchange (N-DEx), deployed by the FBI, is based on NIEM and “connects the dots” between unrelated data on criminals and their attributes, such as locations, vehicles and case reports. This has led to the successful apprehension of criminals that would otherwise not have been possible. In 2010 the Department of Health and Human Services (HHS) joined NIEM. HHS faces a much bigger problem: mapping data from the health records of 300 million Americans. Data standards will be a fundamental requirement to solve this problem.
At the center of these applications of data mapping is the data warehouse that synthesizes the data into usable taxonomies. Sypherlink reports that 80% of data integration is data discovery and mapping. They further state that accurate, consistent information usually exists in multiple, disparate data sources on a wide variety of platforms. A well-implemented data mapping solution can make the difference between the success and failure of an enterprise. Scenarios ranging from business analytics of movie titles to apprehending criminals to managing electronic medical records create significant economic imperatives to invest the time and expense required to integrate data. We expect greater investment in data integration technologies over the next decade.