Published in: Information Management Online

Author: Seth Rao
MAR 23, 2016 6:59am ET


Data Integrity: The Horseshoe Nail That’ll Cost You the War

How often does the unsexy but critical piece fail you, simply because nobody was paying attention to it? Without basic blocking and tackling, you don't have a prayer of winning. So what is so fundamental in the Big Data world, and yet so overlooked?

Boston Consulting Group* recently identified poor quality of Big Data as that "horseshoe nail" that could lose the war. It can erode as much as 25% of the full potential of decisions in marketing, bad-debt reduction, pricing, and more. Paying attention to that little thing can literally make you millions. With the increasing use of high-speed, high-volume, complex data in a variety of formats to support cross-functional operational processes such as marketing, compliance initiatives, analytics, and customer and product management, Data Quality (DQ) has become more important than ever in the age of Big Data. Having a lot of data in different volumes and formats arriving at high speed is worthless if that data is incorrect.

Cost of incorrect data:

Poor quality of Big Data results in compliance failures, manual rework to fix errors, inaccurate insights, failed initiatives, and lost opportunity. The current focus in most big-data projects is on ingesting, processing, and analyzing large volumes of data; data quality issues start surfacing during the analysis and operational phases. Our research estimates that an average of 25-30% of any big-data project is spent identifying and fixing data quality issues. In extreme scenarios, where data quality issues are significant, projects get abandoned. That is a very expensive loss of capability!

Fighting today’s Data Integrity battles with yesterday’s Data Quality tools:

In the “regular-data” world, where data volume and velocity are small and manageable, Data Integrity and Data Quality validation is either automated or manual. But when data flows at high volume and high speed, in different formats, from multiple sources, and through multiple platforms, validating it with conventional approaches is a nightmare. Conventional data validation tools and approaches are architecturally limited: they can handle neither the massive scale of Big Data volumes nor the processing-speed requirements.

Big-Data teams in organizations often rely on a number of these methods to validate the data quality:

– Profiling the source system data prior to the ingestion

– Matching the record count pre and post data ingestion

– Sampling the big-data to detect data quality issues
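The three methods above can be sketched in a few lines of plain Python. This is a minimal illustration, not any vendor's implementation; the record layout, field names, and rule are hypothetical stand-ins for real source and ingested datasets.

```python
import random

def profile(records, field):
    """Source profiling prior to ingestion: null rate and distinct count."""
    values = [r.get(field) for r in records]
    nulls = sum(v is None for v in values)
    return {"null_rate": nulls / len(values),
            "distinct": len(set(v for v in values if v is not None))}

def counts_match(source_records, ingested_records):
    """Pre/post ingestion record-count reconciliation."""
    return len(source_records) == len(ingested_records)

def sample_check(records, rule, sample_size=100, seed=42):
    """Validate a random sample against a deterministic DQ rule;
    returns the sampled records that fail the rule."""
    sample = random.Random(seed).sample(records, min(sample_size, len(records)))
    return [r for r in sample if not rule(r)]

source = [{"id": i, "amount": 10.0} for i in range(1000)]
ingested = source[:-3]  # simulate three records silently dropped in flight

print(profile(source, "amount"))                           # null rate, distinct count
print(counts_match(source, ingested))                      # → False
print(sample_check(ingested, lambda r: r["amount"] > 0))   # → [] (sample passes)
```

Note the limitation the article goes on to describe: the count check catches the dropped records, but the sample check passes even though data was lost, because sampling only sees what survived ingestion.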

Because of the architectural limitations of existing tools, teams hard-code Big Data scripts (e.g., Pig, Spark SQL) to perform some of these Data Quality checks, executing them in an ad-hoc manner during the development cycle. While these methods are somewhat effective at detecting errors, the scripts are susceptible to human error and to breakage when systems change. More importantly, these approaches are not effective during the operational phase, and they are not designed to detect hidden data quality issues such as transaction outliers. A transaction outlier is a transaction that is statistically different from the rest of the transaction set but passes all deterministic data quality tests. Identifying such transactions requires advanced statistical logic.
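One simple form of the statistical logic described above is a z-score test: flag any transaction whose amount sits many standard deviations from the mean, even though every deterministic rule passes. This is a hedged sketch, not a production method; the field names and the 3-sigma threshold are illustrative assumptions.

```python
import statistics

def deterministic_checks(txn):
    """Deterministic DQ rules: required fields present, amount positive."""
    return txn.get("id") is not None and txn.get("amount", 0) > 0

def find_outliers(transactions, z_threshold=3.0):
    """Return transactions that pass every deterministic rule yet are
    statistically anomalous (|z-score| of amount above z_threshold)."""
    amounts = [t["amount"] for t in transactions]
    mean = statistics.mean(amounts)
    stdev = statistics.pstdev(amounts)
    if stdev == 0:
        return []
    return [t for t in transactions
            if deterministic_checks(t)
            and abs(t["amount"] - mean) / stdev > z_threshold]

txns = [{"id": i, "amount": 100 + (i % 7)} for i in range(200)]
txns.append({"id": 999, "amount": 50_000})  # fields valid, amount absurd

print([t["id"] for t in find_outliers(txns)])  # → [999]
```

The outlier transaction would sail through every rule-based check; only the statistical test catches it.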

To make matters worse:

The problem is exacerbated when multiple big-data platforms are involved. For example, transactions from source systems may be dumped both to an operational NoSQL database and to an HDFS-based storage repository for reporting and analytics. In such a scenario, script-based solutions cannot work cohesively to provide an end-to-end view. You are doomed from the beginning!
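A cross-platform reconciliation check boils down to computing a platform-independent fingerprint of the same transaction set on each system and comparing the results. The sketch below simulates the two stores as plain Python lists; any resemblance to a real NoSQL or HDFS API is an assumption, and the record layout is hypothetical.

```python
import hashlib

def fingerprint(records):
    """Order-independent fingerprint of a record set: the row count plus
    an XOR of per-record hashes, so the same transactions produce the
    same value regardless of platform or storage order."""
    acc = 0
    for r in records:
        digest = hashlib.sha256(repr(sorted(r.items())).encode()).digest()
        acc ^= int.from_bytes(digest[:8], "big")
    return len(records), acc

# Hypothetical stand-ins for the NoSQL store and the HDFS repository.
nosql_rows = [{"id": i, "amount": 10.0 * i} for i in range(500)]
hdfs_rows = nosql_rows[:-1]  # one transaction never reached HDFS

print(fingerprint(nosql_rows) == fingerprint(hdfs_rows))  # → False
```

Because the fingerprint ignores ordering, the two sides can be scanned independently on each platform; a mismatch flags the divergence without shipping full datasets back and forth.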

What you need:

In our view, and that of BCG and others, you belong to an exclusive group of wise executives if you realize the importance of Big Data Quality from the very beginning. Current approaches are not scalable, not sustainable, and definitely not suitable for Big Data initiatives. Without a scalable, cross-platform, comprehensive, and automated solution for detecting data quality issues, organizations risk losing any return on their big-data initiatives.

Check out how new tools like DataBuck** can dramatically improve Data Integrity and Data Quality, by as much as 10x over your current situation.


* “How to Avoid the Big Bad Data Trap”, BCG Perspectives, June 2015