Key Takeaway

Health insurance companies lose millions of dollars each year to poor healthcare Data Integrity and Data Quality in critical business processes such as claims, enrollment and membership, billing and payment, population management, and pricing. To comply with the Affordable Care Act (ACA), health insurance organizations are turning to data for help, but they are challenged by hidden data inaccuracies and a lack of Data Integrity between systems. IT organizations routinely sink significant cost into weeding out data errors. Because of architectural limitations, existing Data Integrity validation tools have become cost-prohibitive, especially when dealing with the large data volumes that are common at most health insurance companies.

Newer solutions that leverage Big Data technologies “under the hood,” like DataBuck+, can reduce costs by over 50% while increasing healthcare Data Quality by over 10x. With hardly any fixed costs (SaaS) and quick setup (under 30 minutes), they can be rapidly adopted and jettisoned as needed.


Health insurance organizations are increasingly turning to big data analytics to reduce fraud and abuse, control costs, increase customer loyalty, and improve operational efficiency as they move toward a more retail-oriented, value-based insurance marketplace. They are analyzing the massive amounts of claims, clinical, billing, and customer service data at their disposal. Our experience shows that the Data Integrity (DI) and Data Quality (DQ) of claims and clinical data is often far from pristine. For example, key fields in the claims data are often left blank or incorrectly coded and do not align with the clinical data. Analytics teams often spend more than 30% of their time ensuring data quality before they can analyze the data, and Data Integrity issues often result in costly manual rework.

Fighting Healthcare’s Data Integrity battles with yesterday’s Data Quality tools: In the “regular-data” world, data volume and velocity are manageable, and Data Quality validation is either automated or manual. But when data flows at high volume and high speed, in different formats, from multiple sources and through multiple platforms, validating it with conventional approaches is a nightmare. Conventional data validation tools and approaches are architecturally limited: they cannot handle the massive scale of Big Data or meet its processing speed requirements.

Big Data teams often rely on several of the following methods to validate Data Integrity and Data Quality:

  • Profiling the source system data prior to ingestion
  • Matching the record count before and after data ingestion
  • Sampling the big data to detect data quality issues
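
The three methods listed above can be illustrated with a minimal Python sketch. This is not taken from any particular tool; the function names, the field names (claim_id, diagnosis_code, amount) and the sample claims are all hypothetical, and a real pipeline would run equivalent logic on the big-data platform itself rather than on in-memory lists.

```python
import random
from collections import Counter

def profile_blank_rates(records, fields):
    """Profiling: fraction of records in which each field is missing or blank."""
    total = len(records)
    blanks = Counter()
    for rec in records:
        for field in fields:
            value = rec.get(field)
            if value is None or str(value).strip() == "":
                blanks[field] += 1
    return {f: (blanks[f] / total if total else 0.0) for f in fields}

def counts_match(source_records, ingested_records):
    """Record-count matching: True if nothing was dropped or duplicated."""
    return len(source_records) == len(ingested_records)

def sample_records(records, k, seed=0):
    """Sampling: draw up to k records for manual or scripted inspection."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))

# Hypothetical claims extract (illustrative values only).
source = [
    {"claim_id": "C1", "diagnosis_code": "E11.9", "amount": 120.0},
    {"claim_id": "C2", "diagnosis_code": "", "amount": 80.0},
    {"claim_id": "C3", "diagnosis_code": "I10", "amount": None},
    {"claim_id": "C4", "diagnosis_code": "J45.909", "amount": 95.5},
]
ingested = source[:3]  # simulate one record lost during ingestion

blank_rates = profile_blank_rates(source, ["claim_id", "diagnosis_code", "amount"])
ingestion_ok = counts_match(source, ingested)
spot_check = sample_records(ingested, 2)
```

Note that a matching record count says nothing about whether the field values survived ingestion intact, and a sample can miss rare errors, which is part of why these methods on their own leave gaps.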

Drawbacks of existing tools: The architectural limitations of existing tools force teams to hard-code Data Integrity checks in Big Data scripts (e.g. Pig, Spark SQL). These scripts are executed in an ad hoc manner during the development cycle. While they are somewhat effective at detecting errors, the scripts themselves are often susceptible to human error and to errors introduced by system changes. More importantly, these approaches are not effective during the operational phase. Nor are they designed to detect hidden data quality issues such as transaction outliers. A transaction outlier is a transaction that is statistically different from the rest of the transaction set but passes all deterministic data quality tests. Identifying such transactions requires advanced statistical logic.
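
To make the idea of a transaction outlier concrete, here is a minimal Python sketch. It uses a modified z-score based on the median and the median absolute deviation (MAD), which is one common statistical technique and is an assumption made for illustration, not a description of how any specific product works. The claim amounts, the function name and the 3.5 threshold are likewise hypothetical.

```python
from statistics import median

def find_outliers(amounts, threshold=3.5):
    """Return indexes of values whose modified z-score exceeds the threshold.

    The modified z-score is 0.6745 * |x - median| / MAD, where MAD is the
    median absolute deviation. It is less distorted by extreme values than
    a mean and standard deviation would be.
    """
    med = median(amounts)
    mad = median([abs(a - med) for a in amounts])
    if mad == 0:
        return []  # no spread to measure against; this sketch flags nothing
    return [
        i for i, a in enumerate(amounts)
        if 0.6745 * abs(a - med) / mad > threshold
    ]

# Hypothetical claim amounts: every value is a positive number, so all of
# them pass a deterministic "amount must be present and > 0" rule.
claim_amounts = [100.0, 105.0, 98.0, 102.0, 97.0, 101.0, 99.0, 5000.0]

passes_deterministic_rule = all(a is not None and a > 0 for a in claim_amounts)
outlier_indexes = find_outliers(claim_amounts)
```

In this example the 5000.0 claim is well-formed and passes the rule-based check, yet it is the only value flagged by the statistical test, which is exactly the kind of issue hard-coded scripts tend to miss.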

The last straw: Big Data. The problem is exacerbated when multiple big-data platforms are involved. For example, transactions from source systems may be written both to an operational NoSQL database and to an HDFS-based (Hadoop) data repository used for reporting and analytics. In such a scenario, script-based solutions do not work cohesively enough to provide an end-to-end view. You are doomed from the beginning!
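
As a rough illustration of what an end-to-end check across two platforms involves, the Python sketch below reconciles two copies of the same transactions by key and by a content fingerprint. It assumes the records have already been read out of each store into lists of dictionaries; the store contents, the txn_id key and the function names are hypothetical, and connecting to a real NoSQL database or HDFS is deliberately left out.

```python
import hashlib
import json

def fingerprint(record):
    """Hash a record's content in a key-order-independent way."""
    canonical = json.dumps(record, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def reconcile(store_a, store_b, key):
    """Compare two record sets by key and report the differences."""
    a = {rec[key]: fingerprint(rec) for rec in store_a}
    b = {rec[key]: fingerprint(rec) for rec in store_b}
    return {
        "missing_in_b": sorted(a.keys() - b.keys()),
        "missing_in_a": sorted(b.keys() - a.keys()),
        "mismatched": sorted(k for k in a.keys() & b.keys() if a[k] != b[k]),
    }

# Hypothetical copies of the same transactions on two platforms.
nosql_copy = [
    {"txn_id": "T1", "amount": 100},
    {"txn_id": "T2", "amount": 250},
    {"txn_id": "T3", "amount": 75},
]
hdfs_copy = [
    {"txn_id": "T1", "amount": 100},
    {"txn_id": "T2", "amount": 205},  # digits transposed in transit
    {"txn_id": "T4", "amount": 60},
]

report = reconcile(nosql_copy, hdfs_copy, key="txn_id")
```

A record count comparison would report these two stores as consistent (three records each), while the keyed comparison surfaces one missing record on each side and one value mismatch.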


Boston Consulting Group++ reported that poor Data Integrity and Data Quality can cost organizations as much as 25% of their full potential when making decisions in marketing, fraud detection, pricing, and similar areas. Information Management magazine+++ recently identified poor quality of Big Data as the “horseshoe nail” that could lose wars. A lot of data arriving in different volumes and formats at high speed is worthless if that data is incorrect. Paying attention to the oft-forgotten issue of Data Integrity can save you millions!

Cost of poor Data Quality/Data Integrity: Poor quality of Big Data results in compliance failures, manual rework to fix errors, inaccurate insights, failed initiatives, and lost opportunity. The current focus in most big-data projects is on the ingestion, processing, and analysis of large volumes of data, so Data Integrity and Data Quality issues only start surfacing during the data analysis and operation phases. Our research estimates that, on average, 25-30% of any big-data project is spent on identifying and fixing data quality issues. In extreme scenarios where Data Quality issues are significant, projects get abandoned. That is a very expensive loss of capability!


Big Data has increasingly become a valuable asset for organizations. While it enables organizations to find the needle in the proverbial haystack, poor quality in the underlying data may produce misleading results. Current approaches for ensuring big-data quality are inadequate and full of operational challenges. There is an urgent need to adopt an enterprise approach for systematically validating the quality of big data across platforms.

Organizations should only consider Big Data Integrity validation solutions that can access data across multiple platforms (both small- and big-data platforms), parse a variety of data formats without transformations, and scale as well as the underlying big-data platform. They must support Cross Platform Data Profiling, Cross Platform Data Quality tests, Cross Platform Reconciliation, and Anomaly Detection. They must also integrate with other enterprise systems.


Contact– Jen:


+ DataBuck,

++ “How to Avoid the Big Bad Data Trap”, BCG Perspectives, June 2015

+++ “What is the ‘Horseshoe Nail’ of Big Data?”, Information Management, 2016