Angsuman Dutta
CTO, FirstEigen
3 Ways to Solve Last Mile Data Integrity Challenges With Advanced Observability
Data quality issues have been a long-standing challenge for data-driven organizations. Even with significant investments, the trustworthiness of data in most organizations is questionable at best. Gartner reports that companies lose an average of $14 million per year due to poor data quality.
Data observability has been all the rage in data management circles for a few years now and has been positioned as the panacea for all data quality problems. However, in our experience working with some of the world’s largest organizations, data observability has failed to live up to its promise.
The reason for this is simple: Data integrity problems are often caused by issues that occur at the “last mile” of the data journey – when data is transformed and aggregated for business or customer consumption.
In order to improve data quality effectively, data observability needs to detect the following three types of data errors:
- Metadata Errors (First Mile Problem): These include stale data, unexpected changes in record volume, and schema changes. Metadata errors are caused by incorrect or obsolete data, or by changes in the structure, volume, or profile of the data (a minimal sketch of such checks follows this list).
- Data Errors (Middle Mile Problem): These include record-level failures of completeness, conformity, uniqueness, and consistency, as well as anomalous values and violations of business-specific rules.
- Data Integrity Errors (Last Mile Problem): These include loss of data and loss of fidelity between the source and target systems.
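For concreteness, here is a minimal sketch of what such metadata checks might look like in practice. The orders table, its columns, and the thresholds are hypothetical assumptions for illustration, not the configuration of any particular observability product.

```python
import sqlite3
from datetime import datetime, timedelta

# Illustrative metadata checks (schema, record volume, freshness) against a
# hypothetical "orders" table. Column names and thresholds are assumptions
# made for this sketch.
EXPECTED_COLUMNS = {"order_id", "customer_id", "order_ts", "amount"}
MIN_EXPECTED_ROWS = 1_000
MAX_STALENESS = timedelta(hours=24)

def run_metadata_checks(conn: sqlite3.Connection) -> dict:
    cur = conn.cursor()

    # Schema check: compare observed columns with the expected set.
    cur.execute("PRAGMA table_info(orders)")
    observed_columns = {row[1] for row in cur.fetchall()}
    schema_ok = observed_columns == EXPECTED_COLUMNS

    # Volume check: flag unexpected drops in record count.
    cur.execute("SELECT COUNT(*) FROM orders")
    row_count = cur.fetchone()[0]
    volume_ok = row_count >= MIN_EXPECTED_ROWS

    # Freshness check: the most recent record should be recent enough,
    # assuming order_ts is stored as an ISO-8601 string.
    cur.execute("SELECT MAX(order_ts) FROM orders")
    latest = cur.fetchone()[0]
    freshness_ok = (
        latest is not None
        and datetime.utcnow() - datetime.fromisoformat(latest) <= MAX_STALENESS
    )

    return {"schema": schema_ok, "volume": volume_ok, "freshness": freshness_ok}

if __name__ == "__main__":
    # Tiny in-memory demo: the volume check is intentionally tripped here,
    # showing how a sudden drop in record count would surface.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, "
        "order_ts TEXT, amount REAL)"
    )
    conn.execute(
        "INSERT INTO orders VALUES (1, 7, ?, 99.5)",
        (datetime.utcnow().isoformat(),),
    )
    print(run_metadata_checks(conn))
```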
However, most data observability initiatives focus only on detecting metadata errors. As a result, they fail to detect data integrity errors – the errors that degrade the quality of financial, operational, and customer reporting. Data integrity errors can have a significant impact on the business, in terms of both cost and reputation.
Data integrity errors are most often caused by errors in the ETL process and incorrect transformation logic.
Current Approaches for Detecting Data Integrity Errors, and Their Challenges
Most data integrity initiatives leverage the following types of data integrity checks:
- Schema Checks: These check for schema and record-count mismatches between source and target systems. They are an effective and computationally cheap option when the data does not undergo any meaningful transformation, and they are generally used during migration projects or data refinement processes.
- Cell-by-Cell Matching: Data often undergoes transformations throughout its journey. In this scenario, data elements are matched cell by cell between the source and target systems to detect data loss or corruption.
- Aggregate Matching: Data is often aggregated or split for business or financial reporting. In this scenario, aggregated data elements are matched one-to-many between the source and target systems to detect data loss or corruption caused by aggregation errors (a simple sketch of both matching styles follows this list).
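As a rough illustration of cell-by-cell and aggregate matching, the sketch below reconciles a hypothetical source extract against a target reporting table using pandas. The loan_id and principal columns, and the deliberately drifted value, are invented for this example.

```python
import pandas as pd

# Hypothetical extracts: in practice these would come from the source system
# and the downstream reporting layer, respectively.
source = pd.DataFrame({
    "loan_id": [101, 102, 103],
    "principal": [250_000.0, 410_000.0, 175_000.0],
})
target = pd.DataFrame({
    "portfolio": ["A", "A", "B"],
    "loan_id": [101, 102, 103],
    "principal": [250_000.0, 410_000.0, 175_500.0],  # drifted value to detect
})

# Cell-by-cell matching: join on the business key and compare values row by row.
merged = source.merge(target, on="loan_id", suffixes=("_src", "_tgt"))
mismatches = merged[merged["principal_src"] != merged["principal_tgt"]]

# Aggregate matching: totals should still reconcile even after the target is
# split into portfolios for reporting.
aggregates_match = source["principal"].sum() == target["principal"].sum()

print(mismatches[["loan_id", "principal_src", "principal_tgt"]])
print("Aggregates reconcile:", aggregates_match)
```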
Most data teams experience the following operational challenges while implementing data integrity checks:
- The time it takes to analyze the data and consult subject matter experts to determine which rules need to be implemented for schema checks or cell-by-cell matching. This often involves replicating the transformation logic.
- Data must be moved from the source and target systems to the data integrity platform for matching, resulting in latency, increased compute cost, and significant security risks [2, 3].
Solution Framework
Data teams can overcome the operational challenges by leveraging machine learning-based approaches:
- Fingerprinting Technique: Traditional brute-force matching algorithms, which compare every source record against every target record, become computationally prohibitive when data volumes are large.
Fingerprinting mechanisms can determine whether two data sets are identical without comparing each record. A fingerprint is a small summary of a larger piece of information, and the key idea behind using fingerprints for data matching is that two pieces of information have the same fingerprint only if they are identical (up to a negligible probability of collision). Three widely used fingerprinting mechanisms are Bloom filters [1], MinHash, and Locality-Sensitive Hashing (LSH).
Fingerprinting techniques are computationally cost-effective and scale well. More importantly, they eliminate the need to move source and target data to another platform, because only small fingerprints need to be exchanged. A simplified sketch of this idea, combined with the immutable-field focus described next, follows this list.
- Immutable Field Focus: Cell-by-cell matching should focus only on immutable data elements – business-critical columns that do not change or lose their meaning because of transformation. For example, the total principal loan amount of a mortgage should remain unchanged between the source and target systems irrespective of the transformation. Matching all data fields requires replicating the transformation logic, which is time-consuming.
- Autonomous Profiling: Automatically identifying and selecting immutable fields helps data engineers focus on the most important data elements to match between the source and target systems. When these critical fields match successfully, it is likely that the entire record has been transformed correctly.
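To make the fingerprinting and immutable-field ideas concrete, here is a minimal sketch assuming each system can hash a handful of immutable columns locally and exchange only the resulting digests. The column names and the simple XOR-of-hashes scheme are illustrative assumptions, and the scheme is far cruder than the Bloom filter, MinHash, or LSH mechanisms mentioned above.

```python
import hashlib

# Immutable, business-critical columns assumed for this sketch; a correct
# transformation should never change these values between source and target.
IMMUTABLE_FIELDS = ("loan_id", "principal", "origination_date")

def record_digest(record: dict) -> int:
    """Hash only the immutable fields of a single record."""
    payload = "|".join(str(record[field]) for field in IMMUTABLE_FIELDS)
    return int.from_bytes(hashlib.sha256(payload.encode()).digest()[:8], "big")

def dataset_fingerprint(records) -> int:
    # XOR of per-record digests yields an order-independent fingerprint, so
    # the source and target systems can each compute it locally and exchange
    # a single integer instead of moving the data itself.
    fingerprint = 0
    for record in records:
        fingerprint ^= record_digest(record)
    return fingerprint

source_rows = [
    {"loan_id": 101, "principal": 250000.0, "origination_date": "2023-01-15"},
    {"loan_id": 102, "principal": 410000.0, "origination_date": "2023-02-01"},
]
target_rows = [  # same immutable content, different order
    {"loan_id": 102, "principal": 410000.0, "origination_date": "2023-02-01"},
    {"loan_id": 101, "principal": 250000.0, "origination_date": "2023-01-15"},
]

# Identical immutable content yields identical fingerprints; a mismatch
# signals data loss or corruption worth investigating.
print(dataset_fingerprint(source_rows) == dataset_fingerprint(target_rows))
```

Because each side computes its fingerprint independently, only two small digests ever cross system boundaries, and a mismatch simply triggers a deeper investigation. This simplified XOR scheme cannot distinguish data sets that differ only in duplicated records, which is one reason production systems favor the richer mechanisms named above.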
Conclusion
So, is data observability the silver bullet for all data quality problems? In short, no. However, if you are experiencing data integrity issues at the “last mile” of your data journey, it is worth building a data observability framework that detects not only metadata errors but also data errors and data integrity errors. Automated machine learning can be leveraged to eliminate the operational challenges associated with traditional data integrity approaches.
References
1. https://arxiv.org/pdf/1901.01825.pdf
2. https://datafold.com/blog/open-source-data-diff
3. https://firsteigen.com/blog/how-to-ensure-data-quality-in-your-data-lakes-pipelines-and-warehouses/
Contact FirstEigen today to learn more about data quality and data trustability.
FAQs
What are the most common causes of data integrity issues?
Common causes of data integrity issues include human errors during data entry, incomplete or outdated records, technical glitches, and cybersecurity threats. Data corruption during transfer, lack of proper validation checks, and improper system configurations also contribute to data integrity problems.
How can organizations prevent data integrity loss?
To prevent data integrity loss, implement strict validation rules, perform regular data audits, and ensure proper encryption and security measures. Utilize data observability tools to monitor your pipeline for anomalies in real time and set up alerts to address issues before they escalate.
What are the signs of data integrity issues in observability tools?
Signs of data integrity issues in observability tools include data mismatches between source and target systems, unexpected data anomalies, missing records, and inconsistent reporting outcomes. Frequent error notifications and discrepancies in key data fields can also be red flags.
When do data integrity issues most often occur?
Data integrity issues frequently occur during data transfers, system migrations, or when integrating new tools with legacy systems. They can also arise when dealing with large-scale data transformations or when insufficient validation and error-checking mechanisms are in place.
How does First Mile Observability help prevent data integrity problems?
First Mile Observability focuses on monitoring the initial stages of the data pipeline, ensuring data is accurate, complete, and correctly formatted right from the start. By catching issues early, it helps prevent downstream errors that could escalate into more significant integrity problems.
How can businesses use data observability to track data quality?
Businesses can use data observability to track data quality by setting up monitoring systems that detect anomalies, missing records, and inconsistencies. These tools provide real-time alerts and visibility into the data lifecycle, allowing businesses to address potential integrity issues promptly.
What is the difference between First Mile and Last Mile Observability?
First Mile Observability monitors the initial stages of data ingestion and processing, ensuring data accuracy at the source. Last Mile Observability focuses on data transformation and aggregation, ensuring data is intact and correct by the time it reaches its final destination.
How does DataBuck ensure data integrity?
DataBuck ensures data integrity by using machine learning to monitor and validate data in real time, flagging anomalies, missing data, or discrepancies. It automates the data matching process, reducing manual errors and ensuring data remains consistent throughout the pipeline.