Digital image representing Informatica data quality.

Angsuman Dutta

CTO, FirstEigen

3 Ways to Solve Last Mile Data Integrity Challenges With Advanced Observability

Posted Nov 14, 2022

Table of Content

Data quality issues have been a long-standing challenge for data-driven organizations. Even with significant investments, the trustworthiness of data in most organizations is questionable at best. Gartner reports that companies lose an average of $14 million per year due to poor data quality.

Data observability has been all the rage in data management circles for a few years now and has been positioned as the panacea to all data quality problems. However, in our experience working with some of the world’s largest organizations, data observability has failed to live up to its promise.

The reason for this is simple: Data integrity problems are often caused by issues that occur at the “last mile” of the data journey – when data is transformed and aggregated for business or customer consumption.

In order to improve data quality effectively, data observability needs to detect the following three types of data errors:

Metadata Error (First Mile Problem): This includes things like detecting freshness, changes in record volume, and schema changes. Metadata errors are caused by incorrect or obsolete data, changes in the structure of the data, a change in the volume of the data, or a change in the profile of the data.
Data Error (Middle Mile Problem): This includes things like detecting record-level completeness, conformity, uniqueness, anomaly, consistency, and violation of business-specific rules.
Data Integrity Error (Last Mile Problem): This includes things like detecting loss of data and loss of fidelity between source and target system.

However, most data observability projects/initiatives only focus on detecting metadata errors. As a result, these initiatives fail to detect data integrity errors – which impact the quality of the financial, operational, and customer reporting. Data integrity errors can have a significant impact on business, both in terms of cost and reputation.

Data integrity errors are most often caused by errors in the ETL process and incorrect transformation logic.

Current Approach for Detecting Data Integrity Errors and Challenges

Most data integrity initiatives leverage the following types of data integrity checks:

Schema Checks: It checks for schema and record count mismatches between source and target systems. This is a very effective and computationally cheap option when data does not undergo any meaningful transformations. Generally used during migration projects or data refinement processes as depicted in the data flow.
Cell by Cell Matching: Data often undergoes transformations throughout its journey. In this scenario, cell-by-cell matching of data elements between the source and target system is done to detect data loss or data corruption issues.
Aggregate Matching: Data is often aggregated or split for business or financial reporting purposes. In this scenario, one-to-many matching of the aggregated data elements between the source and target system is done to detect data loss or data corruption issues due to aggregation errors.

Most data teams experience the following operational challenges while implementing data integrity checks:

Time it takes to analyze data and consult the subject matter experts to determine what rules need to be implemented for a schema check or cell-by-cell matching. This often involves the replication of transformation logic.
Data needs to be moved from the source system and target system to the data integrity platform for matching resulting in latency, increased compute cost, and significant security risks. [2,3]

Solution Framework

Data teams can overcome the operational challenges by leveraging machine learning-based approaches:

Finger Printing Technique: Traditional brute force data matching algorithms becomes computationally prohibitive to match all source records with all target records when the data volume is large.

Fingerprinting mechanisms can be used to identify if two data sets are identical without the need to compare each record in the data set. A fingerprint is a small summary of a larger piece of information. The key idea behind using fingerprints for data matching is that two pieces of information will have the same fingerprint if and only if they are identical. There are three types of advanced fingerprinting mechanisms – Bloom filters [1], Min-Hash, and Locality Sensitive Hashing (LSH).

Fingerprinting techniques are computationally cost-effective and do not suffer from scalability problems. More importantly, fingerprinting technique eliminates the need to move the source and target system data to another platform.

Immutable Field Focus: Cell-by-cell matching should focus only on immutable data elements – business-critical columns that do not change or lose their meaning because of transformation. For example, the total principal loan amount of a mortgage should remain unchanged between the source and target system irrespective of the transformation. Matching all data fields requires replication of the transformation logic which is time-consuming.
Autonomous Profiling: Autonomous means for identifying and selecting immutable fields help data engineers focus on the most important data elements that need to be matched between the source and target system. When these critical fields are matched successfully, it is likely that the entire record has been transformed correctly.

Conclusion

So, is data observability the silver bullet for all data quality problems? In short, no. However, if you are experiencing data integrity issues at the “last mile” of your data journey, it is worth building a data observability framework that detects not only metadata errors but also data errors and data integrity errors. Automated machine learning can be leveraged to eliminate the operational challenges associated with traditional data integrity approaches.

arxiv.org/pdf/1901.01825.pdf
datafold.com/blog/open-source-data-diff

Contact FirstEigen today to learn more about data quality and data trustability.

Check out these articles on Data Trustability, Observability & Data Quality Management-

FAQs

What are the common causes of data integrity issues?

Common causes of data integrity issues include human errors during data entry, incomplete or outdated records, technical glitches, and cybersecurity threats. Data corruption during transfer, lack of proper validation checks, and improper system configurations also contribute to data integrity problems.

How do I prevent loss of data integrity in my data pipeline?

What are the signs of data integrity problems in observability tools?

When do data integrity issues occur most often?

How does First Mile Observability help prevent data issues?

How can businesses use data observability to monitor data integrity?

What’s the difference between First Mile and Last Mile Observability?

What role does DataBuck play in solving data integrity issues?

Discover How Fortune 500 Companies Use DataBuck to Cut Data Validation Costs by 50%

Schedule DataBuck Demo Today!

The Power of Data Quality for AI Success

June 26, 2026

AI agents are only as reliable as the data they act on. As enterprises race to deploy AI, data quality has quietly become the deciding factor between success and costly failure. The Problem Nobody Is Solving Most AI conversations…

Learn more

Mainframe Data Reconciliation for Cloud Migration

June 23, 2026

Cloud migration is no longer just an infrastructure decision. For data leaders and data engineers, it is a trust decision. …

Learn more

What Do Failed AI Projects Have in Common?

June 2, 2026

Most AI failures are not model failures — they are data, governance, operational trust, and weak AI-ready foundations. “AI alone is not the solution – trusted, validated, continuously governed data is the…

Learn more

Bad Data Is Costing
You More Than You Think

See how DataBuck helps modern enterprises prevent data errors and scale analytics with confidence.

Book a Demo