Data operations and engineering teams spend 30-40% of their time firefighting data issues raised by business stakeholders.
A large percentage of these errors either originate in the source system or are introduced in the data pipeline, and many of them could have been detected in the pipeline.
Current data validation approaches for the data pipeline are rule-based: they establish data quality rules for one data asset at a time. As a result, implementing them across thousands of data assets, buckets, or containers is costly. This dataset-by-dataset focus often leads to an incomplete set of rules, or to no rules being implemented at all.
With the accelerating adoption of AWS Glue as the data pipeline framework of choice, validating data in the pipeline in real time has become critical for efficient data operations and for delivering accurate, complete, and timely information.
This blog provides a brief introduction to DataBuck and outlines how to build a robust AWS Glue data pipeline that validates data as it moves along the pipeline.
What is DataBuck?
DataBuck is an autonomous data validation solution purpose-built for validating data in the pipeline. It establishes a data fingerprint for each dataset using its ML algorithm. It then validates the dataset against the fingerprint to detect erroneous transactions. More importantly, it updates the fingerprints as the dataset evolves over time, which reduces the effort of maintaining the rules.
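To make the fingerprint idea concrete, here is a minimal sketch in plain Python. It is not DataBuck's ML algorithm, which this post does not describe. It assumes a deliberately simple fingerprint (the mean and standard deviation of one numeric column), and the names `Fingerprint`, `learn`, `find_outliers`, and `update` are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Fingerprint:
    """Illustrative profile of one numeric column (an assumption, not DataBuck's model)."""
    mean: float
    std: float


def learn(values):
    """Establish a fingerprint from historical values."""
    return Fingerprint(mean(values), pstdev(values))


def find_outliers(fp, values, k=3.0):
    """Return values that lie more than k standard deviations from the fingerprint mean."""
    tolerance = k * fp.std
    return [v for v in values if abs(v - fp.mean) > tolerance]


def update(fp, values, weight=0.2):
    """Blend a new batch into the fingerprint so it follows gradual change in the data."""
    new = learn(values)
    return Fingerprint(
        (1 - weight) * fp.mean + weight * new.mean,
        (1 - weight) * fp.std + weight * new.std,
    )


history = [100.0, 102.0, 98.0, 101.0, 99.0]
fp = learn(history)

batch = [100.5, 97.0, 250.0]
suspect = find_outliers(fp, batch)            # values that break the learned profile
clean = [v for v in batch if v not in suspect]
fp = update(fp, clean)                        # fold only the accepted values back in
```

The point of the sketch is the loop of learn, validate, and update: nobody writes a threshold for the column by hand, and the profile shifts as accepted data arrives.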
DataBuck primarily solves two problems:
- Data engineers can incorporate data validations into their data pipeline by calling a few Python libraries. They do not need prior knowledge of the data and its expected behaviors (i.e., the data quality rules).
- Business stakeholders can view and control auto-discovered rules and thresholds as part of their compliance requirements. In addition, they will be able to access the complete audit trail regarding the quality of the data over time.
DataBuck leverages machine learning to validate the data through the lens of standardized data quality dimensions as shown below:
- Freshness: determine if the data has arrived within the expected time of arrival.
- Completeness: determine the completeness of contextually important fields. Contextually important fields are identified using mathematical algorithms.
- Conformity: determine conformity to a pattern, length, or format of contextually important fields.
- Uniqueness: determine the uniqueness of the individual records.
- Drift: determine the drift of the key categorical and continuous fields from the historical information.
- Anomaly: determine volume and value anomalies of critical columns.
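As a rough illustration of what three of these dimensions measure, the sketch below computes freshness, completeness, and uniqueness by hand in plain Python. It is not how DataBuck evaluates them; the function names, the sample rows, and the choice of `email` and `id` as the fields of interest are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def is_fresh(arrived_at, expected_by, grace=timedelta(0)):
    """Freshness: did the data arrive by the expected time (plus an optional grace period)?"""
    return arrived_at <= expected_by + grace


def completeness(rows, field):
    """Completeness: fraction of rows where `field` is present and non-empty."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(field) not in (None, ""))
    return filled / len(rows)


def uniqueness(rows, key_fields):
    """Uniqueness: fraction of rows that carry a distinct key."""
    if not rows:
        return 1.0
    keys = {tuple(row.get(f) for f in key_fields) for row in rows}
    return len(keys) / len(rows)


# Hypothetical sample data.
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},
    {"id": 2, "email": "b@example.com"},
    {"id": 3, "email": None},
]

email_completeness = completeness(rows, "email")   # 2 of 4 rows have an email
id_uniqueness = uniqueness(rows, ["id"])           # 3 distinct ids across 4 rows
on_time = is_fresh(
    arrived_at=datetime(2024, 1, 1, 6, 10),
    expected_by=datetime(2024, 1, 1, 6, 0),
    grace=timedelta(minutes=15),
)
```

Writing such checks by hand means choosing the fields and thresholds for every dataset, which is the per-dataset rule effort described earlier.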
Setting up DataBuck for Glue
Using DataBuck within a Glue job is a three-step process, as shown in the following diagram:
Step 1: Authenticate and Configure DataBuck
Step 2: Execute DataBuck
Step 3: Analyze the result for the next step
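The three steps can be sketched as a quality gate inside a job script. DataBuck's actual Python API is not shown in this post, so everything below is a hypothetical stand-in: `ValidationClient`, `StubClient`, `ValidationResult`, the 0 to 100 `score`, the `min_score` threshold, and the S3 paths are assumptions made for illustration. A stub replaces the real service so the example runs anywhere.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class ValidationResult:
    dataset: str
    score: float  # hypothetical overall data quality score, 0 to 100


class ValidationClient(Protocol):
    """Hypothetical interface; it does not reflect DataBuck's real API."""

    def authenticate(self, api_key: str) -> None: ...
    def validate(self, dataset: str) -> ValidationResult: ...


class StubClient:
    """In-memory stand-in that returns preset scores instead of calling a service."""

    def __init__(self, scores):
        self._scores = scores
        self.authenticated = False

    def authenticate(self, api_key):
        self.authenticated = bool(api_key)

    def validate(self, dataset):
        if not self.authenticated:
            raise PermissionError("authenticate before validating")
        return ValidationResult(dataset, self._scores[dataset])


def run_quality_gate(client, api_key, dataset, min_score=90.0):
    client.authenticate(api_key)        # Step 1: authenticate and configure
    result = client.validate(dataset)   # Step 2: execute the validation
    # Step 3: analyze the result and decide what the job does next
    return "continue" if result.score >= min_score else "quarantine"


# Hypothetical datasets and scores.
client = StubClient({
    "s3://example-bucket/orders/": 96.5,
    "s3://example-bucket/returns/": 71.0,
})
decisions = {
    name: run_quality_gate(client, "dummy-key", name)
    for name in ("s3://example-bucket/orders/", "s3://example-bucket/returns/")
}
```

In a real Glue job, the decision in Step 3 would typically control whether the downstream transform runs, or whether the batch is held back and an alert is raised.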
Business Stakeholder Visibility
In addition to providing programmatic access to validate AWS datasets within the Glue job, DataBuck provides the following results for compliance and audit trail purposes:
1. Data Quality of a Schema Over Time
2. Summary Data Quality Results of Each Table
3. Detailed Data Quality Results of Each Table
4. Business Self-service for controlling the rules
DataBuck provides a secure and scalable approach to validating data within the Glue job. All it takes is a few lines of code, and you can validate the data on an ongoing basis. More importantly, your business stakeholders will have full visibility into the underlying rules and can control the rules and rule thresholds using a business-user-friendly dashboard.