Enhance AWS Glue Pipeline with Autonomous Data Validation


    Data operations and engineering teams spend 30-40% of their time firefighting data issues raised by business stakeholders.

    A large percentage of these data errors can be attributed to errors present in the source system, or to errors that occurred in the data pipeline or could have been detected there.

    Current data validation approaches for the data pipeline are rule-based: they establish data quality rules for one data asset at a time. As a result, implementing them for thousands of data assets, buckets, or containers is costly. This dataset-by-dataset focus often leads to an incomplete set of rules, or to no rules being implemented at all.

    With the accelerating adoption of AWS Glue as the data pipeline framework of choice, validating data in the pipeline in real time has become critical for efficient data operations and for delivering accurate, complete, and timely information.

    This blog provides a brief introduction to DataBuck and outlines how to build a robust AWS Glue data pipeline to validate data as data moves along the pipeline.

    What is DataBuck?

    DataBuck is an autonomous data validation solution purpose-built for validating data in the pipeline. It uses its ML algorithm to establish a data fingerprint for each dataset, then validates the dataset against that fingerprint to detect erroneous transactions. More importantly, it updates the fingerprint as the dataset evolves over time, which reduces the effort of maintaining the rules.
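
    To make the fingerprint idea concrete, here is a minimal sketch in plain Python. It is not DataBuck's algorithm, which the article does not describe. It assumes a fingerprint is just the mean and standard deviation of one numeric column, and that a value is suspicious when it sits more than three standard deviations from the mean. All function names are hypothetical.

```python
from statistics import mean, stdev

def build_fingerprint(history):
    """Summarize historical numeric values as a simple statistical profile."""
    return {"mean": mean(history), "stdev": stdev(history), "count": len(history)}

def find_outliers(fingerprint, batch, z_threshold=3.0):
    """Return the batch values that sit far from the historical mean."""
    mu, sigma = fingerprint["mean"], fingerprint["stdev"]
    if sigma == 0:
        # No historical variation: anything different from the mean is flagged.
        return [v for v in batch if v != mu]
    return [v for v in batch if abs(v - mu) / sigma > z_threshold]

def update_fingerprint(history, batch, outliers):
    """Fold accepted values into the history so the profile tracks the data."""
    accepted = [v for v in batch if v not in outliers]
    return build_fingerprint(history + accepted)

# Illustrative data only.
history = [100.0, 102.0, 98.0, 101.0, 99.0]
fingerprint = build_fingerprint(history)

batch = [100.5, 97.0, 250.0]
outliers = find_outliers(fingerprint, batch)
new_fingerprint = update_fingerprint(history, batch, outliers)

print("Flagged values:", outliers)
print("Updated fingerprint:", new_fingerprint)
```

    Because the profile is rebuilt from accepted values after every batch, the threshold follows the data without anyone editing a rule by hand.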

    DataBuck primarily solves two problems:

    1. Data engineers can incorporate data validations into their data pipeline by calling a few Python libraries. They do not need an a priori understanding of the data and its expected behaviors (i.e., data quality rules).
    2. Business stakeholders can view and control auto-discovered rules and thresholds as part of their compliance requirements. In addition, they can access the complete audit trail of the data's quality over time.

    DataBuck leverages machine learning to validate the data through the lens of standardized data quality dimensions as shown below:

    1. Freshness: determine whether the data has arrived within the expected time of arrival.
    2. Completeness: determine the completeness of contextually important fields. Contextually important fields are identified using mathematical algorithms.
    3. Conformity: determine whether contextually important fields conform to a pattern, length, and format.
    4. Uniqueness: determine the uniqueness of the individual records.
    5. Drift: determine how far the key categorical and continuous fields have drifted from historical information.
    6. Anomaly: determine volume and value anomalies in critical columns.
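
    Three of these dimensions are simple enough to sketch as ratios in plain Python. This is an illustration of what the dimensions measure, not DataBuck's implementation. The sample records, the function names, and the email pattern are all assumptions made for the example.

```python
import re

# Illustrative records; field names are made up for this sketch.
records = [
    {"id": 1, "email": "a@example.com", "amount": 10.0},
    {"id": 2, "email": None, "amount": 12.5},
    {"id": 2, "email": "not-an-email", "amount": 11.0},
    {"id": 3, "email": "c@example.com", "amount": 9.5},
]

# Deliberately loose pattern: something@something.something
EMAIL_PATTERN = r"[^@\s]+@[^@\s]+\.[^@\s]+"

def completeness(rows, field):
    """Fraction of rows where the field is present and non-empty."""
    filled = sum(1 for r in rows if r.get(field) not in (None, ""))
    return filled / len(rows)

def conformity(rows, field, pattern):
    """Fraction of non-empty values that fully match the regex pattern."""
    values = [r[field] for r in rows if r.get(field) not in (None, "")]
    if not values:
        return 1.0
    matches = sum(1 for v in values if re.fullmatch(pattern, v))
    return matches / len(values)

def uniqueness(rows, key):
    """Ratio of distinct key values to total rows."""
    keys = [r[key] for r in rows]
    return len(set(keys)) / len(keys)

print("completeness(email):", completeness(records, "email"))
print("conformity(email):  ", conformity(records, "email", EMAIL_PATTERN))
print("uniqueness(id):     ", uniqueness(records, "id"))
```

    In a rule-based setup, someone would pick these fields and thresholds by hand for each dataset. The approach described in this article is to discover them automatically.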

    How does DataBuck for AWS Glue work?

    In DataBuck, the user provides Snowflake connection information along with the database details and triggers the continuous data validation process. Once the process is activated, DataBuck sends its ML engine to Snowflake to analyze the data and identify data quality issues. Summary results are then presented to the user through the web console. At no point in this process does the user need to write rules or move data out of Snowflake.

    Setting up DataBuck for Glue

    Using DataBuck within a Glue job is a three-step process, as shown in the following diagram:

    Step 1: Authenticate and Configure DataBuck

    Step 2: Execute DataBuck

    Step 3: Analyze the result for the next step
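
    The article does not show the DataBuck calls themselves, so the following is only a sketch of how the three steps might be wired together in a job script. Every name here is hypothetical: the `config` dictionary, `run_validation`, `ValidationResult`, the 0 to 100 score, and the 90-point threshold are assumptions, and `run_validation` returns a hard-coded result so the sketch runs without any external service.

```python
from dataclasses import dataclass, field

@dataclass
class ValidationResult:
    """Hypothetical result shape: an overall score (0-100) plus failed checks."""
    score: float
    failed_checks: list = field(default_factory=list)

def run_validation(dataset_name, config):
    """Stand-in for Step 2. A real job would call the validation service here."""
    # Hard-coded result so the sketch is runnable on its own.
    return ValidationResult(score=82.0, failed_checks=["completeness:email"])

def decide_next_step(result, min_score=90.0):
    """Step 3: pick the pipeline's next action from the validation result."""
    if result.score >= min_score:
        return "continue"
    return "quarantine"

# Step 1 (hypothetical): connection and credential settings for the validator.
config = {"endpoint": "https://databuck.example.invalid", "token": "<secret>"}

# Step 2: validate the dataset the job has just produced.
result = run_validation("sales_orders", config)

# Step 3: branch on the outcome instead of loading bad data downstream.
action = decide_next_step(result)
print(f"score={result.score}, failed={result.failed_checks}, action={action}")
```

    Keeping the decision in its own function means the job can choose to fail, quarantine the batch, or only raise an alert without changing how validation is invoked.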

    Business Stakeholder Visibility

    In addition to providing programmatic access for validating AWS datasets within the Glue job, DataBuck provides the following results for compliance and audit trails:

    1. Data Quality of a Schema Over Time

    2. Summary Data Quality Results of Each Table

    3. Detailed Data Quality Results of Each Table

    4. Detailed Data Profile of Each Table

    5. Discovered Data Quality Rules for Each Table


    DataBuck provides a secure and scalable approach to validating data within a Glue job. With a few lines of code, you can validate data on an ongoing basis. More importantly, your business stakeholders will have full visibility into the underlying rules and can control the rules and their thresholds through a business-user-friendly dashboard.

