Cloud Data Pipeline Leaks: Challenge of Data Quality in the Cloud

Author

Joe Hilleary

https://www.linkedin.com/in/joseph-hilleary-501ab9156/

Research Analyst at Eckerson Group

https://www.eckerson.com/

Summary

Organizations, especially those in financial services, struggle to ensure data quality in the cloud. Data pipelines often drop rows because of issues with infrastructure that companies don’t control, and the scale at which these organizations process data makes manual methods of identifying these leaks insufficient.

Introduction

The cloud has revolutionized the ability of small companies to process large amounts of data. In data-intensive industries like financial services, it helps boutique firms go toe-to-toe with the traditional goliaths and their enormous data centers. At the same time, it has created new challenges for teams trying to maintain data quality.

Leaky Pipelines Are the Top Data Quality Issue in the Cloud (AWS, Azure, GCP and Snowflake) for Financial Services Firms

At small financial services firms, almost all data comes from external sources—credit bureaus, data vendors, governments, and other service providers. It arrives in massive quantities, often in near real time. Operational or transactional databases capture this data in the short term before companies can move it to their cloud-based analytics environment. In between, data might pass through a data lake or other intermediate repositories. Handling these data flows requires running hundreds or even thousands of extract, transform, load (ETL) jobs every day.

Every move represents an opportunity for systemic data quality issues to arise (“Systems Risks”). These errors stem from problems with technology systems (varying types and versions of infrastructure, data repositories, applications, etc.) and are independent of any particular line of business. Miscommunications between applications can result in the duplication, corruption, or even omission of data in systems down the line. Of these, the most common, and therefore most impactful, for financial services firms is missing data.

Pipelines drop data when they get out of sync. For instance, one data leader I spoke with discussed the specific challenge of moving Salesforce data into Snowflake. His Salesforce system can only process five jobs at a time and locks tables while it does so. This prevents ETL processes from accessing the data, causing them to drop those rows. Even with the new official Salesforce-Snowflake connector, he still loses millions of rows, which creates a headache for downstream analytics.

Pipelines also fall out of sync when network links go down or degrade. In the cloud (AWS, Azure, GCP, Snowflake and others), organizations have no control over remediation: when something goes wrong, they must wait for the provider to fix it. Multiply these issues across every external provider feeding data into a firm’s environment and you begin to realize the scale of the problem.

The Cost of Poor Cloud Data Quality

These pipeline leaks represent a real threat to organizations’ bottom lines. Depending on the use case, firms sometimes require more than 99.999% accuracy to meet downstream service-level agreements (SLAs) or financial reporting requirements. Even for less critical use cases like internal marketing analytics, missing more than 0.25% of the data can undermine the validity of analyses.
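
To make these thresholds concrete, the sketch below works out how many dropped rows each tolerance allows for a given load. It is only an illustration: it assumes "accuracy" can be read as row completeness (rows loaded divided by rows sent), and the function names and the ten-million-row load are hypothetical.

```python
from fractions import Fraction

# Illustrative thresholds taken from the figures quoted above.
# Fractions are used so the arithmetic is exact rather than floating point.
FIVE_NINES = Fraction(99999, 100000)   # 99.999% of rows must arrive
MARKETING = Fraction(9975, 10000)      # tolerates up to 0.25% missing rows


def max_missing_rows(source_rows: int, required_completeness: Fraction) -> int:
    """Largest number of dropped rows that still satisfies the threshold."""
    return int(source_rows * (1 - required_completeness))


def meets_threshold(source_rows: int, loaded_rows: int,
                    required_completeness: Fraction) -> bool:
    """True when the share of rows that arrived is at or above the threshold."""
    if source_rows == 0:
        return True  # nothing was sent, so nothing can be missing
    return Fraction(loaded_rows, source_rows) >= required_completeness


if __name__ == "__main__":
    sent = 10_000_000  # hypothetical daily load
    print(max_missing_rows(sent, FIVE_NINES))  # 100
    print(max_missing_rows(sent, MARKETING))   # 25000
```

Under these assumptions, a ten-million-row load can lose at most 100 rows at the stricter threshold, which is far too few to spot by eye.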

Unless they’re able to identify and control pipeline leaks, organizations can face regulatory penalties, miss business opportunities, and lose their competitive edge. What seems like a mundane issue can lead to hundreds of thousands of dollars in fines. Data quality is a bit like eating vegetables: not necessarily pleasant, but critical to maintaining the health of the enterprise.

Identifying Leaks at Scale

Headcount has nothing to do with data scale; even small firms handle enormous quantities of data. As a result, catching pipeline leaks becomes a significant challenge. It often requires row-by-row reconciliation of millions of rows of data and the application of hundreds of data quality rules. Larger companies might have the personnel to take a more manual approach, but small firms don’t have that luxury. For them, catching data quality issues requires automation.
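
As a rough illustration of what row-by-row reconciliation involves, the sketch below compares a source extract with its loaded copy by key and by a hash of each row's contents. It is a minimal in-memory example, not a description of any vendor's product: the `id` key column, the sample rows, and the function names are all assumptions, and a real pipeline would run the comparison inside the warehouse or in batches.

```python
import hashlib
import json


def row_fingerprint(row: dict) -> str:
    """Stable hash of a row's contents, independent of column order."""
    payload = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def reconcile(source_rows, target_rows, key="id"):
    """Compare two sets of rows and report leaks and corruption by key."""
    source = {row[key]: row_fingerprint(row) for row in source_rows}
    target = {row[key]: row_fingerprint(row) for row in target_rows}
    return {
        # Rows the source sent that never arrived (the "leak").
        "missing": sorted(source.keys() - target.keys()),
        # Rows in the target with no counterpart in the source.
        "unexpected": sorted(target.keys() - source.keys()),
        # Rows present on both sides whose contents differ.
        "changed": sorted(k for k in source.keys() & target.keys()
                          if source[k] != target[k]),
    }


if __name__ == "__main__":
    sent = [{"id": 1, "amt": 10}, {"id": 2, "amt": 20}, {"id": 3, "amt": 30}]
    loaded = [{"id": 1, "amt": 10}, {"id": 3, "amt": 31}]
    print(reconcile(sent, loaded))
    # {'missing': [2], 'unexpected': [], 'changed': [3]}
```

Even this small version shows why the manual approach does not scale: every row on both sides has to be read and compared for each of the hundreds or thousands of daily jobs.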

Thankfully, new tools have emerged to reduce the burden of writing data quality rules. Instead of relying on human teams to craft rules, automated data quality platforms such as FirstEigen’s DataBuck (https://www.firsteigen.com/DataBuck) use machine learning to analyze data flows and generate rules. Humans only need to review and refine the rules the software produces. This technique frees up data teams to focus on fixing data quality issues instead of flagging them.
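
The idea of deriving rules from the data itself can be illustrated with a much simpler stand-in than machine learning. The sketch below proposes a row-count rule from past loads using the mean and standard deviation, then checks a new load against it. This is not how DataBuck or any other product works internally; the function names, the tolerance of three standard deviations, and the sample history are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def propose_row_count_rule(history, tolerance=3.0):
    """Suggest an expected row-count range from past loads.

    Needs at least two historical values, because the sample standard
    deviation is undefined for a single observation.
    """
    center = mean(history)
    spread = stdev(history)
    return (center - tolerance * spread, center + tolerance * spread)


def violates(rule, observed):
    """True when the observed row count falls outside the proposed range."""
    low, high = rule
    return not (low <= observed <= high)


if __name__ == "__main__":
    past_loads = [1000, 1010, 990, 1005, 995]  # hypothetical daily row counts
    rule = propose_row_count_rule(past_loads)
    print(violates(rule, 1002))  # False: within the usual range
    print(violates(rule, 700))   # True: likely a leak, worth a human review
```

A person would still review the proposed range before trusting it, which mirrors the review-and-refine step described above.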

By necessity, these modern tools also tend to be more cloud-oriented than previous generations of data quality platforms. Vendors have designed them from the ground up to work with modern cloud-based data stacks. As a result, they integrate with cloud data sources more smoothly than traditional data quality tools built for on-premises deployments.

Takeaway

The cloud is a game changer for smaller businesses that handle large quantities of data. But the same features that make it appealing, namely outsourced infrastructure and maintenance, can lead to data quality errors due to systems risks when pipelines get out of sync and drop data. To detect these errors at scale and prevent negative downstream consequences, businesses need automated data quality tools that integrate with their cloud environments.