How to automate continuous Data Trustability for Cloudera Data Lake

Angsuman Dutta, CTO, FirstEigen, Chicago, IL, USA 

Kenton Troy Davis, Manager, Partner Solutions Engineering, Cloudera, Washington DC, USA

In today’s digital economy, data is arguably the most valuable asset for businesses. The phrase “data is the new oil” has grown from being a catchphrase to a reality that many companies have embraced. However, as with oil, the value of data lies not just in its presence, but in its quality, refinement, and trustworthiness. In the realm of vast data repositories like data lakes, establishing trust in data becomes both vital and challenging. 

As organizations pivot to platforms like Cloudera Data Lake, especially in sectors like finance, healthcare, and pharmaceuticals, a new set of challenges and opportunities emerges. The inclination towards Cloudera’s private and hybrid cloud offerings, driven by data security concerns associated with public cloud data lakes, has made the quest for data trustability even more pertinent.

This article delves deep into the concept of establishing autonomous data trustability for Cloudera Data Lake. 

Understanding the Cloudera Landscape 

Firstly, it’s essential to understand why Cloudera has become a platform of choice for industries dealing with sensitive data. Financial, healthcare, and pharmaceutical industries have always been at the forefront of regulatory scrutiny due to the sensitive nature of their data. The growing concerns about data breaches, especially in public cloud data lakes, have made Cloudera’s private and hybrid cloud offerings appealing. These platforms offer more granular control over data, tailored security measures, and the flexibility to meet varied regulatory requirements. 

Lessons in Data Validation 

Some core challenges observed include: 

Rule-based Conundrums: Traditional validation, based on preset rules, does not scale. As data grows, managing and maintaining these rules becomes unwieldy.

Architectural Hurdles: Data transfer between validation tools and the main repository presents latency and security issues. 

Knowledge Barriers: Analysts often operate in knowledge silos, requiring consultations with experts, which can be time-consuming and sometimes impractical. 

Enter DataBuck, which uses machine learning to automate validation and establish an objective “Data Trust Score” (DTS). This approach is promising for several reasons:

Machine Learning Scalability: ML models can adapt and scale as data grows, without linear increases in resources. 

Objective Trust Metrics: An algorithmic trust score eliminates human biases, offering a consistent metric across datasets. 

Holistic Validation Criteria: DataBuck’s trust score encompasses freshness, completeness, conformity, and more, providing a comprehensive validation framework (a simplified illustration of such a composite score follows this list).
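DataBuck’s scoring model itself is ML-driven and proprietary, so the short Python sketch below is only a rough illustration of the idea behind a Data Trust Score: rate a dataset on a few dimensions (completeness, conformity, freshness) and collapse them into a single 0–100 number. The dimension checks, weights, and thresholds here are illustrative assumptions, not DataBuck’s actual algorithm.

# Illustrative only: a naive, weighted "trust score" over a few quality
# dimensions. DataBuck's actual ML-based scoring is proprietary; the checks
# and weights below are assumptions made for the sake of the example.
from datetime import datetime, timezone

def dimension_scores(rows, expected_columns, max_age_hours=24):
    """Return per-dimension scores in [0, 1] for a batch of dict records."""
    total = len(rows) or 1

    # Completeness: share of non-null values across the expected columns.
    non_null = sum(1 for r in rows for c in expected_columns if r.get(c) is not None)
    completeness = non_null / (total * len(expected_columns))

    # Conformity: share of rows whose "amount" field parses as a number.
    def conforms(r):
        try:
            float(r.get("amount", ""))
            return True
        except (TypeError, ValueError):
            return False
    conformity = sum(1 for r in rows if conforms(r)) / total

    # Freshness: full credit only if the newest record (a timezone-aware
    # datetime in "event_time") is younger than max_age_hours.
    newest = max((r["event_time"] for r in rows if r.get("event_time")), default=None)
    fresh = newest is not None and \
        (datetime.now(timezone.utc) - newest).total_seconds() / 3600 <= max_age_hours
    freshness = 1.0 if fresh else 0.0

    return {"completeness": completeness, "conformity": conformity, "freshness": freshness}

def data_trust_score(scores, weights=None):
    """Collapse per-dimension scores into a single 0-100 Data Trust Score."""
    weights = weights or {"completeness": 0.4, "conformity": 0.4, "freshness": 0.2}
    return round(100 * sum(scores[d] * w for d, w in weights.items()), 1)

In practice, the dimensions, their weights, and the acceptable thresholds would be learned from the dataset’s own history rather than hard-coded, which is what lets the approach scale without a growing rule catalog.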

Exhibit-1: Data Trust Score monitoring report from DataBuck

Crafting a Trust Framework for Cloudera 

Here is how Cloudera customers are building an autonomous trustability framework using DataBuck:

Machine Learning at Core: DataBuck leverages Cloudera-specific ML models that assess data quality. These models should consider Cloudera’s architecture, data structures, and unique challenges. 

In-situ Validation: DataBuck operates within the Cloudera environment, eliminating the need for data transfer. This not only speeds up validation but also minimizes security risks (a minimal sketch of this execution pattern follows this list).

Continuous Monitoring: The Cloudera framework leveraging DataBuck should continuously monitor data, adjusting trust scores as new data is ingested. 

User-Friendly Implementation: Keeping the end-user in mind, the process should be straightforward. A Cloudera user should be able to initiate data validation with minimal steps, perhaps even with a single click. 

Transparency in Results: The results should be easily interpretable by both technical and non-technical stakeholders. Dashboards, reports, and alerts can help in communicating the trust scores and any potential issues. 
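To make the in-situ and continuous-monitoring elements above more concrete, here is a minimal PySpark sketch of the kind of job that can run inside the Cloudera cluster itself, profiling a Hive table in place and emitting a score without moving any data. The table name, key columns, weights, and alert threshold are hypothetical, and DataBuck’s actual checks are ML-driven and far broader; this only illustrates the execution pattern.

# Minimal sketch: profile a Hive table in place with Spark so that no data
# leaves the Cloudera cluster. Table name, columns, weights, and the alert
# threshold are hypothetical; DataBuck's real checks are ML-driven and broader.
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("in-situ-dq-check")
         .enableHiveSupport()
         .getOrCreate())

TABLE = "finance.transactions"            # hypothetical table
KEY_COLUMNS = ["txn_id", "account_id", "amount"]
ALERT_THRESHOLD = 95.0                    # scores below this raise a flag

df = spark.table(TABLE)
row_count = max(df.count(), 1)

# Completeness: average share of non-null values across the key columns.
null_fractions = df.select(
    [(F.count(F.when(F.col(c).isNull(), c)) / F.lit(row_count)).alias(c)
     for c in KEY_COLUMNS]
).first()
completeness = 1.0 - sum(null_fractions[c] for c in KEY_COLUMNS) / len(KEY_COLUMNS)

# Uniqueness: the primary-key column should not contain duplicates.
uniqueness = df.select("txn_id").distinct().count() / row_count

trust_score = round(100 * (0.6 * completeness + 0.4 * uniqueness), 1)
print(f"{TABLE}: trust score {trust_score}")

if trust_score < ALERT_THRESHOLD:
    # In practice this would raise an alert and/or update the Atlas tag
    # described later in this article, rather than just print a warning.
    print(f"WARNING: {TABLE} fell below the {ALERT_THRESHOLD} threshold")

A job like this can be scheduled (for example, from Apache Airflow or Oozie) so the score is recomputed as new data lands, which is what turns a one-off validation into continuous monitoring.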

Industry-Specific Considerations 

Given Cloudera’s rising prominence in the financial, healthcare, and pharmaceutical sectors, the trust framework should consider industry-specific nuances: 

Regulatory Compliance: The validation tools should be designed to ensure data complies with industry regulations. For instance, in healthcare, ensuring PHI data is anonymized and meets HIPAA standards is crucial. 

Sensitive Data Handling: Especially relevant for the financial sector, the framework should have robust mechanisms to detect and handle sensitive data, ensuring it’s not compromised (a simple detection sketch follows this list).

R&D Data in Pharmaceuticals: Pharmaceutical companies invest heavily in R&D, leading to vast datasets that are both proprietary and sensitive. The trust framework should be able to handle such data, ensuring its integrity and security.
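As one concrete illustration of the sensitive-data point above, the sketch below scans a sample of records for values that look like identifiers (US Social Security numbers, email addresses, phone numbers) so that suspect columns can be flagged before they are exposed downstream. The patterns, sample size, and match threshold are simplistic assumptions; production-grade PHI/PII detection and HIPAA anonymization checks are considerably more rigorous.

# Simplistic sketch of sensitive-data detection: scan a sample of values for
# patterns that resemble identifiers. Real PHI/PII detection (and HIPAA-grade
# anonymization checks) would be far more sophisticated than these regexes.
import re

SENSITIVE_PATTERNS = {
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_phone": re.compile(r"^\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$"),
}

def flag_sensitive_columns(sample_rows, min_hit_ratio=0.5):
    """Return {column: pattern_name} for columns whose sampled values
    mostly match one of the sensitive patterns."""
    flagged = {}
    if not sample_rows:
        return flagged
    for column in sample_rows[0].keys():
        values = [str(r[column]) for r in sample_rows if r.get(column)]
        if not values:
            continue
        for name, pattern in SENSITIVE_PATTERNS.items():
            hits = sum(1 for v in values if pattern.match(v))
            if hits / len(values) >= min_hit_ratio:
                flagged[column] = name
                break
    return flagged

# Example: two sampled records from a hypothetical patient table.
sample = [
    {"patient_id": "A-1001", "ssn": "123-45-6789", "contact": "a@x.org"},
    {"patient_id": "A-1002", "ssn": "987-65-4321", "contact": "b@y.org"},
]
print(flag_sensitive_columns(sample))
# -> {'ssn': 'us_ssn', 'contact': 'email'}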

Using Cloudera Ranger to curate trusted data  

Proper governance for a data pipeline requires a chain of responsibility across various data personas. For example, raw data landed in a Data Lake by Data Engineers usually requires cleansing and curation with additional participation by Data Stewards. The Data Stewards establish policies that advise on or even limit access to data with poor quality. Data Consumers rely upon the Data Stewards to provide a source of truth.  

Cloudera offers a Shared Data Experience (SDX) that extends data governance across hybrid deployment options. Apache Atlas and Ranger are key components of SDX, as is Cloudera’s continued work on its own Data Catalog. FirstEigen DataBuck and Cloudera SDX complement each other.  

Consider some high-level details of how this integration works. DataBuck produces data quality scores. Custom tags created in Atlas map those scores as values of a table’s data quality attribute. Ranger policies then use the attribute values to authorize access to resources such as Hive or Impala tables. One example policy: “only Data Consumers who are in an Active Directory admin or developer group can see tables with a data quality score below 95%.” Cloudera and FirstEigen empower the Data Steward to mandate such policies.
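At a high level, the tagging step can be driven through the Apache Atlas REST API, with the access decision made by a Ranger tag-based policy. The sketch below attaches a classification carrying a data-quality score attribute to a Hive table entity; the classification name (DATA_QUALITY), attribute name (dq_score), host, cluster name, and credentials are all illustrative assumptions, and the matching Ranger policy (for example, deny access when dq_score is below 95 unless the user belongs to an admin or developer Active Directory group) would be defined by the Data Steward in the Ranger Admin console.

# Illustrative sketch: record a data-quality score on a Hive table as an
# Atlas classification attribute. The classification (DATA_QUALITY), its
# attribute (dq_score), the host, cluster name, and credentials are all
# assumptions; adapt them to the actual Atlas deployment.
import requests

ATLAS = "https://atlas.example.com:31443/api/atlas/v2"   # hypothetical endpoint
AUTH = ("atlas_user", "atlas_password")                   # hypothetical credentials

def find_hive_table_guid(db, table, cluster="cm"):
    """Look up the Atlas GUID of a Hive table by its qualified name."""
    resp = requests.get(
        f"{ATLAS}/entity/uniqueAttribute/type/hive_table",
        params={"attr:qualifiedName": f"{db}.{table}@{cluster}"},
        auth=AUTH)
    resp.raise_for_status()
    return resp.json()["entity"]["guid"]

def tag_with_trust_score(guid, score):
    """Attach a DATA_QUALITY classification (defined once in Atlas) with the score."""
    payload = [{"typeName": "DATA_QUALITY", "attributes": {"dq_score": score}}]
    resp = requests.post(f"{ATLAS}/entity/guid/{guid}/classifications",
                         json=payload, auth=AUTH)
    resp.raise_for_status()

# Example: record a score of 93.5 on finance.transactions. A Ranger tag-based
# policy keyed on DATA_QUALITY / dq_score then decides who may query the table.
guid = find_hive_table_guid("finance", "transactions")
tag_with_trust_score(guid, 93.5)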

Conclusions 

Accurate decisions require robust mechanisms to ensure data trustworthiness. Cloudera is gaining prominence, and its users need streamlined processes to establish this trust. By leveraging machine learning-based solutions, organizations can swiftly set up processes for autonomously measuring and monitoring Data Trustability for Cloudera Data Lake in as little as 60 seconds. Such rapid setups not only save time but also ensure that businesses can immediately start drawing insights from reliable, high-quality data, positioning them for success in the data-driven future. Furthermore, integrating Data Trust Scores with Cloudera Ranger is a powerful way to govern and curate trusted data. Business users will not see data errors, because trust-based curation acts as a circuit breaker that keeps erroneous data from influencing business decisions.
