
Data Lakehouse

Shashi Shankar

Mar 22, 2023

Accelerate insights by using a data lakehouse

What is a Data Lakehouse?

 

A data lakehouse is a unified data platform that combines the capabilities of a data lake and a data warehouse into a single architecture. It aims to address the shortcomings of traditional data warehouses and data lakes by providing a more integrated and flexible solution for storing, managing, and analyzing data.


History of Lakehouse

The term "lakehouse" was popularized by Databricks to describe its approach to unified data management and analytics, and the related "medallion" architecture describes how a lakehouse progressively refines data through layered tables. While there isn't a single industry-standard definition of a "Medallion Lakehouse," it generally refers to a unified data platform that combines the capabilities of both data lakes and data warehouses.

 

Here's a general interpretation of what a "Medallion Lakehouse" data architecture might entail:


 

Data Lake:

This component serves as a centralized repository for storing raw, unstructured, and semi-structured data at scale. The data lake allows organizations to ingest diverse data types from various sources without predefining a schema. It provides flexibility for data exploration, experimentation, and ad-hoc analysis.

 

Data Warehouse:

The data warehouse aspect of the architecture focuses on organizing and structuring data for optimized querying and analytics. It typically involves transforming and aggregating data from the data lake into a structured format suitable for business intelligence (BI) reporting, dashboards, and data analytics.

 

Unified Data Platform:

The "Medallion Lakehouse" concept likely emphasizes the integration and convergence of data lake and data warehouse capabilities into a single, unified platform. This unified approach enables organizations to leverage the strengths of both paradigms, such as scalability and flexibility from the data lake, and performance and governance from the data warehouse.

 

Analytics Capabilities:

In addition to storage and management, the Medallion Lakehouse architecture may encompass analytics capabilities such as data visualization, machine learning, and advanced analytics. These capabilities enable organizations to derive actionable insights and drive data-driven decision-making.

 

This architecture provides ACID (Atomicity, Consistency, Isolation, Durability) guarantees as data undergoes multiple stages of validation and transformation before being stored in a format optimized for efficient analytics.

 

The terms bronze (raw), silver (validated), and gold (enriched) describe the quality of the data in each of these layers.
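
As an illustration, here is a minimal PySpark sketch of the three medallion layers, assuming a Spark session with Delta Lake available (for example, on Databricks); the paths and column names are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

# Assumes a Spark session with Delta Lake configured (e.g. on Databricks);
# all paths and column names below are hypothetical.
spark = SparkSession.builder.getOrCreate()

# Bronze: land raw JSON exactly as ingested.
raw = spark.read.json("/lake/landing/orders/")
raw.write.format("delta").mode("append").save("/lake/bronze/orders")

# Silver: validate and deduplicate the raw records.
bronze = spark.read.format("delta").load("/lake/bronze/orders")
silver = (bronze
          .filter(F.col("order_id").isNotNull())
          .dropDuplicates(["order_id"]))
silver.write.format("delta").mode("overwrite").save("/lake/silver/orders")

# Gold: aggregate into a consumption-ready table for BI.
gold = (silver
        .groupBy("customer_id")
        .agg(F.sum("amount").alias("total_spend")))
gold.write.format("delta").mode("overwrite").save("/lake/gold/customer_spend")
```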


Lakehouse vs Data Lake vs Warehouse

Data Warehouse

·      Enterprise data warehouses prioritize query optimization for BI reports.

·      Query execution in data warehouses can be time-consuming, sometimes taking minutes or hours.

·      They are tailored for relatively static data that doesn't change frequently.

·      Data warehouses aim to avoid conflicts between simultaneous queries.

·      Some data warehouses utilize proprietary formats, restricting compatibility with machine learning applications.

 

Data Lake

·      Data lakes are cost-effective solutions for storing and processing large volumes of data efficiently.

·      Unlike data warehouses, which provide structured data for BI analytics, data lakes store data of any type and format.

·      Data lakes are commonly used for data science and machine learning applications due to their flexibility.

·      However, they are not typically used for BI reporting because the data they hold is often unvalidated.

 

Data Lakehouse

The data lakehouse combines the benefits of data lakes and data warehouses and provides:

·       Open, direct access to data stored in standard data formats.

·       Indexing protocols optimized for machine learning and data science.

·       Low query latency and high reliability for BI and advanced analytics.
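
To illustrate the open, direct access point above, here is a small sketch that reads the same Delta table from Spark SQL (a BI-style query) and from pandas via the delta-rs `deltalake` package, assuming that package is installed; the table path is hypothetical.

```python
# The same Delta table, stored as open Parquet files plus a transaction log,
# can be read by different engines. The table path below is hypothetical.

# BI-style access through Spark SQL:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark.read.format("delta").load("/lake/gold/customer_spend") \
     .createOrReplaceTempView("customer_spend")
spark.sql("SELECT customer_id, total_spend FROM customer_spend "
          "ORDER BY total_spend DESC LIMIT 10").show()

# Data-science access without a Spark cluster, via the delta-rs Python bindings:
from deltalake import DeltaTable
df = DeltaTable("/lake/gold/customer_spend").to_pandas()
print(df.describe())
```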


Key characteristics of a data lakehouse:

 

Unified Storage:

A data lakehouse stores both structured and unstructured data in a centralized repository, similar to a data lake. This allows organizations to ingest and store raw data from various sources without predefined schemas, in both batch and real-time (streaming) modes.
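
As a sketch of what this looks like in practice, the snippet below appends a one-off batch load and a continuous file stream to the same Delta table, assuming Spark with Delta Lake and Structured Streaming; the paths are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Batch ingestion: append a one-off file drop to the bronze table.
batch_df = spark.read.json("/lake/landing/events/2023-03-22/")
batch_df.write.format("delta").mode("append").save("/lake/bronze/events")

# Streaming ingestion: continuously append new files to the same table.
stream_df = (spark.readStream
             .format("json")
             .schema(batch_df.schema)   # streaming file sources need an explicit schema
             .load("/lake/landing/events/stream/"))

(stream_df.writeStream
 .format("delta")
 .option("checkpointLocation", "/lake/_checkpoints/bronze_events")
 .outputMode("append")
 .start("/lake/bronze/events"))
```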

 

Schema Enforcement:

Unlike traditional data lakes, which rely purely on schema-on-read, a data lakehouse can enforce a schema as data is written to its curated tables, ensuring that data is structured and validated before it reaches analysis, similar to a data warehouse. This enables organizations to maintain data quality and consistency while still leveraging the flexibility of a data lake.
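
For example, with Delta Lake as the table format, an append whose schema does not match the target table is rejected unless schema evolution is explicitly enabled. The snippet below is a minimal sketch with hypothetical paths and columns.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Existing silver table has columns (order_id, amount). Paths are hypothetical.
good = spark.createDataFrame([(1, 10.0)], ["order_id", "amount"])
good.write.format("delta").mode("append").save("/lake/silver/orders")

# A write with an extra column is rejected by Delta Lake's schema enforcement
# instead of silently changing the table.
bad = spark.createDataFrame([(2, 20.0, "EUR")],
                            ["order_id", "amount", "currency"])
try:
    bad.write.format("delta").mode("append").save("/lake/silver/orders")
except Exception as e:
    print("Write rejected:", type(e).__name__)

# Schema evolution has to be opted into explicitly:
bad.write.format("delta").mode("append") \
   .option("mergeSchema", "true").save("/lake/silver/orders")
```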

 

Optimized Analytics:

A data lakehouse provides tools and technologies for processing and analyzing data stored in the unified repository. This includes support for SQL queries, data indexing, and optimization techniques to accelerate analytical queries and improve performance.
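
As a rough example, the snippet below runs a standard SQL query against a Delta table and then compacts and Z-orders it. The OPTIMIZE ... ZORDER BY statement is Delta Lake SQL (Databricks and recent open-source Delta releases), and the paths are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Standard SQL over the gold table (the path is hypothetical).
spark.sql("""
    SELECT customer_id, total_spend
    FROM delta.`/lake/gold/customer_spend`
    WHERE total_spend > 1000
""").show()

# Compact small files and co-locate data on a frequently filtered column
# to speed up analytical queries.
spark.sql("OPTIMIZE delta.`/lake/gold/customer_spend` ZORDER BY (customer_id)")
```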

 

Scalability and Flexibility:

By combining the scalability of a data lake with the structured querying capabilities of a data warehouse, a data lakehouse offers scalability and flexibility to handle large volumes of data and diverse analytical workloads.

 

Data Governance and Security:

A data lakehouse includes features for data governance, access control, and data lineage tracking to ensure compliance with regulatory requirements and security standards.
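
As a small illustration of access control in this style of platform, the snippet below uses Unity Catalog-style GRANT statements on Databricks; the catalog, schema, table, and group names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Grant read access on a gold table to an analyst group and inspect the grants.
# Catalog, schema, table, and group names are hypothetical; the GRANT / SHOW
# GRANTS syntax shown is the Unity Catalog SQL form on Databricks.
spark.sql("GRANT SELECT ON TABLE main.gold.customer_spend TO `data-analysts`")
spark.sql("SHOW GRANTS ON TABLE main.gold.customer_spend").show()
```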

 

Domains of Data Lakehouse Platform

 

Storage: In the cloud, data is mainly stored in scalable, efficient, and resilient object storage services provided by the cloud providers.

Governance: Capabilities around data governance, e.g. access control, auditing, metadata management, lineage tracking, and monitoring for all data and AI assets.

AI engine: The AI Engine provides generative AI capabilities for the whole platform.

Ingest & transform: The capabilities for ETL workloads.

Advanced analytics, ML & AI: All capabilities around machine learning, AI, generative AI, and streaming analytics.

Data warehouse: The domain supporting DWH and BI use cases.

Orchestration: The domain for central workflow management.

ETL & DS tools: The front-end tools that data engineers, data scientists and ML engineers primarily use for work.

BI tools: The front-end tools that BI analysts primarily use for work.

Collaboration: Capabilities for data sharing between two or more parties.

 

Seven Pillars of Well-architected Lakehouse Framework

 

Data governance

The oversight to ensure that data brings value and supports your business strategy.

Interoperability and usability

The ability of the lakehouse to interact with users and other systems.

Operational excellence

All operations processes that keep the lakehouse running in production.

Security, privacy, compliance

Protect the lakehouse platform (e.g., Azure Databricks), customer workloads, and customer data from threats.

Reliability

The ability of a system to recover from failures and continue to function.

Performance efficiency

The ability of a system to adapt to changes in load.

Cost optimization

Managing costs to maximize the value delivered.


Databricks Lakehouse on Azure


Key Components of Databricks Lakehouse

  • Spark engine - compute

  • Azure Data Lake Storage Gen2 (ADLS Gen2) - storage

  • Delta Lake – an optimized logical storage layer on top of ADLS Gen2 storage that provides ACID transactions

  • Unity Catalog – a fine-grained governance solution for data and AI
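
A minimal sketch tying these components together might look like the following: Spark (compute) reads Delta files from ADLS Gen2 (storage) and registers the result as a governed Unity Catalog table. The storage account, catalog, schema, and table names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Storage account, container, catalog, schema, and table names are hypothetical.
adls_path = "abfss://lake@mystorageacct.dfs.core.windows.net/gold/customer_spend"

# Spark (compute) reads Delta files stored on ADLS Gen2 (storage)...
df = spark.read.format("delta").load(adls_path)

# ...and the same data can be registered as a governed three-level
# (catalog.schema.table) Unity Catalog table.
df.write.format("delta").mode("overwrite") \
  .saveAsTable("main.gold.customer_spend")
```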


Guiding Principles for the Lakehouse

 

  • Curate data and provide only trusted data for consumption

    • Ingest Layer – raw data

    • Curated Layer – deduped data

    • Final Layer – curated data for consumption



  • Eliminate data silos

  • Democratize data consumption

  • Implement organization-wide data governance

  • Build to scale

  • Optimize for performance and cost


Lakehouse on Amazon AWS



Amazon S3 and AWS Glue (Amazon Web Services): Amazon S3 (Simple Storage Service) is a highly scalable cloud storage service, and AWS Glue is a fully managed extract, transform, and load (ETL) service. Together, they enable building a data lakehouse architecture on AWS.
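
As a rough sketch of this pattern, the following AWS Glue PySpark job reads a table registered in the Glue Data Catalog (backed by S3), transforms it with Spark, and writes curated Parquet back to S3; the database, table, and bucket names are hypothetical.

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Minimal AWS Glue ETL job sketch; database, table, and bucket names are hypothetical.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read raw data registered in the Glue Data Catalog (backed by S3)...
raw = glue_context.create_dynamic_frame.from_catalog(
    database="lake_bronze", table_name="orders_raw")

# ...transform with Spark, then write curated Parquet back to S3.
curated = raw.toDF().dropDuplicates(["order_id"])
(curated.write.mode("overwrite")
        .parquet("s3://my-lake-bucket/silver/orders/"))

job.commit()
```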


Data Lakehouse Framework for Google Cloud Platform (GCP)



Google Cloud Storage and Google BigQuery (Google Cloud): Google Cloud Storage is a scalable and durable object storage service, and Google BigQuery is a serverless data warehouse for analytics. They can be combined to create a cloud data lakehouse architecture on Google Cloud.
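
As a rough sketch of this pattern, the snippet below loads curated Parquet files from Cloud Storage into a BigQuery table and queries it with standard SQL, using the google-cloud-bigquery client library; the project, dataset, table, and bucket names are hypothetical.

```python
from google.cloud import bigquery

# Load curated Parquet files from Cloud Storage into a BigQuery table.
# Project, dataset, table, and bucket names are hypothetical.
client = bigquery.Client(project="my-gcp-project")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

load_job = client.load_table_from_uri(
    "gs://my-lake-bucket/gold/customer_spend/*.parquet",
    "my-gcp-project.analytics.customer_spend",
    job_config=job_config,
)
load_job.result()  # wait for the load to finish

# Query the loaded table with standard SQL.
rows = client.query(
    "SELECT customer_id, total_spend "
    "FROM `my-gcp-project.analytics.customer_spend` "
    "ORDER BY total_spend DESC LIMIT 10"
).result()
for row in rows:
    print(row.customer_id, row.total_spend)
```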



