Unleashing the Power of Databricks Lakehouse Federation

3 October 2024

Do you want to query data from many external sources without moving it into your lakehouse? Databricks’ Lakehouse Federation lets you analyze data across platforms in real time, which helps you break down data silos. You can query and catalog external data directly through Unity Catalog’s unified governance layer.

This post explores why Lakehouse Federation is changing the way teams manage data.

But that’s not all! We’ve also prepared a video where Hubert explains all the features in practice. You can find the link to the video at the end of this blog post – don’t miss out on these valuable insights.

What is Lakehouse Federation?

Lakehouse Federation lets you query external data sources using a robust Databricks SQL warehouse powered by Spark. You don’t need to ingest or move data into your Lakehouse. This suits businesses that want to cut data duplication and storage costs while still accessing external data for analytics and machine learning.

 

Supported data sources are:

  • Microsoft Synapse
  • Amazon Redshift
  • Google BigQuery
  • Snowflake
  • Salesforce Data Cloud
  • MySQL
  • PostgreSQL
  • Microsoft SQL Server / Azure SQL
  • Other Databricks Unity Catalog
  • More are on the way, including Teradata and Oracle, along with possible federation of Hive metastores and AWS Glue

The Technology Behind It

Lakehouse Federation uses the Apache Spark engine to process data across diverse sources. This keeps large reads efficient, especially when joining external data with data already in the Lakehouse. Where the source supports it, Spark pushes filters and aggregations down into the query sent to the external system, so only the required data is transferred. This pushdown minimizes data transfer and improves performance by executing computations closer to the data source.
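To make pushdown more concrete, here is a minimal sketch in Python that builds a query whose filter and aggregation are natural candidates for pushdown, and then wraps it in `EXPLAIN FORMATTED` so the plan can be inspected. The catalog, schema, table, and column names (`azure_sql_catalog.dbo.products`, `price`, `category_id`) are assumptions made up for illustration. Whether a given operation is actually pushed down depends on the source and the query.

```python
# Hypothetical three-level name (catalog.schema.table) of a federated table.
FOREIGN_TABLE = "azure_sql_catalog.dbo.products"


def filtered_aggregate(table: str, min_price: int) -> str:
    """Build a query whose WHERE and GROUP BY are candidates for pushdown."""
    return (
        f"SELECT category_id, COUNT(*) AS product_count\n"
        f"FROM {table}\n"
        f"WHERE price >= {min_price}\n"
        f"GROUP BY category_id"
    )


def explain(query: str) -> str:
    """Wrap a query in EXPLAIN FORMATTED so the query plan can be inspected."""
    return f"EXPLAIN FORMATTED\n{query}"


query = filtered_aggregate(FOREIGN_TABLE, 100)
explain_statement = explain(query)
print(explain_statement)

# In a Databricks notebook, where `spark` is predefined, the plan could be
# displayed with:
#   spark.sql(explain_statement).show(truncate=False)
```

Reading the resulting plan is a practical way to check which parts of a query run in the external system and which run in Spark.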

Key Benefits of Lakehouse Federation

 

  1. Real-Time Queries Across External Data: Run real-time queries on external data sources. This saves time and resources by avoiding data movement.
  2. No Data Movement Required: Lakehouse Federation queries data in place. This reduces data duplication and storage costs while keeping access to your data flexible.
  3. Unified Data Management with Unity Catalog: Integrate external data into Unity Catalog, which provides enterprise-level access management, data lineage, and governance. All your data is managed under one roof, regardless of its location.
  4. Delta Sharing: Share data without first ingesting it into the Lakehouse.
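As an illustration of the unified governance point, the sketch below generates the Unity Catalog `GRANT` statements that would typically give a group read access to a single federated table. The catalog, schema, table, and group names are hypothetical, and the exact privileges your setup needs may differ.

```python
def grant_read_access(catalog: str, schema: str, table: str, principal: str) -> list[str]:
    """Return GRANT statements that let a principal read one federated table."""
    return [
        # The principal must be able to use the catalog and the schema...
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{principal}`",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO `{principal}`",
        # ...and to select from the table itself.
        f"GRANT SELECT ON TABLE {catalog}.{schema}.{table} TO `{principal}`",
    ]


# Hypothetical names: a foreign catalog over Azure SQL and an "analysts" group.
statements = grant_read_access("azure_sql_catalog", "dbo", "products", "analysts")
for statement in statements:
    print(statement)

# In a Databricks notebook, each statement could be run with spark.sql(statement).
```

The statements have the same shape as those used for tables stored in the lakehouse, which is the practical meaning of managing everything under one roof.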

 

Hands-on demonstration by our Databricks Expert – check out our special video with all the insights.

 

Hubert Dudek, our Databricks Expert, demonstrates Lakehouse Federation in action. His guide shows how to use Azure SQL Database with Databricks. The video walks through setup and usage, offering insight for anyone who wants to use this integration.

 

  • Setting Up a Connection: Learn to create a secure connection between Databricks and external sources like Azure SQL. Use the Databricks UI or a programmatic approach.
  • Building a Catalog: See how to register the external data source as a catalog in Unity Catalog.
  • Performing Joins Across Sources: See how to join data from external sources with existing lakehouse data. Hubert shows this by joining product info from Azure SQL with the lakehouse’s existing product categories.
  • Query Execution: Lakehouse Federation speeds up queries by pushing filters and aggregations down to the external data source. This improves speed and efficiency by fetching only the necessary data.
  • Creating Materialized Views: Hubert explains how to create materialized views for frequently queried data. These views boost performance by storing query results in the lakehouse, so the external source is read when the view refreshes rather than on every query.
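The steps above can be sketched programmatically. The following Python snippet is not taken from the video. It is a minimal illustration that builds the Databricks SQL statements for each step against an Azure SQL source. All names are assumptions: the connection and catalog names, the host, the secret scope and keys, and the `products` and `product_categories` tables with their columns. Credentials are read with the SQL `secret()` function rather than written inline.

```python
def create_connection(name: str, host: str, secret_scope: str, port: int = 1433) -> str:
    """Step 1: define a connection to a SQL Server / Azure SQL host."""
    return (
        f"CREATE CONNECTION IF NOT EXISTS {name} TYPE sqlserver\n"
        f"OPTIONS (\n"
        f"  host '{host}',\n"
        f"  port '{port}',\n"
        # Hypothetical secret keys; store real credentials in a secret scope.
        f"  user secret('{secret_scope}', 'sql-user'),\n"
        f"  password secret('{secret_scope}', 'sql-password')\n"
        f")"
    )


def create_foreign_catalog(catalog: str, connection: str, database: str) -> str:
    """Step 2: register the external database as a catalog in Unity Catalog."""
    return (
        f"CREATE FOREIGN CATALOG IF NOT EXISTS {catalog} "
        f"USING CONNECTION {connection}\n"
        f"OPTIONS (database '{database}')"
    )


def join_products_with_categories(catalog: str) -> str:
    """Step 3: join a federated table with a table already in the lakehouse."""
    return (
        f"SELECT p.product_name, c.category_name\n"
        f"FROM {catalog}.dbo.products AS p\n"
        f"JOIN main.sales.product_categories AS c\n"
        f"  ON p.category_id = c.category_id"
    )


def create_materialized_view(view: str, query: str) -> str:
    """Step 4: store the result of a frequently used query."""
    return f"CREATE MATERIALIZED VIEW {view} AS\n{query}"


# All values below are placeholders for illustration only.
connection_sql = create_connection(
    "azure_sql_conn", "example.database.windows.net", "federation-secrets"
)
catalog_sql = create_foreign_catalog("azure_sql_catalog", "azure_sql_conn", "shop")
join_sql = join_products_with_categories("azure_sql_catalog")
view_sql = create_materialized_view("main.sales.products_with_categories", join_sql)
refresh_sql = "REFRESH MATERIALIZED VIEW main.sales.products_with_categories"

for statement in (connection_sql, catalog_sql, join_sql, view_sql, refresh_sql):
    print(statement, end="\n\n")

# In a Databricks notebook, each statement could be run with spark.sql(statement).
```

Keeping each step as a plain SQL string means the same statements can be pasted into the SQL editor or run from a notebook. Materialized views have their own compute requirements, so check that your workspace supports them before relying on the last two statements.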

 

 

 

Unlocking New Use Cases with Lakehouse Federation

 

Bridging clouds and blending architectures, Lakehouse Federation unlocks new potential. It connects hybrid data systems. It turns multi-cloud setups into powerful, efficient, unified sources of insight. By allowing direct querying of external data sources, organizations can:

 

  • Enhance machine learning models by incorporating data from external sources.
  • Speed up analytics projects by reducing the time and complexity of data preparation.
  • Simplify governance with unified metadata management through Unity Catalog.
  • Migration: By leveraging Lakehouse Federation during migration, you can access and work with data across multiple sources without moving it immediately into Databricks. This simplifies the replication of data layers, enabling an incremental and less risky migration process with minimal disruption. It allows your teams to start utilizing Databricks’ capabilities early, ensuring a smoother transition and faster realization of benefits.
  • Data Sharing: While Delta Sharing has become the industry standard for data sharing, Lakehouse Federation allows you to share data from sources that do not natively support it. This enables you to collaborate with other departments within your company or share data with customers through secure clean rooms.
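For the migration use case, one possible pattern is to copy federated tables into Delta tables one at a time, while the remaining tables continue to be queried in place. The sketch below generates a `CREATE OR REPLACE TABLE ... AS SELECT` statement per table. The source and target table names are hypothetical, and a real migration would likely add incremental loads and validation on top of this.

```python
def copy_to_delta(foreign_table: str, target_table: str) -> str:
    """Build a statement that copies one federated table into the lakehouse."""
    return f"CREATE OR REPLACE TABLE {target_table} AS\nSELECT * FROM {foreign_table}"


# Hypothetical mapping from federated source tables to lakehouse targets.
# Tables not listed here would still be queried through the foreign catalog.
MIGRATION_PLAN = {
    "azure_sql_catalog.dbo.products": "main.bronze.products",
    "azure_sql_catalog.dbo.orders": "main.bronze.orders",
}

migration_statements = [
    copy_to_delta(source, target) for source, target in MIGRATION_PLAN.items()
]

for statement in migration_statements:
    print(statement, end="\n\n")

# In a Databricks notebook, each statement could be run with spark.sql(statement).
```

Because queries only differ in the catalog they reference, teams can switch a workload from the foreign catalog to the migrated table once its copy has been verified.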

 

Ready to explore? 

Unlock data’s full potential with Databricks Lakehouse Federation. This powerful tool breaks down silos, offering instant insights from scattered sources. No more costly transfers or lengthy waits – analyze live data wherever it lives. Improve your decision-making. Be faster, deeper, and more flexible. Keep the information in its proper place.

Watch Hubert Dudek put Lakehouse Federation through its paces in our latest video. His presentation highlights the platform’s potential and versatility.

And while you’re here, check out our other blog post related to Databricks: What’s new in Databricks (September 2024). There’s a wealth of information waiting for you!