Unifying Your Data Ecosystem with Delta Lake Integration

As organizations mature their data infrastructure and accumulate more data than ever before in their data lakes, open and reliable table formats such as Delta Lake become a critical necessity.

Thousands of companies are already using Delta Lake in production, and open-sourcing all of Delta Lake (as announced in June 2022) has further increased its adoption across various domains and verticals.

Since many of these companies use both Databricks and other data and AI frameworks (e.g., Power BI, Trino, Flink, Spark on Kubernetes) as part of their tech stack, it's crucial for them to be able to read and write from/to Delta Lake using all of these frameworks.

The goal of this blog post is to help these users do so, as seamlessly as possible.

Integration Options

Databricks provides several options to read data from and write data to the lakehouse. These options differ from one another on various parameters, and each fits different use cases.

The parameters we use to evaluate these options are:

  1. Read Only/Read Write – Does this option provide read/write access or read only?
  2. Upfront investment – Does this integration option require any custom development or setting up another component?
  3. Execution overhead – Does this option require a compute engine (cluster or SQL warehouse) between the data and the client application?
  4. Cost – Does this option entail any additional cost (beyond the operational cost of the storage and the client)?
  5. Catalog – Does this option provide a catalog (such as the Hive Metastore) the client can use to browse for data assets and retrieve metadata from?
  6. Access to storage – Does the client need direct network access to the cloud storage?
  7. Scalability – Does this option rely on scalable compute on the client, or does it provide compute for the client?
  8. Concurrent write support – Does this option handle concurrent writes, allowing writes from multiple clients, or from a client and Databricks at the same time? (Docs)

Direct Cloud Storage Access

Access the data directly on the cloud storage. External tables (AWS/Azure/GCP) in Databricks Unity Catalog (UC) can be accessed directly using the path of the table. This requires the client to store the path, have a networking path to the storage, and have permission to access the storage directly.

a. Pros

  1. No upfront investment (no scripting or tooling is required)
  2. No execution overhead
  3. No additional cost

b. Cons

  1. No catalog – requires the developer to register and manage the location
  2. No discovery capabilities
  3. Limited metadata (no metadata for non-Delta tables)
  4. Requires access to storage
  5. No governance capabilities
    • No table ACLs: permissions are managed at the file/folder level
    • No audit
  6. Limited concurrent write support
  7. No built-in scalability – the reading application has to handle scalability for large data sets

c. Flow:

[Figure: Integrating Delta Lake with other platforms]
  1. Read:
    • Databricks performs the ingestion (1)
    • It persists the file to a table defined in Unity Catalog. The data is persisted to the cloud storage (2)
    • The client is provided with the path to the table. It uses its own storage credentials (SPN/Instance Profile) to access the cloud storage directly and read the table/files.
  2. Write:
    • The client writes directly to the cloud storage using a path. The path is then used to create a table in UC. The table is then available for read operations in Databricks.
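For a client with no catalog and no engine in the middle, direct access means speaking the Delta protocol itself. The sketch below is a simplified illustration of how a reader determines the current set of data files by replaying a table's `_delta_log` JSON commits; it ignores checkpoints, protocol versioning, and deletion vectors, and the local temp directory and part-file names are placeholders standing in for a real cloud storage path.

```python
import json
import tempfile
from pathlib import Path


def active_files(table_path: str) -> set:
    """Replay _delta_log commits in order, applying add/remove actions.

    Simplified sketch: real readers must also honor checkpoints and
    protocol-version actions before trusting the result.
    """
    files = set()
    log_dir = Path(table_path) / "_delta_log"
    for commit in sorted(log_dir.glob("*.json")):  # commits are zero-padded, so lexical order works
        for line in commit.read_text().splitlines():
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files


# Demo: build a tiny fake table with two commits (placeholder file names).
table = tempfile.mkdtemp()
log = Path(table) / "_delta_log"
log.mkdir()
(log / "00000000000000000000.json").write_text(
    '{"add": {"path": "part-0000.parquet"}}\n'
    '{"add": {"path": "part-0001.parquet"}}\n')
(log / "00000000000000000001.json").write_text(
    '{"remove": {"path": "part-0000.parquet"}}\n'
    '{"add": {"path": "part-0002.parquet"}}\n')

print(active_files(table))  # the set {part-0001.parquet, part-0002.parquet}
```

In practice you would not write this yourself: libraries such as delta-rs (the `deltalake` Python package) implement the full protocol and accept a cloud URI plus storage credentials directly.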

External Hive Metastore (Bidirectional Sync)

In this scenario, we sync the metadata in Unity Catalog with an external Hive Metastore (HMS), such as Glue, on a regular basis. We keep a number of databases in sync with the external catalog. This allows a client using a Hive-supported reader to access the table. Similarly to the previous solution, it requires the client to have direct access to the storage.

a. Pros

  1. The catalog provides a list of the tables and manages the location
  2. Discoverability allows the user to browse and find tables

b. Cons

  1. Requires upfront setup
  2. Governance overhead – this solution requires redundant management of access: UC relies on table ACLs while the Hive Metastore relies on storage access permissions
  3. Requires a custom script to keep the Hive Metastore metadata up to date with the Unity Catalog metadata
  4. Limited concurrent write support
  5. No built-in scalability – the reading application has to handle scalability for large data sets

c. Flow:

[Figure: Integrating Delta Lake with other platforms]
  1. The client creates a table in HMS
  2. The table is persisted to the cloud storage
  3. A sync script (custom script) syncs the table metadata between HMS and Unity Catalog
  4. A Databricks cluster/SQL warehouse looks up the table in UC
  5. The table files are accessed via UC from the cloud storage
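At its core, the custom sync script in step 3 diffs the two catalogs' table listings and emits DDL for whichever side is missing a table. A minimal, engine-agnostic sketch of that idea follows; the table names, locations, and schema below are hypothetical, and a real script would fetch these mappings from the HMS/Glue and Unity Catalog APIs and execute the generated statements rather than print them.

```python
def sync_ddl(hms_tables: dict, uc_tables: dict, uc_schema: str) -> list:
    """Given {table_name: storage_location} mappings for HMS and UC,
    return CREATE TABLE statements that register HMS-only tables in UC."""
    statements = []
    for name, location in sorted(hms_tables.items()):
        if name not in uc_tables:
            statements.append(
                f"CREATE TABLE {uc_schema}.{name} USING DELTA LOCATION '{location}'"
            )
    return statements


# Hypothetical listings, as a real script might fetch from Glue and UC.
hms = {"events": "s3://bucket/tables/events", "users": "s3://bucket/tables/users"}
uc = {"events": "s3://bucket/tables/events"}

for stmt in sync_ddl(hms, uc, "main.analytics"):
    print(stmt)
# CREATE TABLE main.analytics.users USING DELTA LOCATION 's3://bucket/tables/users'
```

Running the diff in both directions (and handling drops and location changes) is what makes the sync bidirectional; scheduling it as a periodic job keeps the catalogs converged.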

Delta Sharing

Access Delta tables via Delta Sharing (read more about Delta Sharing here).

The data provider creates a share for existing Delta tables, and the data recipient can access the data defined within the share configuration. The shared data is kept up to date and supports real-time/near-real-time use cases, including streaming.

Generally speaking, the data recipient connects to a Delta Sharing server via a Delta Sharing client (which is supported by a variety of tools). A Delta Sharing client is any tool that supports direct reads from a Delta Sharing source. A signed URL is then provided to the Delta Sharing client, and the client uses it to access the Delta table storage directly and read only the data it is allowed to access.

On the data provider end, this approach removes the need to manage permissions at the storage level and provides certain audit capabilities (at the share level).

On the data recipient end, the data is consumed using one of the aforementioned tools, which means the recipient also needs to handle the compute scalability on their own (e.g., using a Spark cluster).

a. Pros

  1. Catalog + discoverability
  2. Doesn't require permissions on the storage (enforced at the share level)
  3. Gives you audit capabilities (albeit limited – at the share level)

b. Cons

  1. Read-only
  2. You have to handle scalability on your own (e.g., use Spark)

c. Flow:

[Figure: Integrating Delta Lake with other platforms]
  1. Databricks ingests data and creates a UC table
  2. The data is saved to the cloud storage
  3. A Delta Sharing provider is created and the table/database is shared. The access token is provided to the client
  4. The client accesses the Delta Sharing server and looks up the table
  5. The client is given access to read the table files from the cloud storage
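On the recipient side, the access token from step 3 typically arrives packaged in a profile file, and clients such as the open-source `delta-sharing` Python package read tables through it. The sketch below builds such a profile and the table address; the endpoint, token, and share/schema/table names are placeholders, and the actual `load_as_pandas` call is left commented out because it requires a reachable sharing server.

```python
import json
import tempfile

# A Delta Sharing profile as the provider would deliver it (placeholder values).
profile = {
    "shareCredentialsVersion": 1,
    "endpoint": "https://sharing.example.com/delta-sharing/",  # hypothetical server
    "bearerToken": "<token-from-provider>",
}
with tempfile.NamedTemporaryFile("w", suffix=".share", delete=False) as f:
    json.dump(profile, f)
    profile_path = f.name

# Tables are addressed as <profile-file>#<share>.<schema>.<table>.
table_url = f"{profile_path}#retail_share.sales.orders"

# With the delta-sharing package installed and a live server:
# import delta_sharing
# df = delta_sharing.load_as_pandas(table_url)

print(table_url.split("#", 1)[1])  # retail_share.sales.orders
```

Because the client receives short-lived signed URLs and reads the Parquet files itself, the recipient's own compute (e.g., a Spark cluster for large tables) determines read scalability, as noted above.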

JDBC/ODBC connector (write/read from anywhere using Databricks SQL)

The JDBC/ODBC connector allows you to connect your backend application, using JDBC/ODBC, to a Databricks SQL warehouse (as described here).

This essentially is no different from what you would normally do when connecting backend applications to a database.

Databricks and some third-party developers provide wrappers for the JDBC/ODBC connector that allow direct access from various environments, including:

  • Python Connector (Docs)
  • Node.js Connector (Docs)
  • Go Connector (Docs)
  • SQL Execution API (Docs)

This solution is suitable for standalone clients, as the computing power is the Databricks SQL warehouse (hence the compute scalability is handled by Databricks).

As opposed to the Delta Sharing approach, the JDBC/ODBC connector approach also allows you to write data to Delta tables (it even supports concurrent writes).
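With the Python wrapper (the `databricks-sql-connector` package), the warehouse behaves like any other database behind a DB-API-style interface. The sketch below wraps the documented connect/cursor calls in a small helper; the hostname, HTTP path, and token in the usage comment are placeholders, and nothing executes until you call the helper against a live SQL warehouse.

```python
def query_warehouse(server_hostname: str, http_path: str, access_token: str,
                    statement: str) -> list:
    """Run a statement on a Databricks SQL warehouse and return all rows.

    Requires the databricks-sql-connector package and a running warehouse;
    the import is done lazily so this sketch loads without it installed.
    """
    from databricks import sql

    with sql.connect(server_hostname=server_hostname,
                     http_path=http_path,
                     access_token=access_token) as connection:
        with connection.cursor() as cursor:
            cursor.execute(statement)
            return cursor.fetchall()


# Usage against a real workspace (placeholder values):
# rows = query_warehouse(
#     "adb-1234567890123456.7.azuredatabricks.net",
#     "/sql/1.0/warehouses/abc123def456",
#     "dapi...",
#     "SELECT * FROM main.default.events LIMIT 10",
# )
```

Writes work the same way (e.g., an `INSERT INTO` statement instead of a `SELECT`), with Delta Lake's transaction log handling the concurrency between this client and other writers.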

a. Pros

  1. Scalability is handled by Databricks
  2. Full governance and audit
  3. Easy setup
  4. Concurrent write support (Docs)

b. Cons

  1. Cost
  2. Suitable for standalone clients (less so for distributed execution engines like Spark)

c. Flow:

[Figure: Integrating Delta Lake with other platforms]
  1. Databricks ingests data and creates a UC table
  2. The data is saved to the cloud storage
  3. The client uses a JDBC connection to authenticate and query a SQL warehouse
  4. The SQL warehouse looks up the data in Unity Catalog. It applies ACLs, accesses the data, performs the query, and returns a result set to the client

d. Note that if you have Unity Catalog enabled in your workspace, you also get full governance and audit of the operations. You can still use the approach described above without Unity Catalog, but governance and auditing will be limited.

e. This is the only option that supports row-level filtering and column filtering.

Integration options and use-cases matrix

This chart shows how well each of the solution alternatives described above fits a select list of common use cases. They are rated 0-3:

  • 0 – N/A
  • 1 – Requires adjustment to fit the use case
  • 2 – Limited fit (can provide the required functionality with some limitations)
  • 3 – Good fit