Databricks on GCP – A practitioner’s guide to data exfiltration protection.


The Databricks Lakehouse Platform provides a unified set of tools for building, deploying, sharing, and maintaining enterprise-grade data solutions at scale. Databricks integrates with Google Cloud and its security controls in your cloud account, and manages and deploys cloud infrastructure on your behalf.

The overarching goal of this article is to mitigate the following risks:

  • Data access from a browser on the internet or an unauthorized network using the Databricks web application.
  • Data access from a client on the internet or an unauthorized network using the Databricks API.
  • Data access from a client on the internet or an unauthorized network using the Cloud Storage (GCS) API.
  • A compromised workload on the Databricks cluster writes data to an unauthorized storage resource on GCP or the internet.

Databricks supports several GCP native tools and services that help protect data in transit and at rest. One such service is VPC Service Controls, which provides a way to define security perimeters around Google Cloud resources. Databricks also supports network security controls, such as firewall rules based on network or secure tags. Firewall rules allow you to control inbound and outbound traffic to your GCE virtual machines.

Encryption is another important component of data protection. Databricks supports several encryption options, including customer-managed encryption keys, key rotation, and encryption at rest and in transit. Databricks-managed encryption keys are used by default and enabled out of the box. Customers can also bring their own encryption keys managed by Google Cloud Key Management Service (KMS).

Before we begin, let’s look at the Databricks deployment architecture:

Databricks is structured to enable secure cross-functional team collaboration while keeping a significant amount of backend services managed by Databricks, so you can stay focused on your data science, data analytics, and data engineering tasks.

Databricks operates out of a control plane and a data plane.

  • The control plane includes the backend services that Databricks manages in its own Google Cloud account. Notebook commands and other workspace configurations are stored in the control plane and encrypted at rest.
  • The data plane is managed by your Google Cloud account and is where your data resides. This is also where data is processed. You can use built-in connectors so your clusters can connect to data sources to ingest data or for storage. You can also ingest data from external streaming data sources, such as events data, streaming data, IoT data, and more.

The following diagram represents the flow of data for Databricks on Google Cloud:

High-level Architecture

High-level view of the default deployment architecture.

Network Communication Path

Let’s understand the communication path we want to secure. Databricks can be consumed by users and applications in numerous ways, as shown below:

High-level view of the communication paths.

A Databricks workspace deployment includes the following network paths to secure:

  1. Users who access the Databricks web application, aka the workspace
  2. Users or applications that access Databricks REST APIs
  3. Databricks data plane VPC network to the Databricks control plane service. This includes the secure cluster connectivity relay and the workspace connection for the REST API endpoints.
  4. Data plane to your storage services
  5. Data plane to external data sources, e.g. package repositories like PyPI or Maven

From an end-user perspective, paths 1 and 2 require ingress controls, and paths 3, 4, and 5 require egress controls.

In this article, our focus is to secure egress traffic from your Databricks workloads and to provide prescriptive guidance on the proposed deployment architecture. Along the way, we’ll also share best practices to secure ingress (user/client into Databricks) traffic.

Proposed Deployment Architecture

Deployment Architecture

Create a Databricks workspace on GCP with the following features:

  1. Customer-managed GCP VPC for workspace deployment
  2. Private Service Connect (PSC) for web application/APIs (frontend) and control plane (backend) traffic
    • User to web application / APIs
    • Data plane to control plane
  3. Traffic to Google services over Private Google Access
    • Customer-managed services (e.g. GCS, BQ)
    • Google Cloud Storage (GCS) for logs (health telemetry and audit) and Google Container Registry (GCR) for Databricks runtime images
  4. Databricks workspace (data plane) GCP project secured using VPC Service Controls (VPC SC)
  5. Customer-managed encryption keys
  6. Ingress control for Databricks workspace/APIs using an IP access list
  7. Traffic to external data sources filtered via VPC firewall [optional]
    • Egress to public package repo
    • Egress to Databricks managed Hive
  8. Databricks to GCP managed GKE control plane
    • Databricks control plane to GKE control plane (kube-apiserver) traffic over authorized network
    • Databricks data plane GKE cluster to GKE control plane over VPC peering
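The IP access list in the feature list above boils down to a membership test of the caller’s address against a set of allowed IPs and CIDR ranges. The following is a minimal sketch of that idea using only the Python standard library; it is a simplified model for reasoning about the allow list, not Databricks’ actual enforcement logic, and the function name and sample addresses (documentation-reserved ranges) are hypothetical.

```python
import ipaddress
from typing import Iterable


def is_ip_allowed(client_ip: str, allowed_entries: Iterable[str]) -> bool:
    """Return True if client_ip falls inside any allowed IP or CIDR entry."""
    ip = ipaddress.ip_address(client_ip)
    for entry in allowed_entries:
        # strict=False lets a bare address such as "203.0.113.7" parse as a /32 network
        network = ipaddress.ip_network(entry, strict=False)
        if ip.version == network.version and ip in network:
            return True
    return False


# Hypothetical allow list, mixing a single address and a CIDR range
ALLOWED = ["203.0.113.7", "198.51.100.0/24"]

print(is_ip_allowed("198.51.100.42", ALLOWED))  # address inside the /24 range
print(is_ip_allowed("192.0.2.1", ALLOWED))      # address outside every entry
```

A check like this can be handy when reviewing which corporate egress addresses need to be added to the list before users are locked out.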

Essential Reading

Before you begin, please ensure that you are familiar with these topics

Prerequisites

  • A Google Cloud account.
  • A Google Cloud project in the account.
  • A GCP VPC with three subnets precreated, see requirements here
  • A GCP IP range for GKE master resources
  • Use the Databricks Terraform provider 1.13.0 or higher. Always use the latest version of the provider.
  • A Databricks on Google Cloud account in the project.
  • A Google Account and a Google service account (GSA) with the required permissions.
    • To create a Databricks workspace, the required roles are explained here. As the GSA may provision additional resources beyond the Databricks workspace, for example, private DNS zone, A records, PSC endpoints etc., it is better to have a project owner role to avoid any permission-related issues.
  • On your local development machine, you must have:
    • The Terraform CLI: See Download Terraform on the website.
    • Terraform Google Cloud Provider: There are several options available here and here to configure authentication for the Google Provider. Databricks doesn’t have any preference in how Google Provider authentication is configured.

Remember

  • Both Shared VPC and standalone VPC are supported
  • The Google Terraform provider supports an OAuth2 access token to authenticate GCP API calls, and that is what we have used to configure authentication for the Google Terraform provider in this article.
    • The access tokens are short-lived (1 hour) and not auto-refreshed
  • The Databricks Terraform provider depends upon the Google Terraform provider to provision GCP resources
  • No changes, including resizing subnet IP address space or changing PSC endpoint configuration, are allowed post workspace creation.
  • If your Google Cloud organization policy has domain-restricted sharing enabled, please ensure that both the Google Cloud customer IDs for Databricks (C01p0oudw) and your own organization’s customer ID are in the policy’s allowed list. See the Google article Setting the organization policy. If you need help, contact your Databricks representative before provisioning the workspace.
  • Make sure that the service account used to create the Databricks workspace has the required roles and permissions.
  • If you have VPC SC enabled on your GCP projects, please update it per the ingress and egress rules listed here.
  • Understand the IP address space requirements; a quick reference table is available over here
  • Here’s a list of gcloud commands that you may find useful
  • Databricks does support global access settings in case you want the Databricks workspace (PSC endpoint) to be accessed by a resource running in a different region from where Databricks is.
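Because the OAuth2 access tokens noted above live for one hour and are not refreshed automatically, a longer provisioning workflow may want to check a token’s age before each Terraform run. The sketch below shows one way to track that; the class name and the five-minute safety margin are assumptions made for illustration, and obtaining the token itself (for example with `gcloud auth print-access-token`) is left outside the snippet.

```python
import time
from dataclasses import dataclass

TOKEN_LIFETIME_SECONDS = 3600  # 1 hour, per the note above
SAFETY_MARGIN_SECONDS = 300    # hypothetical margin; tune to your run times


@dataclass
class AccessToken:
    value: str
    issued_at: float  # Unix timestamp at which the token was minted

    def needs_refresh(self, now=None) -> bool:
        """True once the token is within the safety margin of expiring."""
        current = time.time() if now is None else now
        age = current - self.issued_at
        return age >= TOKEN_LIFETIME_SECONDS - SAFETY_MARGIN_SECONDS


# Placeholder token value; a real one would come from your auth tooling
token = AccessToken(value="example-token", issued_at=time.time())
if token.needs_refresh():
    print("Fetch a new access token before running Terraform")
else:
    print("Token is still fresh")
```

Checking before each `terraform plan` or `terraform apply` avoids a run failing partway through because the credential expired.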

Deployment Guide

There are several ways to implement the proposed deployment architecture:

  • Use the UI
  • Databricks Terraform Provider [recommended & used in this article]
  • Databricks REST APIs

Irrespective of the approach you use, the resource creation flow would look like this:

Deployment Guide

GCP resource and infrastructure setup

This is a prerequisite step. How the required infrastructure is provisioned, i.e. using Terraform, gcloud, or the GCP cloud console, is out of the scope of this article. Here’s a list of GCP resources required:

GCP Resource Type | Purpose | Details
Project | Create Databricks Workspace (ws) | Project requirements
Service Account | Used with Terraform to create ws | Databricks Required Role and Permission. In addition to this you may also need additional permissions depending upon the GCP resources you are provisioning.
VPC + Subnets | Three subnets per ws | Network requirements
Private Google Access (PGA) | Keeps traffic between the Databricks control plane VPC and the customer’s VPC private | Configure PGA
DNS for PGA | Private DNS zone for private APIs | DNS Setup
Private Service Connect Endpoints | Makes Databricks control plane services available over private IP addresses. Private endpoints need to reside in their own, separate subnet. | Endpoint creation
Encryption Key | Customer-managed encryption key used with Databricks | Cloud KMS-based key, supports auto key rotation. Key could be “software” or “HSM”, aka hardware-backed keys.
Google Cloud Storage Account for Audit Log Delivery | Storage for Databricks audit log delivery | Configure log delivery
Google Cloud Storage (GCS) Account for Unity Catalog | Root storage for Unity Catalog | Configure Unity Catalog storage account
Add or update VPC SC policy | Add Databricks-specific ingress and egress rules | Ingress & Egress yaml along with gcloud command to create a perimeter. Databricks project numbers and PSC attachment URIs available over here.
Add/Update Access Level using Access Context Manager | Add the Databricks regional control plane NAT IP to your access policy so that ingress traffic is only allowed from an allow-listed IP | List of Databricks regional control plane egress IPs available over here

Create Workspace

  • Clone Terraform scripts from here
    • To keep things simple, grant the project owner role to the GSA on the service and shared VPC project
  • Update *.vars files as per your environment setup
Variable | Details
google_service_account_email | [NAME]@[PROJECT].iam.gserviceaccount.com
google_project_name | PROJECT where the data plane will be created
google_region | E.g. us-central1, supported regions
databricks_account_id | Locate your account id
databricks_account_console_url | https://accounts.gcp.databricks.com
databricks_workspace_name | [ANY NAME]
databricks_admin_user | Provide at least one user email id. This user will be made workspace admin upon creation. This is a required field.
google_shared_vpc_project | PROJECT where the VPC used by the data plane is located. If you are not using Shared VPC then enter the same value as google_project_name
google_vpc_id | VPC ID
gke_node_subnet | NODE SUBNET name aka PRIMARY subnet
gke_pod_subnet | POD SUBNET name aka SECONDARY subnet
gke_service_subnet | SERVICE SUBNET name aka SECONDARY subnet
gke_master_ip_range | GKE control plane IP address range. Needs to be /28
cmek_resource_id | projects/[PROJECT]/locations/[LOCATION]/keyRings/[KEYRING]/cryptoKeys/[KEY]
google_pe_subnet | A dedicated subnet for private endpoints, recommended size /28. Please review the network topology options available before proceeding. For this deployment we are using the “Host Databricks users (clients) and the Databricks dataplane on the same network” option.
workspace_pe | Unique name e.g. frontend-pe
relay_pe | Unique name e.g. backend-pe
relay_service_attachment | List of regional service attachment URIs
workspace_service_attachment | List of regional service attachment URIs
private_zone_name | E.g. “databricks”
dns_name | gcp.databricks.com. (the trailing . is required)
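A few of the variables above carry format constraints that are easy to get wrong: the GKE control plane range must be a /28, the DNS name needs its trailing dot, and the CMEK resource ID follows a fixed path shape. The sketch below checks those constraints before Terraform is run. It is an illustrative helper, not part of the Terraform scripts; the function name and sample values are hypothetical, and it covers only the constraints stated in the table.

```python
import ipaddress
import re

# Resource path shape taken from the cmek_resource_id row of the table
CMEK_PATTERN = re.compile(
    r"^projects/[^/]+/locations/[^/]+/keyRings/[^/]+/cryptoKeys/[^/]+$"
)


def validate_workspace_vars(values):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []

    try:
        master_range = ipaddress.ip_network(values["gke_master_ip_range"])
        if master_range.prefixlen != 28:
            problems.append("gke_master_ip_range must be a /28")
    except ValueError:
        problems.append("gke_master_ip_range is not a valid CIDR block")

    if not values["dns_name"].endswith("."):
        problems.append("dns_name must end with a trailing dot")

    if not CMEK_PATTERN.match(values["cmek_resource_id"]):
        problems.append("cmek_resource_id does not match the expected path")

    if not values["google_service_account_email"].endswith(".iam.gserviceaccount.com"):
        problems.append("google_service_account_email is not a service account address")

    return problems


# Hypothetical values for illustration only
example = {
    "gke_master_ip_range": "10.3.0.0/28",
    "dns_name": "gcp.databricks.com.",
    "cmek_resource_id": "projects/my-project/locations/us-central1/keyRings/my-ring/cryptoKeys/my-key",
    "google_service_account_email": "terraform-sa@my-project.iam.gserviceaccount.com",
}
print(validate_workspace_vars(example))
```

Since subnet sizes and PSC settings cannot be changed after the workspace is created, catching a malformed value before `terraform apply` is cheaper than recreating the workspace.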

If you do not want to use the IP access list and would like to completely lock down workspace access (UI and APIs) outside of your corporate network, then you would need to:

  • Comment out the databricks_workspace_conf and databricks_ip_access_list resources in workspace.tf
  • Update the databricks_mws_private_access_settings resource’s public_access_enabled setting from true to false in workspace.tf
    • Please note that the public_access_enabled setting cannot be changed after the workspace is created
  • Make sure that Interconnect Attachments, aka vlanAttachments, are created so that traffic from on-premises networks can reach the GCP VPC (where private endpoints exist) over a dedicated interconnect connection.

Successful Deployment Check

Upon successful deployment, the Terraform output would look like this:

backend_end_psc_status = "Backend psc status: ACCEPTED"
front_end_psc_status = "Frontend psc status: ACCEPTED"
workspace_id = "workspace id: <UNIQUE-ID.N>"
ingress_firewall_enabled = "true"
ingress_firewall_ip_allowed = tolist([
"xx.xx.xx.xx",
"xx.xx.xx.xx/xx"
])
service_account = "Default SA attached to GKE nodes
[email protected]<PROJECT>.iam.gserviceaccount.com"
workspace_url = "https://<UNIQUE-ID.N>.gcp.databricks.com"
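The same check can be automated by reading the machine-readable form of the output, which `terraform output -json` emits as a map of output names to objects with a `value` field. The following is a minimal sketch that confirms both PSC endpoints report ACCEPTED; the function name and the sample JSON are illustrative, and the output keys are the ones shown above.

```python
import json

PSC_OUTPUT_KEYS = ("backend_end_psc_status", "front_end_psc_status")


def psc_endpoints_accepted(terraform_output_json):
    """Return True only if every PSC status output ends with ACCEPTED."""
    outputs = json.loads(terraform_output_json)
    for key in PSC_OUTPUT_KEYS:
        value = outputs.get(key, {}).get("value", "")
        if not str(value).strip().endswith("ACCEPTED"):
            return False
    return True


# Sample shaped like `terraform output -json`; values are illustrative
sample = json.dumps({
    "backend_end_psc_status": {"value": "Backend psc status: ACCEPTED"},
    "front_end_psc_status": {"value": "Frontend psc status: ACCEPTED"},
})
print(psc_endpoints_accepted(sample))
```

In a pipeline, the JSON string would come from capturing the command’s standard output, and a False result could fail the deployment step early.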

Post Workspace Creation

  • Validate that DNS records are created; follow this doc to understand the required A records.
  • Configure Unity Catalog (UC)
  • Assign Workspace to UC
  • Add users/groups to workspace via UC Identity Federation
  • Auto provision users/groups from your Identity Providers
  • Configure Audit Log Delivery
  • If you are not using UC and would like to use Databricks managed Hive, then add an egress firewall rule to your VPC as explained here
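For the DNS validation step above, one quick sanity check is to resolve the workspace hostname from a machine inside the VPC and confirm that it maps to a private address rather than a public one. The sketch below assumes the PSC endpoint was given a private (RFC 1918) address in your subnet, which is an assumption about your setup; the function name is hypothetical, and the resolver is injectable so the logic can be exercised without network access.

```python
import ipaddress
import socket


def resolves_to_private_ip(hostname, resolver=socket.gethostbyname):
    """Resolve hostname and report whether the answer is a private address."""
    address = resolver(hostname)
    return ipaddress.ip_address(address).is_private


# Demonstration with stub resolvers standing in for real DNS answers
print(resolves_to_private_ip("workspace.example", resolver=lambda _: "10.10.0.5"))
print(resolves_to_private_ip("workspace.example", resolver=lambda _: "8.8.8.8"))
```

To run it against a real deployment, call `resolves_to_private_ip` with your workspace hostname (the host part of `workspace_url`) and the default resolver from a VM attached to the VPC.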

Getting Started with Data Exfiltration Protection with Databricks on Google Cloud

We discussed using cloud-native security controls to implement data exfiltration protection for your Databricks on GCP deployments, all of which can be automated to enable data teams at scale. Some other things that you may want to consider and implement as part of this project are:
