Migrate from Google BigQuery to Amazon Redshift using AWS Glue and Custom Auto Loader Framework


Amazon Redshift is a widely used, fully managed, petabyte-scale cloud data warehouse. Tens of thousands of customers use Amazon Redshift to process exabytes of data every day to power their analytics workloads. Customers are looking for tools that make it easier to migrate from other data warehouses, such as Google BigQuery, to Amazon Redshift to take advantage of the service's price-performance, ease of use, security, and reliability.

In this post, we show you how to use AWS native services to accelerate your migration from Google BigQuery to Amazon Redshift. We use AWS Glue, a fully managed, serverless ETL (extract, transform, and load) service, and the Google BigQuery Connector for AWS Glue (for more information, refer to Migrating data from Google BigQuery to Amazon S3 using AWS Glue custom connectors). We also add automation and flexibility to simplify migration of multiple tables to Amazon Redshift using the Custom Auto Loader Framework.

Solution overview

The solution provides a scalable and managed data migration workflow to migrate data from Google BigQuery to Amazon Simple Storage Service (Amazon S3), and then from Amazon S3 to Amazon Redshift. This pre-built solution scales to load data in parallel using input parameters.

The following architecture diagram shows how the solution works. It starts with setting up the migration configuration to connect to Google BigQuery, then converts the database schemas, and finally migrates the data to Amazon Redshift.

Architecture diagram of the solution.

The workflow contains the following steps:

  1. A configuration file is uploaded to an S3 bucket you have chosen for this solution. This JSON file contains the migration metadata, namely the following:
    • A list of Google BigQuery projects and datasets.
    • A list of all tables to be migrated for each project and dataset pair.
  2. An Amazon EventBridge rule triggers an AWS Step Functions state machine to start migrating the tables.
  3. The Step Functions state machine iterates on the tables to be migrated and runs an AWS Glue Python shell job to extract the metadata from Google BigQuery and store it in an Amazon DynamoDB table used for tracking the tables' migration status (a sketch of this tracking step follows the list).
  4. The state machine iterates on the metadata from this DynamoDB table to run the table migrations in parallel, based on the maximum number of migration jobs, without incurring limits or quotas on Google BigQuery. It performs the following steps:
    • Runs the AWS Glue migration job for each table in parallel.
    • Tracks the run status in the DynamoDB table.
    • After the tables have been migrated, checks for errors and exits.
  5. The data exported from Google BigQuery is saved to Amazon S3. We use Amazon S3 (even though AWS Glue jobs can write directly to Amazon Redshift tables) for a few specific reasons:
    • We can decouple the data migration and the data load steps.
    • It offers more control over the load steps, with the ability to reload the data or pause the process.
    • It provides fine-grained monitoring of the Amazon Redshift load status.
  6. The Custom Auto Loader Framework automatically creates schemas and tables in the target database and continuously loads data from Amazon S3 to Amazon Redshift.
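
The state machine and its job scripts are provisioned for you by the CloudFormation stack described later, so you don't write this code yourself. Purely to illustrate step 3, the following minimal sketch shows what recording a table's metadata and status in the DynamoDB tracking table (bq_to_s3_tracking) could look like; the attribute names and S3 path layout are assumptions, not the exact schema the stack creates.

    import boto3

    # Minimal sketch: record per-table migration metadata and status in the
    # DynamoDB tracking table (bq_to_s3_tracking). Attribute names and the S3
    # path layout are illustrative assumptions, not the stack's actual schema.
    dynamodb = boto3.resource("dynamodb")
    tracking_table = dynamodb.Table("bq_to_s3_tracking")

    def record_table_status(project, dataset, table, size_bytes, status):
        tracking_table.put_item(
            Item={
                "table_id": f"{project}.{dataset}.{table}",  # assumed partition key
                "size_bytes": size_bytes,
                "s3_path": f"s3://YOUR-BUCKET-NAME/s3-redshift-loader-source/{dataset}/{table}/",
                "migration_status": status,  # e.g. PENDING, RUNNING, DONE, FAILED
            }
        )

    record_table_status("YOUR-GCP-PROJECT-NAME", "YOUR-DATASET-NAME",
                        "YOUR-TABLE-NAME-1", 123456789, "PENDING")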

A few additional points to note:

  • If you have already created the target schema and tables in the Amazon Redshift database, you can configure the Custom Auto Loader Framework so that it doesn't automatically detect and convert the schema.
  • If you want more control over converting the Google BigQuery schema, you can use the AWS Schema Conversion Tool (AWS SCT). For more information, refer to Migrate Google BigQuery to Amazon Redshift using AWS Schema Conversion Tool (SCT).
  • As of this writing, neither AWS SCT nor the Custom Auto Loader Framework supports the conversion of nested data types (record, array, and struct). Amazon Redshift supports semistructured data through the SUPER data type, so if your table uses such complex data types, you need to create the target tables manually, as illustrated in the sketch after this list.
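
For example, the following minimal sketch creates such a target table manually, keeping the nested column in a SUPER column, through the Amazon Redshift Data API. The cluster, database, user, table, and column names are illustrative assumptions; adapt them, along with distribution and sort keys, to your own schema.

    import boto3

    # Minimal sketch: manually create a target table that stores a nested
    # BigQuery column (record/array/struct) in a SUPER column.
    # Cluster, database, user, table, and column names are assumptions.
    redshift_data = boto3.client("redshift-data")

    create_sql = """
    CREATE TABLE IF NOT EXISTS public.orders (
        order_id   BIGINT,
        order_date DATE,
        line_items SUPER  -- holds the nested record/array from BigQuery
    );
    """

    redshift_data.execute_statement(
        ClusterIdentifier="democluster",
        Database="dev",
        DbUser="demo",
        Sql=create_sql,
    )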

To deploy the solution, there are two main steps:

  1. Deploy the solution stack using AWS CloudFormation.
  2. Deploy and configure the Custom Auto Loader Framework to load files from Amazon S3 to Amazon Redshift.

Prerequisites

Before getting started, make sure you have the following:

  • An account in Google Cloud, specifically a service account that has permissions to Google BigQuery.
  • An AWS account with an AWS Identity and Access Management (IAM) user with an access key and secret key to configure the AWS Command Line Interface (AWS CLI). The IAM user also needs permissions to create an IAM role and policies.
  • An Amazon Redshift cluster or Amazon Redshift Serverless workgroup. If you don't have one, refer to Amazon Redshift Serverless.
  • An S3 bucket. If you don't want to use one of your existing buckets, you can create a new one. Note the name of the bucket to use later when you pass it to the CloudFormation stack as an input parameter.
  • A configuration file with the list of tables to be migrated. This file should have the following structure:
    {
      "migrationlist1": [
        {
          "project": "YOUR-GCP-PROJECT-NAME",
          "dataset": "YOUR-DATASET-NAME",
          "table": [
            "YOUR-TABLE-NAME-1",
            "YOUR-TABLE-NAME-2",
            …
            "YOUR-TABLE-NAME-N"
          ]
        }
      ]
    }
    

    Alternatively, you can download the demo file, which uses the open dataset created by the Centers for Medicare & Medicaid Services.

In this example, we named the file bq-mig-config.json.

    1. Configure your Google account. (A sketch for storing the service account credential in AWS Secrets Manager follows these steps.)
    2. Create an IAM role for AWS Glue (and note down the name of the IAM role).
    3. Subscribe to and activate the Google BigQuery Connector for AWS Glue.
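
The Google BigQuery credential is later passed to the solution through an AWS Secrets Manager secret (the InputBQSecret stack parameter). The following minimal sketch stores a service account key file as such a secret; the secret name and payload format are assumptions here, so check the connector's documentation for the exact format it expects.

    import base64
    import json
    import boto3

    # Minimal sketch: store the Google service account key in AWS Secrets Manager
    # so the AWS Glue connection can authenticate to BigQuery. The secret name and
    # payload format ("credentials" holding the base64-encoded key file) are
    # assumptions; confirm the format the connector expects.
    secretsmanager = boto3.client("secretsmanager")

    with open("service-account-key.json", "rb") as f:
        encoded_key = base64.b64encode(f.read()).decode("utf-8")

    secretsmanager.create_secret(
        Name="bq-mig-google-credential",  # pass this name as InputBQSecret later
        SecretString=json.dumps({"credentials": encoded_key}),
    )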

Deploy the solution using AWS CloudFormation

To deploy the solution stack using AWS CloudFormation, complete the following steps:

  1. Choose Launch Stack:

This template provisions the AWS resources in the us-east-1 Region. If you want to deploy to a different Region, download the template bigquery-cft.yaml and launch it manually: on the AWS CloudFormation console, choose Create stack with new resources and upload the template file you downloaded.

The list of provisioned resources is as follows:

    • An EventBridge rule to start the Step Functions state machine on the upload of the configuration file.
    • A Step Functions state machine that runs the migration logic. The following diagram illustrates the state machine.
      Diagram representing the state machine deployed by the solution stack.
    • An AWS Glue Python shell job used to extract the metadata from Google BigQuery. The metadata is stored in a DynamoDB table, with a calculated attribute to prioritize the migration jobs. By default, the connector creates one partition per 400 MB in the table being read (before filtering). As of this writing, the Google BigQuery Storage API has a maximum quota for parallel read streams, so we limit the number of worker nodes for tables larger than 400 GB. We also calculate the maximum number of jobs that can run in parallel based on these values (a rough sketch of this calculation follows the list).
    • An AWS Glue ETL job used to extract the data from each Google BigQuery table and save it to Amazon S3 in Parquet format.
    • A DynamoDB table (bq_to_s3_tracking) used to store the metadata for each table to be migrated (the size of the table, the S3 path used to store the migrated data, and the number of workers needed to migrate the table).
    • A DynamoDB table (bq_to_s3_maxstreams) used to store the maximum number of streams per state machine run. This helps us minimize job failures due to limits or quotas. Use the CloudFormation template to customize the names of the DynamoDB tables; the default prefix is bq_to_s3.
    • The IAM roles needed by the state machine and AWS Glue jobs.
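
The exact logic lives in the stack's job scripts; the following is only a rough sketch of how the per-table read streams and the number of concurrent migration jobs could be derived from the values above. All constants are illustrative assumptions.

    import math

    # Rough sketch of the parallelism calculation described above. All constants
    # are illustrative assumptions, not the values used by the stack's scripts.
    PARTITION_SIZE_BYTES = 400 * 1024**2  # connector default: one partition per 400 MB
    MAX_STREAMS_PER_TABLE = 1000          # assumed cap, reached by tables larger than ~400 GB
    TOTAL_STREAM_QUOTA = 2000             # assumed BigQuery Storage API read-stream quota

    def streams_for_table(table_size_bytes: int) -> int:
        """Read streams (and Glue workers) to request for a single table."""
        partitions = max(1, math.ceil(table_size_bytes / PARTITION_SIZE_BYTES))
        return min(partitions, MAX_STREAMS_PER_TABLE)

    def max_parallel_jobs(table_sizes_bytes, requested_parallelism: int) -> int:
        """Cap concurrent table migrations so total read streams stay within quota."""
        widest_job = max(streams_for_table(size) for size in table_sizes_bytes)
        return max(1, min(requested_parallelism, TOTAL_STREAM_QUOTA // widest_job))

    # Example: a 50 GB table and an 800 GB table, with InputMaxParallelism = 30
    print(max_parallel_jobs([50 * 1024**3, 800 * 1024**3], requested_parallelism=30))
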
  2. Choose Next.

Screenshot of the AWS CloudFormation Create stack page.

  3. For Stack name, enter a name.
  4. For Parameters, enter the parameters listed below, then choose Create. (If you prefer to deploy programmatically, a sketch follows the list.)
    • InputBucketName (S3 bucket name): The S3 bucket where the AWS Glue job stores the migrated data. The data is actually stored in a folder named s3-redshift-loader-source, which is used by the Custom Auto Loader Framework.
    • InputConnectionName (AWS Glue connection name; the default is glue-bq-connector-24): The name of the AWS Glue connection that is created using the Google BigQuery connector.
    • InputDynamoDBTablePrefix (DynamoDB table name prefix; the default is bq_to_s3): The prefix used when naming the two DynamoDB tables created by the solution.
    • InputGlueETLJob (AWS Glue ETL job name; the default is bq-migration-ETL): The name you want to give the AWS Glue ETL job. The actual script is saved in the S3 path specified in the parameter InputGlueS3Path.
    • InputGlueMetaJob (AWS Glue Python shell job name; the default is bq-get-metadata): The name you want to give the AWS Glue Python shell job. The actual script is saved in the S3 path specified in the parameter InputGlueS3Path.
    • InputGlueS3Path (S3 path; the default is s3://aws-glue-scripts-${AWS::Account}-${AWS::Region}/admin/): The S3 path to which the stack copies the scripts for the AWS Glue jobs. Remember to replace ${AWS::Account} with the actual AWS account ID and ${AWS::Region} with the Region you plan to use, or provide your own bucket and prefix in a complete path.
    • InputMaxParallelism (number of parallel migration jobs to run; the default is 30): The maximum number of tables you want to migrate concurrently.
    • InputBQSecret (AWS Secrets Manager secret name): The name of the AWS Secrets Manager secret in which you stored the Google BigQuery credential.
    • InputBQProjectName (Google BigQuery project name): The name of your project in Google BigQuery in which you want to store temporary tables; you need write permissions on the project.
    • StateMachineName (Step Functions state machine name; the default is bq-to-s3-migration-state-machine): The name of the Step Functions state machine.
    • SourceS3BucketName (S3 bucket name; the default is aws-blogs-artifacts-public): The S3 bucket where the artifacts for this post are stored. Don't change the default.
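
If you prefer to script the deployment instead of using the Launch Stack button, the following minimal sketch creates the stack with boto3, assuming you downloaded bigquery-cft.yaml locally. Only a subset of parameters is shown, with example values from the list above; the rest fall back to the template defaults.

    import boto3

    # Minimal sketch: create the migration stack programmatically. Assumes
    # bigquery-cft.yaml is in the working directory and the parameter values
    # below match your environment.
    cloudformation = boto3.client("cloudformation", region_name="us-east-1")

    with open("bigquery-cft.yaml") as f:
        template_body = f.read()

    cloudformation.create_stack(
        StackName="bq-to-redshift-migration",
        TemplateBody=template_body,
        Capabilities=["CAPABILITY_NAMED_IAM"],  # the stack creates IAM roles
        Parameters=[
            {"ParameterKey": "InputBucketName", "ParameterValue": "YOUR-BUCKET-NAME"},
            {"ParameterKey": "InputBQSecret", "ParameterValue": "bq-mig-google-credential"},
            {"ParameterKey": "InputBQProjectName", "ParameterValue": "YOUR-GCP-PROJECT-NAME"},
            {"ParameterKey": "InputMaxParallelism", "ParameterValue": "30"},
        ],
    )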

Deploy and configure the Custom Auto Loader Framework to load files from Amazon S3 to Amazon Redshift

The Custom Auto Loader Framework utility makes data ingestion to Amazon Redshift simpler by automatically loading data files from Amazon S3 to Amazon Redshift. The files are mapped to their respective tables by simply dropping files into preconfigured locations on Amazon S3. For more details about the architecture and internal workflow, refer to Custom Auto Loader Framework.

To set up the Custom Auto Loader Framework, complete the following steps:

  1. Choose Launch Stack to deploy the CloudFormation stack in the us-east-1 Region:

  2. On the AWS CloudFormation console, choose Next.
  3. Provide the following parameters to help ensure the successful creation of resources; make sure you have collected these values beforehand. (An example of the resulting COPY command is sketched after the list.)
    • CopyCommandSchedule (for example, cron(0/5 * ? * * *)): The schedule on which the EventBridge rules KickoffFileProcessingSchedule and QueueRSProcessingSchedule are triggered. The default is every 5 minutes.
    • DatabaseName (for example, dev): The Amazon Redshift database name.
    • DatabaseSchemaName (for example, public): The Amazon Redshift schema name.
    • DatabaseUserName (for example, demo): The Amazon Redshift user who has permissions to run COPY commands on the Amazon Redshift database and schema.
    • RedshiftClusterIdentifier (for example, democluster): The Amazon Redshift cluster name.
    • RedshiftIAMRoleARN (for example, arn:aws:iam::7000000000:role/RedshiftDemoRole): The IAM role attached to the Amazon Redshift cluster, which has access to the S3 bucket. This role is used in COPY commands.
    • SourceS3Bucket (your bucket name): The S3 bucket where the data is located. Use the same bucket you used to store the migrated data, as indicated in the previous stack.
    • CopyCommandOptions (for example, delimiter '|' gzip): The additional COPY command data format parameters, such as delimiter '|' dateformat 'auto' TIMEFORMAT 'auto'.
    • InitiateSchemaDetection (Yes): The setting to dynamically detect the schema prior to file upload.
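
For reference, the following minimal sketch shows the kind of COPY command these settings translate into, issued here through the Amazon Redshift Data API with the example values above. The table name and the S3 prefix layout under s3-redshift-loader-source are assumptions; the framework builds and schedules the actual commands itself.

    import boto3

    # Minimal sketch: a COPY command built from the example parameter values.
    # The table name and S3 prefix layout are illustrative assumptions.
    redshift_data = boto3.client("redshift-data")

    copy_sql = """
    COPY public.your_table
    FROM 's3://Your-bucket-name/s3-redshift-loader-source/your_table/'
    IAM_ROLE 'arn:aws:iam::7000000000:role/RedshiftDemoRole'
    delimiter '|' dateformat 'auto' TIMEFORMAT 'auto';
    """

    redshift_data.execute_statement(
        ClusterIdentifier="democluster",
        Database="dev",
        DbUser="demo",
        Sql=copy_sql,
    )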

The following screenshot shows an example of our parameters.

Screenshot of the stack details page with the input parameters filled in with example values.

  4. Choose Create.
  5. Monitor the progress of the stack creation and wait until it is complete.
  6. To verify the Custom Auto Loader Framework configuration, log in to the Amazon S3 console and navigate to the S3 bucket you provided as the value of the SourceS3Bucket parameter.

You should see that a new directory called s3-redshift-loader-source has been created (you can also verify this from a script, as sketched below).

Screenshot of the Amazon S3 console showing the folder you should see in your S3 bucket.
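
If you prefer to verify this from a script rather than the console, a quick check like the following also works; replace YOUR-BUCKET-NAME with the bucket you passed as SourceS3Bucket.

    import boto3

    # Minimal sketch: check that the s3-redshift-loader-source prefix exists.
    s3 = boto3.client("s3")

    response = s3.list_objects_v2(
        Bucket="YOUR-BUCKET-NAME",
        Prefix="s3-redshift-loader-source/",
        MaxKeys=1,
    )
    print("Prefix exists:", response["KeyCount"] > 0)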

Test the solution

To test the solution, complete the following steps:

  1. Create the configuration file based on the prerequisites. You can also download the demo file.
  2. To set up the S3 bucket, on the Amazon S3 console, navigate to the folder bq-mig-config in the bucket you provided in the stack.
  3. Upload the config file into it. (Steps 2 and 3 can also be scripted, as sketched after this list.)
  4. To enable EventBridge notifications on the bucket, open the bucket on the console and, on the Properties tab, locate Event notifications.
  5. In the Amazon EventBridge section, choose Edit.
  6. Select On, then choose Save changes.
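
Steps 2 and 3 can also be scripted. The following minimal sketch uploads bq-mig-config.json to the bq-mig-config folder of the bucket you provided to the solution stack (YOUR-BUCKET-NAME is a placeholder); once EventBridge notifications are enabled on the bucket, this upload starts the state machine.

    import boto3

    # Minimal sketch: upload the migration configuration file to the
    # bq-mig-config folder of the bucket passed as InputBucketName.
    s3 = boto3.client("s3")

    s3.upload_file(
        Filename="bq-mig-config.json",
        Bucket="YOUR-BUCKET-NAME",
        Key="bq-mig-config/bq-mig-config.json",
    )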

  7. On the AWS Step Functions console, monitor the run of the state machine.
  8. Monitor the status of the loads in Amazon Redshift. For instructions, refer to Viewing Current Loads.
  9. Open the Amazon Redshift Query Editor V2 and query your data.

Pricing considerations

You might have egress charges for migrating data out of Google BigQuery into Amazon S3. Review and calculate the cost of moving your data using your Google Cloud billing console. As of this writing, AWS Glue 3.0 or later charges $0.44 per DPU-hour, billed per second, with a 1-minute minimum for Spark ETL jobs. For more information, see AWS Glue Pricing. With auto scaling enabled, AWS Glue automatically adds and removes workers from the cluster depending on the parallelism at each stage or micro-batch of the job run.
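
As a purely illustrative back-of-the-envelope example (your DPU counts and run times will differ), you can estimate the AWS Glue portion of the cost as follows.

    # Illustrative only: estimate the AWS Glue cost of one migration job.
    DPU_HOUR_RATE = 0.44            # USD per DPU-hour for AWS Glue 3.0+ Spark ETL jobs
    dpus, runtime_minutes = 10, 8   # assumed job size and duration

    cost = dpus * (runtime_minutes / 60) * DPU_HOUR_RATE
    print(f"Estimated AWS Glue cost: ${cost:.2f}")  # about $0.59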

Clean up

To avoid incurring future charges, clean up your resources:

  1. Delete the CloudFormation solution stack.
  2. Delete the CloudFormation Custom Auto Loader Framework stack.

Conclusion

In this post, we demonstrated how to build a scalable and automated data pipeline to migrate your data from Google BigQuery to Amazon Redshift. We also highlighted how the Custom Auto Loader Framework can automate schema detection, create tables for your S3 files, and continuously load the files into your Amazon Redshift warehouse. With this approach, you can automate the migration of entire projects (even multiple projects at a time) from Google BigQuery to Amazon Redshift. This helps significantly reduce data migration times into Amazon Redshift through the automated parallelization of table migrations.

The auto-copy feature in Amazon Redshift simplifies automatic data loading from Amazon S3 with a simple SQL command. Users can easily automate data ingestion from Amazon S3 to Amazon Redshift using the Amazon Redshift auto-copy preview feature.

For more information about the performance of the Google BigQuery Connector for AWS Glue, refer to Migrate terabytes of data quickly from Google Cloud to Amazon S3 with AWS Glue Connector for Google BigQuery and learn how a large amount of data (1.9 TB) can be migrated into Amazon S3 quickly (in about 8 minutes).

To learn more about AWS Glue ETL jobs, see Simplify data pipelines with AWS Glue automatic code generation and workflows and Making ETL easier with AWS Glue Studio.


About the Authors

Tahir Aziz is an Analytics Solution Architect at AWS. He has worked on building data warehouses and big data solutions for over 13 years. He loves to help customers design end-to-end analytics solutions on AWS. Outside of work, he enjoys traveling and cooking.

Ritesh Kumar Sinha is an Analytics Specialist Solutions Architect based out of San Francisco. He has helped customers build scalable data warehousing and big data solutions for over 16 years. He loves to design and build efficient end-to-end solutions on AWS. In his spare time, he loves reading, walking, and doing yoga.

Fabrizio Napolitano is a Principal Specialist Solutions Architect for DB and Analytics. He has worked in the analytics space for the last 20 years, and has recently and quite unexpectedly become a Hockey Dad after moving to Canada.

Manjula Nagineni is a Senior Solutions Architect with AWS based in New York. She works with major financial services institutions, architecting and modernizing their large-scale applications while adopting AWS Cloud services. She is passionate about designing big data workloads cloud-natively. She has over 20 years of IT experience in software development, analytics, and architecture across multiple domains such as finance, retail, and telecom.

Sohaib Katariwala is an Analytics Specialist Solutions Architect at AWS. He has over 12 years of experience helping organizations derive insights from their data.
