In today’s fast-moving data landscape, businesses are continually exploring ways to maximize the value derived from their data. Recently we’ve been having numerous conversations with executives interested in a Redshift to Databricks migration. This transition is often driven by a forward-looking strategy to embrace a more flexible, scalable, and AI-ready data architecture.
It’s important to note that both Databricks and Redshift are powerful data platforms, and here at Qubika we have clients working with both. The right choice for an enterprise will depend on its specific requirements. So this article doesn’t set out to argue which platform is better; rather, it explains how a migration can take place and shares best practices should the decision be made – and our experts at Qubika can of course provide guidance here.
An overview of a typical Redshift to Databricks migration process
The journey of a Redshift to Databricks migration isn’t one-size-fits-all. The specific approach will depend on several key factors:
- Existing architectural landscape. This includes your reliance on other AWS services, existing third-party tools, and the open-source technologies already integrated into your data ecosystem.
- Variety of workloads. Whether your focus is primarily on ETL processes, Business Intelligence (BI), Machine Learning (ML), or a mix of these, each workload type requires a distinct approach and resource allocation.
- Criticality of use cases. Migrating highly critical systems demands extra care, meticulous planning, and rigorous testing to ensure seamless continuity.
- Ongoing initiatives. Existing projects and their delivery timelines need to be carefully considered to avoid conflicts and ensure a smooth transition without disrupting work that is already in flight.
- Migration objectives. Your primary objectives – be it cost reduction, meeting strict cutover deadlines, or managing user change effectively – will heavily influence the chosen migration strategy and execution plan.
A successful Redshift to Databricks migration typically follows a structured, phased approach, closely aligned with Databricks’ best practices, to ensure a smooth and efficient transition:
- Migration Discovery and Assessment: This crucial initial phase involves a deep dive into your existing Redshift environment. It encompasses a thorough analysis of current workloads, a detailed inventory of your data assets and schemas, mapping all dependencies, and benchmarking current performance to establish clear baselines and understand the migration scope. Automation tools such as Databricks Amazon Redshift Profiler can help gather all the relevant information.
- Architecture and Feature Mapping Workshop: Following discovery, a collaborative workshop focuses on designing your new Databricks Lakehouse architecture. This includes defining Delta Lake table structures, partitioning strategies, and robust data governance with Unity Catalog. It’s at this stage you map key Redshift features and optimizations to their Databricks equivalents (for example, Amazon Redshift tables map to Delta tables in Databricks), while also determining the optimal compute strategy and data ingestion patterns for the new environment.
- Data Migration: This phase is dedicated to moving your data from Redshift to Databricks. Databricks recommends 1) migrating your enterprise data warehouse (EDW) tables into the Delta Lake medallion architecture; 2) migrating or building the data pipelines that populate the Bronze, Silver, and Gold layers in Delta Lake; and 3) backfilling the Bronze, Silver, and Gold tables as needed. A minimal sketch of copying a Redshift table into a Bronze Delta table follows this list.
- Data Pipeline Migration: Here, your existing ETL/ELT pipelines and associated business logic are refactored and moved to Databricks. This includes translating Redshift-specific SQL (DDL, DML, views, stored procedures) into optimized Databricks SQL or PySpark code. New data pipelines are built using Databricks capabilities like Delta Live Tables (DLT) for declarative development, ensuring efficient and scalable data transformations – see the DLT sketch after this list.
- Downstream Tools Integration: The final technical phase focuses on re-establishing connections for all downstream consumers. This involves integrating existing Business Intelligence (BI) tools (e.g., Tableau, Power BI), data science platforms, and custom applications with Databricks SQL endpoints and Delta Lake tables. Rigorous data validation and user acceptance testing (UAT) are performed to ensure all reports, dashboards, and applications function seamlessly and accurately on the new platform.
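To make the Data Migration phase more concrete, below is a minimal PySpark sketch of copying a single Redshift table into a Bronze Delta table. It assumes the Databricks Redshift connector is available in your runtime; the JDBC URL, credentials, S3 staging directory, IAM role, and table names are placeholders to adapt to your environment.

```python
# Minimal sketch: copy one Redshift table into a Bronze Delta table on Databricks.
# All connection details and table names below are placeholders.

redshift_df = (
    spark.read.format("redshift")
    .option("url", "jdbc:redshift://<cluster-endpoint>:5439/<database>")    # Redshift JDBC endpoint
    .option("dbtable", "public.orders")                                     # source table in Redshift
    .option("tempdir", "s3a://<bucket>/redshift-unload/")                   # S3 staging area used by UNLOAD
    .option("aws_iam_role", "arn:aws:iam::<account>:role/<redshift-role>")  # role with access to the staging bucket
    .option("user", "<redshift-user>")
    .option("password", dbutils.secrets.get("<scope>", "<key>"))            # keep credentials in a secret scope
    .load()
)

(
    redshift_df.write.format("delta")
    .mode("overwrite")
    .saveAsTable("landing.sales.orders_bronze")  # placeholder Unity Catalog three-level name
)
```

The same pattern is typically repeated per table (or automated over the inventory produced during discovery), with incremental loads handled by the pipelines built in the next phase.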
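Similarly, for the Data Pipeline Migration phase, here is a minimal Delta Live Tables (DLT) sketch of how logic that previously lived in a Redshift view or stored procedure might be re-expressed declaratively in Python. Table and column names are illustrative, and the Bronze table is assumed to be defined elsewhere in the same pipeline.

```python
# Minimal DLT sketch: a transformation formerly implemented as a Redshift view or
# stored procedure, rewritten as a declarative Silver table. Names are illustrative.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Cleaned orders, previously produced by a Redshift stored procedure")
def orders_silver():
    return (
        dlt.read("orders_bronze")                       # upstream Bronze table defined in the same pipeline
        .where(F.col("order_status").isNotNull())       # basic data-quality filter
        .withColumn("order_date", F.to_date("order_ts"))
    )
```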
To dive into more detail, we recommend reading Databricks’ detailed guide and documentation about the steps for a migration.
Case Study: A Redshift to Databricks migration to power analytics for a global entertainment leader
Qubika partnered with one of the world’s largest entertainment, sport, and gaming companies on their Redshift to Databricks migration journey. They wanted to move away from their existing Redshift data warehouse solution and embrace Databricks while at the same time instilling robust data engineering best practices.
The company collects vast amounts of user activity data from hundreds of events across multiple verticals. Data sources are equally varied: data is loaded from flat files and SQL sources in batch, or streamed through Kafka, reaching volumes of terabytes daily. New data sources are continuously integrated to fuel further analysis and provide richer insights.
Our engagement focused on moving data from a wide variety of sources into their Landing Catalog (Bronze layer) and then into the Foundation Catalog (Silver layer, involving light transformations) using Databricks and PySpark. We leveraged pre-built libraries and orchestrated these processes through Airflow: batch loads run on a schedule, while streaming sources are handled by a continuous Databricks job that ingests data from Kafka. A simplified sketch of that streaming pattern follows.
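As a simplified illustration of that streaming pattern (not the client’s actual code), the sketch below reads events from Kafka with Structured Streaming and appends the raw payload to a Landing (Bronze) table. Broker addresses, the topic, the checkpoint location, and the table name are placeholders.

```python
# Simplified sketch: continuously ingest raw Kafka events into a Bronze (Landing) table.
# Brokers, topic, checkpoint path, and table name are placeholders.
from pyspark.sql import functions as F

bronze_stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "<broker-1>:9092,<broker-2>:9092")
    .option("subscribe", "user-activity-events")
    .option("startingOffsets", "latest")
    .load()
    .select(
        F.col("key").cast("string"),
        F.col("value").cast("string").alias("raw_payload"),   # keep the raw event untouched in Bronze
        F.col("timestamp").alias("ingested_at"),
    )
)

(
    bronze_stream.writeStream
    .option("checkpointLocation", "s3a://<bucket>/checkpoints/user_activity_bronze/")
    .toTable("landing.events.user_activity_bronze")           # placeholder Unity Catalog table
)
```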
A key technical challenge involved creating efficient and scalable code for these Bronze and Silver layer migrations. Additionally, our team assisted with the ingestion and configuration of new data sources, as well as developing the necessary code for light transformations. The solution leverages Databricks Jobs and Databricks Unity Catalog for robust data management and governance.
The business goal of this Redshift to Databricks migration was twofold: more scalable analytics and greater uniformity across data models. By completing this critical migration, the company will unlock significantly greater scalability for its analytical capabilities and a more consistent set of data models.
3 best practices for a Redshift to Databricks migration
Based on our experience handling Redshift to Databricks migrations, we’ve defined 3 core best practices:
- Embrace Databricks Unity Catalog early. Implement Databricks Unity Catalog from the initial stages of your migration. This establishes robust data governance, fine-grained access control, and comprehensive data lineage, providing a secure and well-managed data environment from day one (a small governance example follows this list).
- Maximize automation. When possible, leverage available tools and scripts for schema conversion, data ingestion (e.g., Auto Loader), and code translation. Automating these processes accelerates the Redshift to Databricks migration and significantly reduces manual effort and potential errors (see the Auto Loader sketch after this list).
- Focus on data quality and validation. Implement comprehensive and continuous data validation checks throughout the migration process. Ensuring data accuracy is paramount for building trust in the new platform and supporting confident business decisions.
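To illustrate the first practice, the snippet below shows how fine-grained access might be granted on a schema governed by Unity Catalog. The catalog, schema, and group names are placeholders; your own catalogs and principals will differ.

```python
# Illustrative only: grant a group read access to one schema under Unity Catalog.
# Catalog, schema, and group names are placeholders.
spark.sql("GRANT USE CATALOG ON CATALOG landing TO `data-analysts`")
spark.sql("GRANT USE SCHEMA, SELECT ON SCHEMA landing.sales TO `data-analysts`")
```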
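And to illustrate the automation practice, here is a minimal Auto Loader sketch for incrementally ingesting flat files into a Bronze table. The file format, paths, and table name are assumptions to adapt to your environment.

```python
# Minimal Auto Loader sketch: incrementally pick up new flat files and append them
# to a Bronze table. Format, paths, and table name are placeholders.
bronze_files = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")                                      # e.g. CSV flat files
    .option("cloudFiles.schemaLocation", "s3a://<bucket>/schemas/orders/")   # where inferred schemas are tracked
    .load("s3a://<bucket>/landing/orders/")                                  # incoming files
)

(
    bronze_files.writeStream
    .option("checkpointLocation", "s3a://<bucket>/checkpoints/orders_files_bronze/")
    .toTable("landing.sales.orders_files_bronze")                            # placeholder Unity Catalog table
)
```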