The following is a guest post by Eddie Segal of Agile SEO, who has authored this piece as part of our Data Champions program. If you're interested in contributing to or learning more about our Data Champions program, please get in touch.
Migrating a data warehouse from a legacy environment requires a massive upfront investment in resources and time. There is a lot to consider before and during migration. You may need to replan your data model, use a separate platform for tasks scheduling, and handle changes in the application’s database driver. Therefore, organizations must take a strategic approach to streamline the process. This article presents a step-by-step approach for migrating a data warehouse to the cloud.
Enterprise data warehouse: On-premise vs Cloud
A data warehouse is any system that collates data from a wide range of sources. Data warehouses are used as centralized data repositories for reporting and analytical purposes. An Enterprise Data Warehouse (EDW) manages and stores business data. This data typically comes from different systems like Customer Relationship Management (CRM), Enterprise Resource Planning (ERP), and physical records.
On-prem data warehouses collect, store, and analyze data on onsite servers. As a result, companies need to manage hardware infrastructure. However, on-prem management is not always a viable option. Over the last couple of years, organizations have started moving their data warehouses to the cloud. Here’s why:
- Upfront costs—on-premises data warehouses require upfront expenses on hardware infrastructure. When working with cloud data warehouses, these expenses are not necessary.
- Ongoing costs—the cloud offers a low, pay-as-you-go model, while businesses with traditional data warehouses must deal with upgrade and maintenance costs.
- Performance—cloud-based data warehouse architectures leverage the Extract, Load, Transform (ELT) process to make data processing much faster than on-premises options. ELT is an alternative to the traditional Extract, Transform, Load (ETL) process for on-premises data.
- Flexibility—cloud data warehouses are designed to work with big data formats and structures. Traditional relational options are simply designed to integrate similarly structured data.
- Scale—The elasticity of the cloud enables you to scale big datasets. In addition, cloud-based data warehousing options can scale down as needed. You can’t easily do it with traditional approaches.
EDW migration challenges
Migrating data from an on-premise warehouse to a cloud-based environment creates several challenges. You may need to replan your data model, use a separate platform for scheduling tasks, and connect custom data applications.
1. Replanning your data model
Cloud data warehouses support different schema and data types. AWS Redshift has the most familiar set of data types since it is PostgreSQL compatible. Google BigQuery, on the other hand, uses STRING instead of VARCHAR, and employs REPEATED array type and RECORD semi-structured objects. Snowflake supports OBJECT, VARIANT, and ARRAY, for semi-structured data.
Moreover, cloud data warehouses encourage a schema denormalization approach for increased performance. The increased storage required to hold denormalized data is fairly cheap. However, running JOINS on tables stored on distributed servers is very expensive, and does not lead to the desired performance improvements. Therefore, you have to keep the cloud and the on-premise data models synchronized during migration.
2. Stored procedures
The layer of stored procedures acts like a repository of miniature data applications. You can use them to save a lot of data and preserve specific knowledge. The ability to write and use stored procedures is often missing and overlooked by cloud data warehouses.
Cloud data warehouses like BigQuery, Snowflake, support user-defined functions, but this is not enough. The common alternative is to use a separate platform for scheduling tasks orchestration or parameterized queries. There are open-source options like Airflow and Luigi as well as commercial cloud-based alternatives.
3. Connecting custom data applications to your data warehouse
Database drivers like ODBC/JDBC are used to connect your application to your data warehouse. These drivers usually behave differently. As a result, changing an application’s database driver requires many different query changes.
Some changes are obvious, since the SQL statements may result in visible errors. Other changes are less obvious, since different ODBC drivers can cause slight data conversions. For example, changes in the timezone format or timestamp. These changes appear only as data discrepancies. You will need more rigorous testing to detect them. As long as knowledge of these adjustments is shared, the whole organization can overcome them faster.
How to migrate your EDW successfully to the cloud
Many organizations think that a cloud migration is a one time trip. In reality, however, the process of data migration to the cloud should be gradual.
1. Copy your data
First, you need to create an initial copy of your existing data warehouse in the cloud. This process requires choosing the right cloud data warehouse for your needs, and then making an initial copy of all your data.
There are two key challenges in this step:
- Choose the right infrastructure—you can choose a smaller data set and migrate it to many different data warehouses for comparison.
- Copy all of your existing data—you may need to transfer hundreds of terabytes of data. Amazon and Google enable you to overcome this challenge, by physically moving your hard drives with USPS and trucks.
You also need to verify the schema and format of the data you want to transfer to the cloud. Then, you should transfer your schema into the cloud data warehouse before loading. Finally, you need to mark the point in time of the exported snapshot, and use it when setting up an ongoing replication process
2. Set up ongoing replication
The next step is setting up an ongoing synchronization process. Any synchronization process should be tested for reliability and latency. These parameters are critical for the success of the organization’s cloud migration strategy. You can build this synchronization on your own, or use a data pipelining service like Fivetran to manage the continuous replication of schemes and data. Once synchronization is secured, you can start migrating the rest of your infrastructure one component after another.
3. Migrating your analytics infrastructure and data applications
After the migration pipeline is all set up, you should start migrating Business Intelligence (BI) and analytics infrastructure. The analytics component is usually not the core of a company’s data infrastructure. Therefore, it makes for a less risky migration target. Examples of BI tools are Looker, Tableau, Periscope, and Chartio.
Migration of legacy data applications introduces more technical challenges. You may need to replace the ODBC driver, and adjust or rewrite queries. To fully utilize the performance advantages of your cloud data warehouse, you may also need to change the data model.
4. Migrate your transformations
The last step is to recreate the transformations that produce your final data models in your new cloud environment. Although the upfront investment in time, labor, and money might be steep, it is far better in the long term to follow an ELT rather than ETL paradigm. Tools such as dbt can allow you to write a SQL-based transformation layer on top of your data warehouse.
A typical data warehouse contains a large amount of data covering many business areas. Migrating all the data at once would almost guarantee failure. You need to take incremental steps to successfully migrate your data warehouse to the cloud, especially when undertaking significant design changes.
An incremental approach enables you to keep operating your on-premise data warehouse, while your cloud data warehouse comes online. During this transition phase, you can use tools like Fivetran to synchronize the data between the old on-premise data warehouse and the new one in the cloud.
Learn more about cloud-based data warehouses in the Fivetran Documentation.