One of the great things about the data industry is that emerging technologies and innovations continually promote automation, advanced analytics and, ultimately, confidence in our data. This constant churn presents a vexing challenge: keeping up with new offerings, assessing how they might fit into your data architecture and deciding whether you should adopt them.
While it’s very easy to get drawn in by impressive speed and performance metrics, or by claims that one tool can do “everything,” your data architecture design should stay focused on the core business outcomes your company is trying to meet today and in the near future.
How far to take future-proofing
It’s now widely accepted that most enterprises are multi-cloud or will be in the near future. As a result, architecture decisions depend less on having to predict:
- Your future business use cases, and
- If your current cloud vendor will be able to support those use cases.
Multi-cloud architectures mean you no longer need to default to whatever service your cloud vendor offers. While evaluating additional vendors might lengthen procurement cycles, it will most likely lead to better decisions. It also means fewer compromises on the performance of your current business needs made out of fear of potential future use cases.
The key to this shift is to future-proof your architecture and reduce vendor lock-in by choosing providers that are cloud-agnostic.
One platform for all
If the data industry has taught us anything in recent years, it’s that every part of the data lifecycle (creation, processing, transformation and management) poses challenges that are very difficult to solve. These challenges require intense focus and significant engineering effort, so when evaluating technologies, favor those that:
- Are best-in-class at the most critical part of your data architecture
- Can integrate well with upstream and downstream processes and can be programmatically controlled using APIs
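As a concrete sketch of what “programmatically controlled using APIs” can look like, the snippet below builds an authenticated request to pause a pipeline connector. The endpoint URL, payload shape and connector ID are illustrative assumptions, not any vendor’s documented API:

```python
import base64
import json
import urllib.request

# Hypothetical API base URL for illustration only
API_BASE = "https://api.example.com/v1"

def build_pause_request(connector_id: str, api_key: str, api_secret: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated PATCH request that pauses a connector.

    The URL shape and JSON payload are assumptions made for this sketch.
    """
    # Basic auth: base64-encode "key:secret"
    token = base64.b64encode(f"{api_key}:{api_secret}".encode()).decode()
    body = json.dumps({"paused": True}).encode()
    return urllib.request.Request(
        url=f"{API_BASE}/connectors/{connector_id}",
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
    )

req = build_pause_request("salesforce_prod", "my_key", "my_secret")
print(req.get_method())    # PATCH
print(req.get_full_url())
```

The point is less the specific call than the pattern: when each tool in the stack exposes operations like pause, resume and trigger over HTTP, orchestration tools can coordinate the whole pipeline rather than a human clicking through consoles.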
To use Fivetran as an example: instead of trying to conquer the entire ELT pipeline, we focused on the E (extract) and L (load) components. Recognizing that dbt Labs has an exceptional “T” (transform) offering, we concentrated our efforts on integrating with that open source technology so customers can easily use two best-in-class tools in the same data stack.
The data warehouse, lake and lakehouse debate
The question of what data repository to use is where people commonly:
- Get distracted by technologies that are ultimately very impressive for some use cases but would not serve their core business needs, and
- Feel a sense of pressure to choose the technology that is seen to be “the future.”
As Fivetran now supports S3 as a destination, we are well aware of the tradeoffs between different data repositories.
With the modern data stack, you have the flexibility to choose the right technology when you need it because:
- Technologies in the cloud can be deployed quickly and cost effectively with modern data integration tools
- Consumption-based pricing keeps costs low relative to the value a tool brings to the intended use case
So do you need a data warehouse, a data lake, a data lakehouse or a combination?
Let’s assess common scenarios for using data lakes and data warehouses first:
I have a data warehouse; do I also need a data lake?
While adding technologies can be relatively seamless and the cost barrier to entry is low, every technology you add that isn’t core to your business introduces potential points of failure as well as latency.
If you want to add a data lake to your data stack for advanced data science purposes, make sure you’ve invested enough in your data governance and business intelligence to be confident in your data before moving forward. Eric Jones, Head of Analytics at Hyperscience, says “Your first foundational layer for analytics is to be able to answer what happened yesterday really really well.” With this trusted foundational layer, your data science projects are more likely to avoid becoming part of the 87% of data science projects that never make it to production.
If you have a data science use case that can no longer be supported by your data warehouse, one option is a governed modern data lake that combines technologies such as Apache Iceberg and AWS Glue. This approach can enable “broad, flexible and unbiased data exploration and discovery” (Gartner) of structured, semi-structured and unstructured data types while avoiding a data swamp.
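As a sketch of how such a governed lake wires together, the Spark configuration below registers AWS Glue as the catalog for Iceberg tables. The catalog alias and bucket name are placeholders; the property keys come from Apache Iceberg’s AWS integration:

```properties
# Enable Iceberg SQL extensions in Spark
spark.sql.extensions                 org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
# Register a catalog named "glue" backed by the AWS Glue Data Catalog
spark.sql.catalog.glue               org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.glue.catalog-impl  org.apache.iceberg.aws.glue.GlueCatalog
spark.sql.catalog.glue.io-impl       org.apache.iceberg.aws.s3.S3FileIO
# Placeholder bucket: table data and metadata land under this S3 path
spark.sql.catalog.glue.warehouse     s3://your-bucket/warehouse
```

With this in place, tables created through Spark are registered in the Glue Data Catalog, so other engines and governance tools that read Glue see the same Iceberg tables — which is what keeps the lake from drifting into a data swamp.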
I have a data lake; do I also need a data warehouse?
If you have a data lake because of legacy rather than strategic decisions, consider migrating to one of the modern data warehouses where storage and compute are priced on a consumption basis. Data warehouses are optimized for analytics and have many additional benefits.
If your organization has a data lake to service critical data science use cases but also needs self-service analytics, you will need BI reporting at scale and data transformations. This means you might also need a data warehouse.
Doesn’t a data lakehouse cover all use cases?
The data lakehouse represents a form of convergent evolution that the industry is approaching from different directions. Data lakes are adding data warehouse-like functionality (for example, Databricks SQL), while data warehouses are adding data lake features (for example, Snowflake’s support for unstructured data).
For some use cases, data lakehouses are not as performant as data warehouses (although they are getting there). While the data lakehouse is potentially a good fit for some companies and performance improvements will surely come in the future, I will refer you back to the “one platform for all” section. Yes, a data lakehouse can on paper support the vast majority of your use cases, but is it the best-in-class for your business critical use case?
New technologies give you options
Continuing technological developments offer the flexibility to accommodate both structured and unstructured data. Fivetran now supports S3 as a destination combined with Iceberg and AWS Glue, ushering in a modern approach to data lakes that supports all data types without sacrificing governance. These capabilities should support your data efforts as they evolve from simple reporting and dashboards all the way to predictive modeling and machine learning.
[CTA_MODULE]