Analytics — the process of analyzing data to gather insights for better decision-making — has evolved alongside the modern data stack (MDS). That infrastructure shift has significantly reshaped analytics tools, teams and techniques.
The recent Modern Data Stack Conference 2021, which featured 40+ sessions across four virtual stages, brought together three industry pioneers to discuss the current state of data analytics. Their wide-ranging keynote, “An Executive Eye on the Future State of Analytics,” covered everything from the future of analytics to the question of which model is better: the “walled garden” of the data warehouse or the open-source Data Lakehouse. The participants were:
- George Fraser: CEO and Co-Founder, Fivetran
- Ali Ghodsi: CEO and Co-Founder, Databricks
- Rohan Kumar: Corporate Vice President, Microsoft Azure Data
An Infrastructure Revolution in Search of a User Experience Revolution
According to George Fraser, the rise of the modern data stack was “fundamentally driven by changes in cost.” As the cost of storing and working with data dropped, companies “that previously couldn’t have done things with data” gained a data warehouse and began using it for analytics. Others, which may have worked with data for years, are doing even more.
We’re “living with the consequences of that trend,” according to Fraser. Analytics is now a topic for corporate boardrooms, not just IT. The obvious question: What’s next for analytics?
Rohan Kumar noted how much the usage of data has changed. As an example, he cited his car: Twenty years ago, he went to the dealership when he encountered a problem. “Now, when I walk out of the door, I sit in my Tesla, they know pretty much everything about me in terms of what music I listen to, what speed I drive.”
According to Kumar, “if you look at the friction points,” they lie in the flow of data, which “I do believe still has a way to go.” The industry has done an excellent job of making it easier to build a modern data stack, and it has made the MDS more reliable, but the tools for managing that data still need refinement.
Kumar added that at Azure, “We see a lot of discussions” about simplifying processes and tools. The topic often arises when they discuss responsible AI. “If I want to train stuff, how do I do that in a way where the use rights of the data are actually being respected?” Kumar added that his team is “very heavily focused” on solving for collaboration so customers can use their tools and still get their work done.
Fraser summed it up: “In many ways, we’ve had an infrastructure revolution, and now we’re waiting for a user experience revolution.”
Ghodsi added, “Fivetran or Databricks, when we started, we didn’t go buy a data center and wait a year” for it to be built. “We just procured 40, 50, 100 SaaS apps, and we were off.”
But a legacy of the on-prem past still lingers, he noted: “On-prem legacy stacks are being copied into the cloud. You lifted and shifted your on-prem warehouse into the cloud, and now it’s a little bit more elastic with separated compute and storage, but that’s your new data warehouse.”
Walled Garden or Lakehouse?
Separating compute from storage has been foundational to how we now deploy analytics, said Fraser. But how will that ultimately work? “One of my opinions is that we are at a crossroads right now.”
According to Fraser, there are two visions for the separation of compute and storage. The first is the “data warehouse vision,” best embodied by Snowflake, where the compute and storage are hidden.
The second is the Data Lakehouse, best embodied by Delta Lake. Here the data is stored in a format based on open standards, so query engines can access the storage directly. “There are pros and cons of both models,” Fraser said. Which way are we headed?
Kumar responded, “What I believe is having a closed ecosystem, where the actual data, the layout is not open, is not going to serve customers well in the long run. I think it constrains the value that customers can get from the data they have.”
Ghodsi added, “Let me start by saying that I think Snowflake did a beautiful job of separating compute and storage, price-wise, cost-wise.” But not opening it up could be “problematic.”
Ghodsi thinks pricing will be a stumbling block: “I don’t want to pay SQL rates to pull data out of your data warehouse. Let’s say I want to run TensorFlow or a machine learning model. I don’t want to pay SQL rates to suck that data out of a data warehouse so I can do machine learning on it. I want to pay ADLS, S3 rates.”
Supporting the wider ecosystem is a further problem, according to Ghodsi. Even if a warehouse tuned its data access for one framework, others would be left out: “Okay, so then it could optimize TensorFlow, but what about [PyTorch]? What if I’m dealing with images or video? It’s going to be very hard for the closed, walled garden of a data warehouse to support access to the whole ecosystem.”
Wait, Fivetran and Databricks aren’t competitors?
Reading a question from the audience, Fraser said, “Someone says that before now, they would’ve thought that Fivetran and Databricks were competitors.” Why aren’t they?
Fraser explained that Fivetran moves data from the places where it lives to the places where you need to analyze it. “So you can use Fivetran with Databricks to do things like replicate your Salesforce data into Delta Lake or replicate your SQL Server database into Delta Lake, or you name it,” he said, adding, “For us, Databricks is a target. And I think Databricks themselves use Fivetran.”
“I love Fivetran,” Ghodsi said. “With the advent of the Lakehouse, which is essentially one place in the data lake where you can store all your data — you can do machine learning on it, but you can also do BI on it — so you would need the same data there. So how do you get your Salesforce, Marketo, NetSuite, Google Sheets data into your data lake? Fivetran is awesome for that. And we actually spent a lot of time making that integration really seamless.”
What kinds of modernization challenges will Azure Data face?
Fraser shared another question from the audience, this one concerning Azure adoption: “People who are adopting Azure are longtime Microsoft customers who have been using tools like SQL Server since the early ’90s, and so for them adopting Azure is partly about modernizing. And what kinds of modernization challenges do you see in the context of Azure Data?”
Kumar responded, “If I look at a significant portion of the consumption that gets driven in Azure, it’s through a lot of our enterprise customers looking at their workloads. And it’s interesting because, in a lot of these cases, they invested in applications that they’ve built up — or things that they’ve written.”
Companies aren’t changing their apps, necessarily — they are using the cloud to operate internationally without opening on-prem data centers around the world. “The other aspect, I think, foundationally what’s driving this is the budget,” Kumar said. “They want to invest in better analytics. But that doesn’t necessarily mean it comes with significantly more money to do that.”
How does governance work in a data lake?
Fraser read another question from the audience on data governance in a data lake: How is it done? According to Ghodsi, Delta Lake enables you to “create a curated lake where you might have bronze, silver and gold tables. The bronze ones might be your raw data, what you would normally have in your swamp, but you can have metrics that make them cleaner. You can set up structures and quality and say, ‘This is the schema that I want.’ You can build up higher-level abstractions for your governance.”
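The bronze, silver and gold tiers Ghodsi describes can be illustrated with a small sketch. The following is plain Python, not the Delta Lake API; the schema, the field names and the helper functions (`conforms`, `to_silver`, `to_gold`) are all hypothetical and exist only to show the idea of raw data being checked against a declared schema before it feeds curated tables.

```python
from collections import defaultdict

# Hypothetical schema for illustration: field name -> required Python type.
ORDER_SCHEMA = {"order_id": int, "region": str, "amount": float}


def conforms(record, schema):
    """Return True when the record has exactly the schema's fields, each with the declared type."""
    return set(record) == set(schema) and all(
        isinstance(record[name], expected) for name, expected in schema.items()
    )


def to_silver(bronze_rows, schema):
    """Split raw (bronze) rows into schema-conforming rows and rejected rows."""
    silver, rejected = [], []
    for row in bronze_rows:
        (silver if conforms(row, schema) else rejected).append(row)
    return silver, rejected


def to_gold(silver_rows):
    """Aggregate clean rows into a business-level table: total amount per region."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


# Bronze: raw data as it arrived, including rows that break the schema.
bronze = [
    {"order_id": 1, "region": "EMEA", "amount": 120.0},
    {"order_id": 2, "region": "AMER", "amount": 80.5},
    {"order_id": "3", "region": "EMEA", "amount": 10.0},  # order_id has the wrong type
    {"order_id": 4, "region": "EMEA"},                     # amount is missing
    {"order_id": 5, "region": "EMEA", "amount": 30.0},
]

silver, rejected = to_silver(bronze, ORDER_SCHEMA)
gold = to_gold(silver)
print(f"silver rows: {len(silver)}, rejected rows: {len(rejected)}")
print(gold)
```

In this sketch the rejected rows are kept rather than dropped, so the raw tier remains available for inspection while only validated rows reach the curated tables.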
Fivetran on Azure
Fraser closed the panel discussion by mentioning that Fivetran is now on Azure, “which we’re all very excited about. It’s now possible for Fivetran customers to run Fivetran workloads on Azure in the same regions as their destinations, which can be really important, especially for enterprise customers.”
Kumar concurred: “I think it’s going to be the beginning of a fantastic partnership.”