How will data tools and technology leap forward in the near future?
That was the big question during an executive keynote panel at the Modern Data Stack Conference 2020. Fivetran CEO George Fraser gathered four leading experts to discuss the advances we may see in just a few short years.
The industry luminaries joining the panel included:
- Michelle Ufford, CEO and co-founder of Noteable and formerly a key member of the data team at Netflix
- Martin Casado, general partner at Andreessen Horowitz
- Bob Muglia, a Fivetran board member and the former CEO of Snowflake
- Tristan Handy, CEO and founder of Fishtown Analytics, creator of the open-source analytics engineering tool dbt
Here’s what the panel had to say on three key topics.
Do Data Lakes Have a Place in the Modern Data Stack?
Fraser: In a world where we have data warehouses that use object storage to store their data and give you some of the advantages of data lakes, do data lakes still have a place?
Ufford: The data lake will have a place. I don't think it's going to go away. I just think it's not going to look how it looks today. Technologies like Snowflake eliminate the need that we had back when we first started creating data lakes. But I think you'll start to see a shift towards a decentralization of your data teams, or a data mesh.
Casado: In my experience, you end up with pluralities of technologies that can do what the other one does, architecturally. But in the end, you've got products and companies that optimize around use cases. And I think the operational AI use case is a large one, and it's growing faster. So actually, I think over time, you can argue that it's the data lake that ends up consuming everything.
Muglia: Actually, no, I don’t think data lakes will still have a place. But that's a long-term perspective. It's an arc of time. And you have to look at the evolution of how infrastructure changes over time to take on new capabilities. I think that five years from now, data is going to sit behind a SQL prompt by and large, and then over time evolve into relational. Relational will dominate and SQL data warehouses will replace data lakes.
Handy: For a lot of reasons, I believe that an organization will store their files one time. You will not have a data warehouse copy of the file and a data lake copy of a file, which you see in some architectures today. So that requires you to have an open-source file format that is shared between your data warehouse use cases and your other use cases. Those have to also start to converge so that different use cases can take advantage of the same stuff.
How Will Machine Learning Influence Analytics?
Fraser: How do we bring the world of machine learning, Python and Scala together with our world of analytics, SQL and BI tools? There are essentially three competing visions: 1) that you’re going to put machine learning into SQL, like what BigQuery is doing, 2) that you put SQL into Python or Scala, which is the Databricks vision, or 3) that you use Apache Arrow, where everyone implements an interchange format and everything talks together. Which do you think will win out?
Ufford: What I would like to see is something like Arrow. But ultimately at the end of the day, you'll continue to see specialization here. The things that you want to do if you're trying to do deep learning are just fundamentally different than predictive models, for example.
Handy: I think a lot about the Arrow version of the world. And I think that in the fullness of time, that will end up dominating. And it’s for the reason that tools end up evolving to the personas that they serve and the use cases that they serve.
Casado: I also believe in the Arrow future. You're going to have a heterogeneous, fragmented system. It's just always been that way in computer science. Therefore, you do need to have open interfaces.
Muglia: I’m going to be the radical and say that we’re approaching an era where we’ll have a hybrid architecture and hybrid will dominate for the next three to five years. You will see hybrid systems being built by every major vendor. All of them will have a full predictive stack and a full declarative, relational SQL stack built in using some kind of interface like that — but that's only until relational actually solves the broader set of problems, which will come.
What Are the Use Cases for the Modern Data Stack?
Fraser: What do you think are some of the most interesting or surprising use cases that may start to get pulled into the orbit of the modern data stack in the next couple of years?
Muglia: I think it’ll be around complex data. For example, I was talking to a company in the medical field yesterday, and the rich amount of data that exists in images and doctors' notes is opaque to our systems today. It will not be in five years. The modern data stack will be able to extract all of that useful information. To me, that's the gigantic transformation, into the types of applications that will be created in the years to come.
Handy: In my last job I ran marketing for a company. The problem you run into there is that you're constantly writing code to push data back and forth between systems because the different operational systems do different things, and you need the same data in all of them. No one has yet rearchitected the systems, but I think there's a lot that's going to play out there.
Muglia: What you're really talking about, Tristan, is the advent of the modern data application, which is an operational application that leverages data to actually make decisions autonomously for the business. We’ve seen very few of those today and significant examples are mostly in the future, but boy, will they be significant in the future.
Thank you to the entire executive panel for joining us at the Modern Data Stack Conference!