There is a fantastic talk on YouTube by Hannes Mühleisen, one of the creators of DuckDB, about the relationship between machine learning (ML) and database management systems (DBMS). In two slides, he persuasively argues that you can't shoehorn ML into your DBMS, because the ecosystem is just too big:
It's a convincing argument for those of us who've spent some time in both communities, and I recommend watching the entire talk. The strategy advocated by DuckDB is essentially the opposite of this picture: instead of the DBMS absorbing ML, ML absorbs the DB. DuckDB is an in-process DBMS that is designed to be "easy to absorb." It's a great strategy, and it's working well for a lot of people.
However, there also exist users for whom Hannes' "Perception" slide is actually right. When we think of the various people and teams making use of ML and DBMS, we can place them on a spectrum based on the composition of their work.
On the left side of the spectrum is your classic analytics team, whose primary tool is a SQL data warehouse. The output of their work is reporting, usually in a BI tool, as well as answers to ad-hoc questions. ML is not a big part of their world, and when it is, it's often a box-checking exercise. "Our digital transformation strategy says we're going to do ML, so let's find some ML to do!" For these users, special SQL syntax for creating and evaluating ML models is a great way to sprinkle a little ML onto a fundamentally DB-centric, SQL-powered workflow. BigQuery ML is a great example of this approach.
On the right side of the spectrum are the users Hannes is addressing. This is a team of data scientists, whose primary tools are machine learning libraries, usually programmed in Python. The output of their work is highly specific to the company; it might be an internal dashboard forecasting something, it might be a part of the company's product, or it might be a prototype for a product feature that will be rewritten by software engineers before going to production. If this team uses a DBMS, it's probably just storing metadata about files that contain the actual data. For this team, storing everything in a data warehouse is more trouble than it's worth. An embedded DBMS like DuckDB is perfect for them because it gives them the benefits of SQL without having to go to the trouble of replatforming their entire workflow onto a data warehouse.
Many users exist somewhere in the middle of the spectrum. The simplest example is when a company has both an analytics team and a data science team doing different things with the same relational data. If both teams can share the same database, they can eliminate a lot of duplicative effort and benefit from each other's work. For these teams, the right solution is a DBMS that is interoperable with the ML ecosystem. The hardest part of interoperability right now is moving large amounts of data between the two systems at high throughput. There are two solutions to this problem that are being used in the real world right now.
The first is Apache Arrow, which is a common data format and protocol that allows these systems to exchange relational data efficiently. For this to work, every participant in the system must adopt the Arrow protocol, but the pace of adoption is accelerating and Arrow has clearly reached critical mass.
The other solution is the Lakehouse, an architecture in which your DBMS and your ML tools share a common file format for data and metadata. Once again, every participant in the system must adopt the common format for this to work. The creators of Lakehouse have partially sidestepped this bootstrapping problem by adopting existing, popular file formats, so that existing tools can easily be taught to interact with the Lakehouse.
Both Arrow and Lakehouse solve the problem; the main difference between them is philosophical. In Arrow, the DBMS is at the center of the system and the ML tools are clients. In Lakehouse, all participants are equal, with a file system and format as the central "server."
I have a somewhat unusual perspective on this space, having been on both ends of the spectrum at different points in my career. In a previous life as a research scientist I was very much on the "files are my DBMS" team. In graduate school we had a lab database that I did not even bother to use, because it was more trouble than it was worth for the work I wanted to do. At the biotech company I joined afterwards, I set up the original data infrastructure, which used a MySQL database to store metadata about what was stored where, while all the data itself was stored in files. Now, as CEO of Fivetran, I speak with customers of all types, but I find most of them are doing classic analytics and reporting on relational data.
I expect all four fundamental approaches in my diagram to succeed. The open question is how popular each will become, and the answer will be driven by the type of work people are actually doing in the real world. If most people just want to do analytics and reporting, we can expect the DBMS with a sprinkling of ML to become the dominant paradigm. On the other hand, if most data teams shift their focus to ML, we can expect the integrated data warehouse to fade in importance compared with the ML tools and the SQL systems that are natively designed to work with them, like DuckDB and Lakehouse. There are a lot of vendors who would like to convince you that one way or the other will win, but the reality is that vendors are passengers on this journey. What matters is the problems people are actually trying to solve.