One of the most popular new databases at the moment is DuckDB. With millions of downloads per month and two startups created around it, the open source column store has achieved feathery heights usually reserved for bigger, older projects. But what’s surprising is how it got there.
In many ways, DuckDB represents the antithesis of your typical big data management product. For instance, instead of developing a distributed data store to handle big data, as scores of others have done, the creators of DuckDB bucked the herd mentality and went “unapologetically single node,” according to Hannes Mühleisen, who led the Database Architectures group that created DuckDB at the Centrum Wiskunde & Informatica (CWI) research center in Amsterdam, Netherlands.
As a database researcher who spent his whole life in academia, Mühleisen didn’t like how difficult it was to use modern big data management systems for data science and advanced analytics, he told Datanami.
“If you try installing Hadoop somewhere, it’s very difficult,” he said. “We thought, maybe we can design a data management system for analytics that’s more friendly to the user while at the same time…being state-of-the-art and having the latest in algorithmic and technological advances in terms of performance.”
In other words, Mühleisen wanted to create an analytical database that had the performance of a Formula One race car but was as user-friendly as a Toyota Corolla. When he and his team sat down to create such a system, DuckDB is what emerged.
A New Kind of Database
So, what is DuckDB? As previously mentioned, it’s unabashedly single node.
“We said we will not do distributed at all,” said Mühleisen, who is also the co-founder and CEO of DuckDB Labs, which creates the core database tech and provides tech support. “The data sets that everybody always talks about [are] terabyte scale and petabyte scale, thousands of nodes. But actually, the datasets that 99% of us are using tend to be much smaller. And if you don’t have to go distributed, you’re simplifying the user experience a whole lot.”
If you run at Google scale, then of course you’ll need to go distributed and “build these crazy things” like MapReduce, he said. “But for the rest of us, it’s really not very often about petabytes,” Mühleisen said. “It’s more about, hey here’s a file that’s super annoying and I want to read this and do some aggregation.”
The next characteristic of DuckDB is allegiance to good old SQL. While the NoSQL movement is still going strong and many people want to use Python and dataframes to query data, Mühleisen and his crew recognized that SQL wasn’t broke, and therefore didn’t need fixing.
“SQL has been called dead so many times I can’t remember,” he said. “But we decided that we’re going to do SQL. And it turns out it was a good idea because loads of people just know SQL.”
Like other OLAP-style databases, DuckDB features a column store (for efficient aggregations) and vectorized processing (for better performance). It’s designed to execute SQL queries incredibly fast. But it’s not a data warehousing database like Teradata or Redshift. It’s not a place to park all of your data to create that “single version of the truth.”
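To make that concrete, here is a minimal sketch of the kind of workload DuckDB is built for, assuming the duckdb Python package; the file name and column names are made up for illustration:

```python
# A hypothetical analytical aggregation over a local Parquet file.
# DuckDB scans the columnar file directly with plain SQL -- no server
# to stand up and no separate data-loading step.
import duckdb

top_users = duckdb.sql("""
    SELECT user_id,
           count(*)         AS events,
           avg(duration_ms) AS avg_duration_ms
    FROM 'events.parquet'
    GROUP BY user_id
    ORDER BY events DESC
    LIMIT 10
""")
print(top_users)  # prints the aggregated result as a small table
```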
In-Process Analytics
Where other OLAP databases zig, DuckDB zags. It functions more like an embedded analytics engine than a standalone data warehouse.
“DuckDB has this different angle,” Mühleisen said. “It’s more like something that you put into a workflow rather than something that you sort of run on its own servers. It’s like SQLite in many ways. It’s a library. It’s not like you install it and you’re running a server. It’s like you actually glue it to your application.”
Weighing in at just 50MB, DuckDB runs on a wide variety of systems (Linux, Windows, etc.) and is offered in a variety of packages. There are Python, R, and JavaScript packages. NASA is using it for something (they haven’t said what), and Fivetran is using it as part of their Apache Iceberg writing process, Mühleisen said.
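A rough sketch of what that in-process model looks like from the Python package (the file and table names here are illustrative, not from the article): the whole database lives inside the application and persists to a single local file.

```python
# In-process usage: the database is a library, not a server.
# Connecting creates (or opens) one local file; queries run
# inside the application's own process.
import duckdb

con = duckdb.connect("app_analytics.duckdb")  # hypothetical file name
con.execute("CREATE TABLE IF NOT EXISTS clicks (ts TIMESTAMP, page VARCHAR)")
con.execute("INSERT INTO clicks VALUES (now(), '/pricing')")
print(con.execute("SELECT page, count(*) FROM clicks GROUP BY page").fetchall())
con.close()
```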
The goal with DuckDB is to provide lightning-fast analytical processing right within an application. For example, when paired with a dashboard, the C++ database can provide millisecond response times on that dashboard.
“They take advantage of the capability of DuckDB to kind of run wherever you want it to run, to move the query processing closer to the user, which has a huge impact on the user experience,” Mühleisen said.
DuckDB is all about analytical processing, not transaction processing. You’re not going to process a million transactional rows a second with it like you might with a Postgres database. But if you need to read a billion rows a second, that it can do very well.
If a user needs an in-process OLTP system, Mühleisen recommends they look at SQLite. And vice versa, if a SQLite user needs analytics, Mühleisen hopes that they think of DuckDB.
“We sometimes call ourselves SQLite for analytics,” he said. “We may have actually invented a new class of system…It’s this idea that you don’t have a separate database server, that DuckDB is just glued to whatever other application that you have, and it’s doing analytics.”
DuckDB also has a good story to tell in terms of analytics efficiency. A single DuckDB node often replaces small Spark clusters on the order of 10 nodes, Mühleisen said. Similarly, people often run into overhead issues when they try to “stuff too many rows” into Pandas, a problem DuckDB can sidestep by running the query over the data rather than loading it all into a DataFrame first.
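For the Pandas case, one hedged illustration of that interplay (the DataFrame and its columns are invented for the example): the duckdb Python package can run SQL directly over an in-scope DataFrame, so the heavy aggregation doesn’t have to happen inside Pandas itself.

```python
# Querying a Pandas DataFrame with DuckDB SQL. DuckDB resolves the
# table name `df` to the local DataFrame and scans it in place.
import duckdb
import pandas as pd

df = pd.DataFrame({"region": ["eu", "us", "eu"], "revenue": [120, 340, 95]})

summary = duckdb.sql("SELECT region, sum(revenue) AS total FROM df GROUP BY region")
print(summary.df())  # hand the (much smaller) result back to Pandas
```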
Decidedly Different
There are two other things that separate DuckDB from the big data masses. First, the team of engineers behind the database at DuckDB Labs is based in Amsterdam, away from the hustle and bustle of Silicon Valley. It’s not exactly a technological backwater: CWI, the same Amsterdam research center, housed the team that created Python, the world’s most popular programming language. But being off the beaten path has turned out to be an advantage for DuckDB, Mühleisen said.
“I think it also helped us to do something that was nonconventional,” he said. “Had we been in San Francisco, we wouldn’t have had the freedom to just basically be like, we’ll just ignore all this sort of common wisdom and do something that we think is right, and actually be successful at it.”
The second thing is that DuckDB Labs has eschewed venture capital money. The second DuckDB startup, Seattle-based MotherDuck, which has created a serverless version of DuckDB and has the backing of Mühleisen and DuckDB Labs co-founder and CTO Mark Raasveldt, raised $52.5 million through the fall of 2023 at a $400 million valuation. DuckDB Labs itself has not taken a dime.
That’s not for lack of trying on the part of the venture capitalists. “We did get a lot of interest from VCs,” Mühleisen said. “Everybody wanted to talk to us. We had Andreessen. We had Sequoia. We had everyone talk to us. We ended up not taking any VC money at all.”
As DuckDB instances spread across the world, the momentum has picked up. Mühleisen says the project benefited from evangelists who sang the praises of the approach DuckDB was bringing to a new area.
“I think what also helped [is] maybe there is simply not a lot of tech in that space to begin with,” he says. “This space isn’t very crowded and I think we ended up making a good sort of compromise–not a compromise, but a new way of combining things that really hit a nerve.”
The sudden popularity of DuckDB has certainly been a fun ride for Mühleisen, who has spent his whole career as a database researcher up to this point. “It’s pretty wild to see all that happening,” he says. “As somebody who makes software, you kind of expect that nobody will care about your thing, right?”
Not this time, Hannes.
Author: Alex Woodie