Every week I learn something new and exciting about DuckDB, and the past two have packed a wallop. I am now trying to get my own flock up and flying here in Los Angeles County and a few spots around the country, with a loose network of friends. Basically you subscribers, wherever you are.
This is my path, which I see from a high level after 20-plus years in this business. What I’m seeing is this. DuckDB is the biggest news since Vertica. It takes integration to the next level. The bad news is the good news, which is that a bunch of different sets of minds are taking the open source ecosystem to the next level. But we have to pick our path carefully and be aware of the alternatives. It’s going to be a while before we have the experience and the nerve to identify best practices. So understand that in some way I am shooting from the hip with a series of educated guesses.
As soon as AWS announced S3 Tables, I knew it would be significant in the very same way I knew Iceberg was. Now they are together and just yesterday, DuckDB Labs announced support for the Iceberg REST API. I’m not sure if or how this might overlap with the Unity Catalog from Databricks, although I am getting the idea that Databricks are pretty much on the cutting edge of where I thought Vertica would be. But I am now willing to suggest that Vertica is a beautiful shipwreck, and I’m glad I jumped ship last year. However, the waters here are still cold.
So here’s what it looks like to me. I have been going through dozens of projects for which I still have sample data and metadata, and I am converting my old Vertica DDL into Postgres DDL and DuckDB DDL. My preferred format for data at rest is currently DuckDB because I can quickly rationalize whatever the heck I built in 2007 - I often leave a trail of development SQL. But this is slow going. I have been maintaining some of these small analytical sets on CockroachDB, which still may be a part of the picture - I have yet to get into the transactional weeds of that vs Postgres RDS, but I know that’s the place to look.
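A minimal sketch of what one pass of that DDL conversion could look like, assuming the Vertica-specific parts are mostly physical-design clauses (`ENCODING`, `ORDER BY`, `SEGMENTED BY`, `KSAFE`) plus a couple of type differences. The function name `vertica_to_portable` and the sample table are hypothetical, and the regexes only handle simple, well-formed `CREATE TABLE` statements, so treat it as a starting point rather than a real translator.

```python
import re

# Vertica physical-design clauses that follow the closing paren of the column list.
_TRAILING = re.compile(
    r"\)\s*(?:ORDER\s+BY|SEGMENTED\s+BY|UNSEGMENTED|PARTITION\s+BY|KSAFE)\b.*?;",
    re.IGNORECASE | re.DOTALL,
)
# Per-column compression hints such as "ENCODING RLE".
_ENCODING = re.compile(r"\s+ENCODING\s+\w+", re.IGNORECASE)
# LONG VARCHAR has no direct equivalent; TEXT is accepted by Postgres and DuckDB.
_LONG_VARCHAR = re.compile(r"\bLONG\s+VARCHAR(?:\s*\(\s*\d+\s*\))?", re.IGNORECASE)
# Vertica integers are 64-bit, so INT/INTEGER is widened to BIGINT to be safe.
_INT = re.compile(r"\bINT(?:EGER)?\b", re.IGNORECASE)


def vertica_to_portable(ddl: str) -> str:
    """Strip Vertica-only clauses from a simple CREATE TABLE statement."""
    ddl = _ENCODING.sub("", ddl)
    ddl = _LONG_VARCHAR.sub("TEXT", ddl)
    ddl = _INT.sub("BIGINT", ddl)
    ddl = _TRAILING.sub(");", ddl)
    return ddl


if __name__ == "__main__":
    sample = """CREATE TABLE sales (
    order_id INT NOT NULL,
    region VARCHAR(32) ENCODING RLE,
    notes LONG VARCHAR(65000)
)
ORDER BY region, order_id
SEGMENTED BY HASH(order_id) ALL NODES KSAFE 1;"""
    print(vertica_to_portable(sample))
```

Projections, flex tables, and anything with nested parentheses after the column list would need hand editing; the point is only to clear the repetitive noise so the remaining differences are easy to see.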
In the meantime, making a catalog of all of the DDL and structured data I have would be nice. So that puts a priority on this Iceberg REST catalog. And once again Rust is going to have to take a backseat to Python, dammit. I may not need all of that speed, and there is Daft out there.
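Until a real Iceberg REST catalog is in place, a stopgap inventory of the DDL on disk can be sketched in a few lines of standard-library Python. This is not a catalog in the Iceberg sense; it only lists which tables are defined in which `.sql` files. The function name `inventory_ddl` and the directory layout are assumptions for illustration.

```python
import re
from pathlib import Path

# Matches "CREATE TABLE [IF NOT EXISTS] name", where name may be schema-qualified.
_CREATE_TABLE = re.compile(
    r"CREATE\s+TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?([\w.\"]+)", re.IGNORECASE
)


def inventory_ddl(root):
    """Return sorted (table_name, relative_path) pairs for every CREATE TABLE under root."""
    root = Path(root)
    rows = []
    for path in sorted(root.rglob("*.sql")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for match in _CREATE_TABLE.finditer(text):
            name = match.group(1).strip('"')
            rows.append((name, path.relative_to(root).as_posix()))
    return sorted(rows)


if __name__ == "__main__":
    # Hypothetical project root; point this at wherever the old DDL lives.
    for table, source in inventory_ddl("."):
        print(f"{table}\t{source}")
```

The output is just tab-separated text, which is easy to load into a DuckDB table later if the inventory turns out to be worth keeping.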
Speaking of which, I am likely to leave Polars behind. I know there are some distributed applications for everything that has been done in Spark before, but here’s a bet. I bet that large ML jobs will be revolutionized by the smart folks who have figured out Smallpond and 3FS. In other words, now I have names for my speculation that the LLM gods would have reinvented ETL / ELT and whatever massive data management has been required. Considering that Elon built a giant datacenter in 6 months, these are quantum leaps. Whether or not LLMs have hit a wall, there are spoils.
Nevertheless, what I know is that the overwhelming majority of hardcore business rules and subsequent decisions do not require petabytes of data crunching. Decision-makers have neither the time nor the inclination… This is part of the DuckDB manifesto anyhow. So long as SMBs and I can afford to build 2U machines with 512GB of RAM and dish out a fleet of 16GB workstations, I’m in what used to be mainframe territory. I’m being pessimistic when I say that’s 80/20. This is what I have been calling the Kermadec architecture, so dibs on that. In other words, I want to be able to build turnkey monolithic systems with cloud overflow that calves out Icebergs like nobody’s business. I don’t want to deal with distributed data issues. That’s for the Smallpond and 3FS guys, and it will be 2 years before AWS makes a 3FS offering. That’s another nickel bet.
BTW, my homelab is called Southwall. I may be talking about that from time to time. In addition to my 3 Ubuntu NUCs I have 3 Mac Minis, Ubiquiti network hardware, and a couple of Synology NAS boxes. Maybe 10TB free right now. Obviously I have an AWS account, I’m doing some local MinIO, and I have B2 object storage out there too. I have some Google object storage, but I’ve been listening to Daniel Beach too much - starting to doubt, despite its ease, that GCP is all that. I will not insult Microsoft because I’m a humble little ugly duckling and I really did love Windows XP Professional once upon a time. And ThinkPads with the red nipple too.
I’m going to keep getting my hands dirty, but also working out some practical kinks in the plumbing. I am remembering that maybe I’m like Barenboim. I’m capable on the keyboards but I sure as hell ain’t as fast as Volodos or as dynamic as Hiromi or a goddamned virtuoso like Lang Lang or Yuja Wang - or any of those. I’m slow. I’m quirky and self-taught like Monk. I think I’m a pretty damned good composer. I hope to live as long as Horowitz, so there.
Next Steps
I’m setting up Podman just in case I have to run an Iceberg REST catalog from a container. But I’d like to keep that on bare metal. Somebody told me VMware is free nowadays. Maybe that. But I really want a catalog.