On my hiatus from full-time work, I’m studying several things. Journaling them here will help me stay accountable to myself, because I have small feet and often step into rabbit holes.
Right now I’m getting things configured. I’m running macOS with brew and warp. I’m beginning to be impressed with warp’s ability to handle multi-step tasks that would ordinarily have required me to build interactive Python scripts or experiment with other shells. No more. Brew is fine; I only mention it because for a short moment I reconsidered using nix.
But that’s not going to happen.
I’m still using poetry and pyenv for all of my Python stuff. I am starting to think that Python is the language of choice for all LLM-based AIs, which rather sucks if it’s your primary language, because they’ll only get smarter. Still, I don’t like Copilot or anything that autocorrects my typing. So on that score I’m using zed as my primary IDE, although I use vscode for cross-platform stuff.
Speaking of which, I decided to buy Parallels. So in addition to my seldom-used NUCs I now have Ubuntu and Kali images locally. Still can’t cut and paste the way I like, though.
Virtualization
My aim is to build producers and consumers into containers and/or AWS Lambdas: high-performance ETL/ELT engines that are customizable for every streaming feed. I want to do it in Rust, period. Sooner or later, Polars is going to kill Pandas, and I suspect it will be all I need for Kermadec data volumes, which I have yet to determine. If I end up needing a fleet of producers working in parallel, then I’m in KSQL territory, and I’m not sure I want to be there. I’m working at the Batcave level: one superhero consuming well-sourced data, not scraping kibbles and bits of chatter from satellite ELINT.
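To make the Polars bet concrete, here is a minimal sketch of the kind of feed transform I mean. It’s in Python for brevity (the Rust crate exposes essentially the same lazy API), and the file and column names are hypothetical:

```python
import polars as pl

# Lazy scan: Polars builds a query plan and pushes filters and
# column pruning down into the reader instead of loading everything
feed = (
    pl.scan_csv("feed.csv")  # hypothetical feed extract
      .filter(pl.col("magnitude") > 4.0)
      .group_by("station")
      .agg(pl.col("magnitude").mean().alias("avg_mag"))
)

# Nothing executes until collect(); the optimizer runs the whole plan at once
print(feed.collect())
```

The lazy plan is the point: that is what lets one hand-tuned instance punch above its weight.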
Depending on which of the services I pick, I’ll be running local containers. I’d rather not. But I also don’t want to be futzing around on Ubuntu for everything. I’ll build it on the Mac and see how well it translates. Thus Rust again. Cargo gives life.
The MDS
I actually forget what MDS stands for, so maybe this sentence is temporary filler. But I am looking at this DuckDB architecture stack and figuring out my favorites. Oh, duh: Mother Duck Stack.
Hey wait a minute. This changed from last week. I see Estuary is new. I like Estuary.
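Part of the appeal of the duck side is how little ceremony it takes. A sketch of what I mean, querying a Parquet file in place (the file and column names are hypothetical):

```python
import duckdb

# In-process analytical database: no server, no cluster to stand up
con = duckdb.connect()  # in-memory; pass a file path to persist

# DuckDB queries Parquet where it sits; there is no load step
con.sql("""
    SELECT station, avg(magnitude) AS avg_mag
    FROM read_parquet('events.parquet')
    GROUP BY station
    ORDER BY avg_mag DESC
""").show()
```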
Transformations
My transformers of choice are monolithic, in-database ELT first and foremost. Behind that would be dbt, because collaboration. I presume that the conversion between local and hosted would be minimal.
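To spell out what I mean by monolithic ELT: land the raw data first, transform it inside the database afterwards. A minimal sketch in DuckDB (the table and file names are hypothetical):

```python
import duckdb

con = duckdb.connect("warehouse.duckdb")  # hypothetical local warehouse file

# E and L: land the raw feed as-is
con.sql("CREATE OR REPLACE TABLE raw_events AS SELECT * FROM read_csv_auto('feed.csv')")

# T: transform inside the database, where the engine can see all the data
con.sql("""
    CREATE OR REPLACE TABLE clean_events AS
    SELECT station, magnitude, CAST(event_time AS TIMESTAMP) AS event_time
    FROM raw_events
    WHERE magnitude IS NOT NULL
""")
```

The dbt version is essentially the same SELECT, checked into a repo where collaborators can see it. Hence the minimal conversion cost.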
Speaking of transformations: I published Murder By Numbers the other day (yesterday?) while thinking about how I might get Stata data into the pipeline. Turns out something called ReadStat is just what the doctor ordered. It handles the major stat-pack formats (dta|por|sav|sas7bdat|xpt|zsav|csv), so I can inline that binary. I almost never use anything but homebrew to manage external packages, and have probably compiled C locally twice in the past 5 years. I’m really impressed with warp’s ability to handle that multi-step compilation cycle nicely.
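If I later want that conversion inside Python instead of shelling out to the readstat binary, pyreadstat wraps the same ReadStat C library. A sketch (the filename is hypothetical):

```python
import pyreadstat

# One reader per stat-pack format: read_dta, read_sav, read_sas7bdat, read_xport, ...
df, meta = pyreadstat.read_dta("survey.dta")

# meta preserves what CSV would lose: variable labels, value labels, formats
print(meta.column_names_to_labels)

# From pandas it is one hop to Parquet and into the rest of the pipeline (needs pyarrow)
df.to_parquet("survey.parquet")
```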
Ingestion
I expect that Fivetran does it all. They were a competitor to Full 360 back in the day. Hard to say if their business model will keep working on a guy like me; I’m sure they’re not cheap. Anyway, I’m the guy who builds producers and ingestors from scratch, as I said above. Nevertheless, I’m investigating CloudQuery, Meltano, and Airbyte. I have a gnawing feeling that the AI guys have figured out massive data management on a scale unknown to enterprise computing, and I want to know more about that. I want the ability to sip from those rivers and understand their rules.
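For the record, "from scratch" rarely means much more than this. A minimal Kafka producer sketch (confluent_kafka assumed; the broker, topic, and record shape are all hypothetical):

```python
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def publish(record: dict) -> None:
    # Key by station so each source's events stay ordered within one partition
    producer.produce(
        "kermadec.events",
        key=record["station"],
        value=json.dumps(record),
    )

publish({"station": "RAO", "magnitude": 4.7})
producer.flush()  # block until the local queue is delivered
```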
Sources & Data Management
I may be wrong, but if S3 doesn’t do everything I need, I question whether I need it at all. The same goes for any major hyperscaler object store, so Azure blobs and whatever GCP calls theirs. BTW, I like GCP. Really. It’s a breath of fresh air. More on that later.
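Part of why object storage may be all I need: the new engines treat it like a filesystem. A sketch with Polars (the bucket and columns are hypothetical; credentials come from the usual AWS environment variables):

```python
import polars as pl

# Lazy scan over S3: only the row groups and columns the query
# actually touches get pulled over the wire
recent = (
    pl.scan_parquet("s3://my-bucket/events/*.parquet")
      .filter(pl.col("event_date") >= pl.date(2024, 1, 1))
      .select("station", "magnitude", "event_date")
      .collect()
)
```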
Reverse ETL
WTF is reverse ETL? Whatever it is, I think I’ve got to have some of it. We shall see. I’m guessing it’s relational writeback: pushing results from the warehouse back into the operational systems. This will be particularly useful in my upcoming collab.
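If that guess is right, the mechanics are mundane. A sketch of pushing a warehouse aggregate back into an operational Postgres (connection string, tables, and columns are hypothetical; assumes a Postgres driver like psycopg2 is installed):

```python
import duckdb
from sqlalchemy import create_engine

# Compute in the warehouse...
summary = duckdb.connect("warehouse.duckdb").sql("""
    SELECT station, count(*) AS n_events, max(magnitude) AS max_mag
    FROM clean_events
    GROUP BY station
""").df()

# ...then write the result back where the operational apps can reach it
engine = create_engine("postgresql://user:pass@localhost/ops")
summary.to_sql("station_summary", engine, if_exists="replace", index=False)
```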
Business Intelligence
I already know that I like Metabase. It’s mature, intuitive, reasonably priced, and good-looking. But it’s not a spreadsheet, and I think we all must at some level concede the primacy of the spreadsheet. However, I’m betting that Applied OLAP’s Dodeca is the king of managed spreadsheets; I don’t think they even have any competition. My experience with Power BI, meanwhile, reminds me not to bet against Microsoft. It’s something of a hodgepodge of two merged products at the moment, but MS Fabric is going to win a lot of seats. A lot of captured seats. But I have learned the data back door. Thanks, Darwin Analytics.
Cube is nice, but limited. Hex is confusing. Preset is good and simple; Superset is good, not simple, but free. Tableau is Grendel on a hoard of gold. The rest I don’t know.
Data Tools
Only PuppyGraph looks interesting to me. I tried to play with Neo4j and was frustrated and disappointed. I’m still very curious about Relational.ai, but I’m not sure that I will need complex rules-based logic at Kermadec scale.
Orchestration
I can’t imagine abandoning Airflow, but I want to, for the same reason Superset feels meh: lack of elegance. So I’m playing with GitHub Actions for the moment. CI/CD is kind of remote from my concerns right now; I’m looking to squeeze performance out of single, hand-tuned instances. Getting data flinked in and flinked out is my biggest hurdle.
I’m sorry, it has been a long time since I published anything over here. I have been so busy and distracted that almost no new information about tech has gotten my attention. Today, on the other hand, I learned a whole lot, and I got excited enough that I need to share.
The first pointer is to this dude who was on the kickoff team for Google BigQuery, which I have recently come to have a bit of affection for. It’s really tough for me to come to grips with the fact that Vertica has taken up all of my mindspace for so many years. Anyway, he has published a screed called The Small Data Manifesto. Now, it turns out that I have been collecting small data for years, and anybody who used to check the Johns Hopkins daily COVID report knows for a fact that just that much is all you need. Same with day traders. There may be big data churning in the background, but end users only need so much.
Query Space
Over the years, as I looked at design patterns for warehouses, marts, ETL, and BI front ends, I’ve mostly been drawn to performance. I can remember when a 7-second response time from DB2 was the gold standard. We are long past the days when everything was expensive. Now everything is cheap, and we have succumbed to risk homeostasis.