The Full Data Stack 05

Product thinking is back, and Substack writers are killing it

Aug 31, 2025

Hey y’all, I’m Hoyt!

Each week, I share news and findings in the world of Data and AI. I span the entire data stack and am trying to help seasoned data professionals move into the new world of Data + AI.

Not Subscribed? Come join! ⊂(◉‿◉)つ

Happy EOW!

I was travelling to Chicago for work this week but managed to catch A LOT of great stuff. The two trends I’m seeing are that Product Thinking for Data + AI is really starting to catch on and that Substack writers in this space are putting up some stellar content. There are a lot more Substack writers in this week’s FDS than I normally have. Let’s hop in and see what I found!

Interesting things I found this week in Data and AI:

I’m a Senior Product Manager specifically for Data, a role I pivoted into from Data Scientist back in the lockdown days of COVID. I had been incredibly frustrated as a DS at a Fortune 25 company and desperately wanted a role that had an actual say in what was getting created. But here’s the thing… managing a product’s lifecycle is really different from building a PoC on your local computer. When you finally have the keys to the car, you need to actually deliver it somewhere. Not only that, you need a map at the beginning that offers you at least a few ways to get to your destination, because you don’t know whether any of the routes will have roadblocks. Metaphors aside, putting the Product Lifecycle on top of Data requires things from Data Teams that they aren’t used to. Namely, YOU NEED TO THINK A LOT UP FRONT ABOUT WHAT YOU’RE MAKING. Now imagine doing that not only with a deterministic outcome like a data model or a dashboard, but also with something like Agentic Software. Putting the Product Mindset on probabilistic tools like Agents and LLMs adds an entirely new layer to the “Data Product Mindset”. I’ve been really pleased to see a lot of product thinkers starting to wrap their heads around how to wrangle the scope of GenAI and create things that are actually useful and valuable. The Elite AI Assisted Coding Substack is becoming a go-to for that type of Systems and Product thinking. I am connected with Eleanor Berger on LinkedIn and have been extremely impressed by the different projects/products she has been building and evangelizing. In their latest interview they talk to the co-founders of SpecStory, who go into great detail on how they leverage Agentic development to create their business outputs. Honestly, this is one of the best breakdowns of how building really works with Agents I’ve ever come across. It’s the future of Product Thinking, and any Data PMs out there had best go check it out!
When we hear people talk about “Data Products” there’s a good chance they won’t be able to give you a second or third level explanation of what that term means. It has, unfortunately, become an industry buzzword with little definition beyond the two words. I do, however, still like and prefer that term. I think it makes sense because there is a software product and then there is a data product. The differences between the two range from subtle to vast depending on what you are creating. So I was very happy to see a new Substack by the author of Managing Data as a Product, Andrea Gioia, whose second post gets into the actual details of what “Data Product” really means to him. He also happened to subscribe to the Full Data Stack so I’m feeling very seen right now.
While I am a Product Manager, there’s an important word to put at the beginning of my title. That word is “Technical”. I was an IC in Data for years and have strong opinions about architecture, codebase and platform decisions for data products. I also do a lot of analysis on my end after launch. My go-to library for that is Polars, because it is the apex analysis library and no, I won’t even have a discussion about it. The syntax is your favorite parts of Pandas and SQL put together and it absolutely shreds through data. The one limitation is that performance starts to take a hit once your data outgrows what a single node (server) can handle. This is the amount of data where you would start to think about using Spark clusters instead. The Polars team knows this, so they have been building out a much more efficient way to utilize your machine’s hardware to get better performance on those types of workloads. If you have a GPU on board, their GPU engine is now fine-tuned to get everything it can out of it. The Data Engineering writing GOAT Daniel Beach does his thing in his latest article about the GPU performance. As usual, he has code and outputs along with thoughts and opinions that only a Midwestern, road-hardened Senior Data Engineer would have.
If you couldn’t tell from the subtitle of this week’s FDS, Substack writers in the Data + AI space are starting to really ramp up their content. I’ve been saving, liking and restacking more and more. This seems even more acute now that people are starting to realize LinkedIn is tanking both its algorithm and people’s ability to connect. Back to Substack killing it: there have been Data stalwarts on this platform for a number of years. Yordan Ivanov has been one of those people and works tirelessly to help others build their Data careers while also being a Head of Data himself (not easy to do, folks!). He recently co-wrote an article on the fantastic Pipeline to Insights Substack about semantic models and why they are gaining traction in Data Modelling. I also recently wrote about Semantic Models and why they are useful in Data + AI, but Yordan’s article is one reason I have decided to stick to just the weekly Full Data Stack newsletter. Instead of trying to add more to the conversation, I am looking for those who are out there already writing it all. Semantic models are crucial to the next generation of tools in Data + AI and this is a great read.
While I’m at it, giving all the Data writing OGs their flowers, Abhisheck Choudhary (@Ubuntu on X) has always been ahead of the game and continues to drop posts like this on X about how to actually use AI tools in Data Engineering.
A little known fact about me: I F*****ING LOVE DATA LAKEHOUSES. No seriously, I really do like the idea of a Data Lakehouse. I constantly gush about Bauplan and DuckLake because I think the Data Lakehouse concept is very forward thinking and has real world value if used properly. Netflix totally agrees with me but, in a very Netflix way, goes way overboard in how they use it. A recent blog post from LanceDB (a very nice multimodal vector database platform) shows how Netflix is leveraging LanceDB to build a Multimodal Lakehouse for all their media. It covers Media Data Lakes vs. Traditional Data Lakes and treating unstructured blobs as first-class citizens with uncompromised random access I/O. Even if you don’t know what that means it sounds REALLY COOL, and we should all read this article and hope to keep some nugget of insight. But the truth is Data folks should get more accustomed to Multimodal Data, because it is going to become a requirement for analysis.
The reason I know that Data + AI is not just a trend, and that we are going to see more and more adoption of AI tooling with Data, is that serious and smart players in the Data Startup space are spending time and money to add it to their products. Bauplan has added an MCP server to their serverless Data Lakehouse and I honestly have to find some time to try it out. Rill Data also has a fantastic MCP server for their projects and semantic layer, and it works like a charm. The best part about Bauplan is that they don’t just write articles about theoretical use; they almost always make a GitHub repo to go with them too.
I am constantly looking for GitHub repos, Substack posts and personal blogs from people who aren’t just talking about how you should try to use Data and AI tools but are giving you real ways to use them. This spans the entire landscape of Data, from a repo that uses Rust and LlamaIndex to build a semantic search over documents to orchestrating hourly SQLMesh jobs. I am less interested in what data influencers think and more interested in the boots on the ground people out there making things and sharing them in public. Think of the Full Data Stack as the hipster music blogs of the early 2000s (Gorilla vs Bear, My Old Kentucky Blog, Pitchfork and Brooklyn Vegan). If you don’t know what I’m talking about, that’s good because that is the point of being a hipster. Don’t worry, here’s a quick summary of the vibe.
A second potentially known fact about me: I F*******ING LOVE PYTHON. And while I do straddle the fine line between hipster and nerd, I tend to fall on the nerd side of things. So much so that I am willing to find time to watch an hour and a half documentary on the Python programming language. I am obsessed with origin stories. I watched similar YouTube documentaries about React.js and the TR-808 drum machine. This new one on Python’s background is fascinating because Python was originally intended to work as a natural language UI for computers (sound familiar?). Python as an open source project must be protected at all costs, so thankfully films like this help more people realize just how important it is. As is noted in the very first minutes of the documentary, “Python is literally on Mars” (via the Mars rover). If you don’t feel like reading all the above links, but you have a Sunday lunchtime to burn, then put this up on YouTube and enjoy the journey of my favorite programming language.

If you like what you read then please subscribe! This is a weekly Newsletter only and I take a lot of time through the week to make sure it isn’t a boring tech bro regurgitation of everything else you read.

If reading isn’t your thing, I also have a YouTube channel where I am finally getting more content up after a one-month hiatus. A video is worth a million words sometimes, and I like to get my hands on things vs. writing about them.

Youtube: https://www.youtube.com/@thefulldatastack
