The Next Generation Data AI Tool Kit
The four AI technologies any Data professional needs to understand
Hey y’all, I’m Hoyt!
Each week, I share news and findings from the world of Data and AI. I cover the entire data stack and aim to help seasoned data professionals move into the new world of Data + AI.
Not Subscribed? Come join! ⊂(◉‿◉)つ
How can you effectively build your Data AI skillset?
First, let's acknowledge the current landscape of Data and AI. It's quite chaotic, and skepticism is rampant. New tools are frequently developed (or vibed) and just as quickly abandoned. In recent months, though, we've started to gain clarity on how Data and AI will interact and on the tools available to construct those systems. The swift pace of innovation means that what is effective today might not hold relevance tomorrow. Nevertheless, certain technologies and patterns are emerging that appear to have the staying power to carry us through this first Data AI wave.
I am calling this new stack of tech the SLAM stack: Semantic Layers, LLMs, Agents, and MCP servers. Mainly because that was the best acronym I could come up with. But it has a ring to it, and I personally think it has marketing potential.
What sets the SLAM stack apart from other AI stacks is its emphasis on establishing guardrails for AI, rather than adopting a reckless 'YOLO' approach and waiting to see the outcomes. In the realm of data, particularly analytics, we aim for our generated responses to be as deterministic as possible. This stack enables us to manage that probabilistic uncertainty and deliver solutions that genuinely add value.
Now, let’s explore each technology within the SLAM stack to clarify why this new toolkit is so beneficial.
Semantic Layers
A key element in the realm of Data AI is the Semantic Layer. This layer serves as a conduit between raw data and user-friendly insights, facilitating meaningful connections within datasets. Think of it as a programmable interface, defined in code, that lets a user query data under engineered context constraints to achieve better results, especially when using an LLM or Agent.
A traditional Semantic Layer involves:
A table (datasets)
Dimensions in the table you would use to query (column names)
Measures you would create using the data (aggregations)
Joins that you use with other tables to get calculations (join logic)
Time dimensions you use to do time filtering (time ranges)
When you use a semantic layer, you won't be writing literal SQL. You will write Python and specify the dimensions, measures, joins, and time dimensions you want to use. Then a polyglot library like Ibis will write the SQL and process it as needed.
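Here is a minimal sketch of what that looks like with Ibis. The orders table, its columns, and the local DuckDB file are all assumptions for illustration; the point is that you describe dimensions and measures in Python and Ibis compiles the SQL for you.

```python
import ibis  # pip install 'ibis-framework[duckdb]'

# Assumed setup: a local DuckDB database with an "orders" table
# containing order_date (date), region (string), and amount (numeric).
con = ibis.duckdb.connect("warehouse.db")
orders = con.table("orders")

# Dimension: region; time dimension: order_date truncated to month;
# measure: sum of amount. No hand-written SQL anywhere.
monthly_revenue = (
    orders.filter(orders.order_date >= ibis.date("2024-01-01"))
    .group_by(month=orders.order_date.truncate("M"), region=orders.region)
    .aggregate(revenue=orders.amount.sum())
)

print(ibis.to_sql(monthly_revenue))  # Ibis generates the backend-specific SQL
```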
This allows you to use an LLM with data more effectively, because you don't rely on the LLM to generate SQL (not what it should be doing). Instead, the LLM is used along with an Agent to understand your question, compose the semantic layer query in Python (which it is good at), and return the result in a more useful way than just a table.
A list of great semantic layers:
Boring Semantic Layer
Rill Data
Cube.js
Large Language Models
You can't really have an AI toolkit without using LLMs. This isn't because LLMs are required at every stage of the Data AI process. On the contrary, achieving Senior Data AI enthusiast status requires an understanding of WHEN to leverage an LLM. That's mostly because this is where costs accrue fastest: you will be leveraging a 3rd-party LLM. You will not be building your own foundation models, and you probably won't be running your own GPUs on-premises to do inference. So the real skill is in understanding when an LLM needs to be introduced.

In the Data AI SLAM stack, this normally happens at the beginning (understanding the user's prompt) and at the end (interpreting the data output and generating a more robust answer for the user). It's best to use the LLM for what it is good at (generating words) versus asking it to do things it's not really designed to do (turning your prompt into SQL). In Data, the LLM is the inference and human aspect (listening and responding) that gives Data AI that magical feeling for the user.
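As a sketch of that "end of the pipeline" step, here is what handing query results to an LLM for narration might look like. This assumes the openai Python package with an OPENAI_API_KEY set in the environment; the model name and the rows data are illustrative stand-ins for your semantic layer's output.

```python
from openai import OpenAI  # pip install openai; assumes OPENAI_API_KEY is set

client = OpenAI()

# Illustrative stand-in for rows returned by a semantic layer query.
rows = [
    {"month": "2024-01", "revenue": 120000},
    {"month": "2024-02", "revenue": 135000},
]

# The LLM does what it is good at: turning the result into words.
response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    messages=[
        {"role": "system", "content": "Summarize query results for a business user."},
        {"role": "user", "content": f"Question: How is revenue trending?\nData: {rows}"},
    ],
)
print(response.choices[0].message.content)
```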
The main LLMs:
Claude Sonnet
OpenAI GPT-4
Google Gemini
DeepSeek R1
Agents
Agents have been around for a minute, and I feel they are the most hyped. That aside, they are extremely powerful. They can automate tasks, browse the web, or run tools. In terms of Data AI, the Agent is tasked with choosing the tool it needs to answer the user's question. The Agent will look at the semantic layer and any other tools available (MCP servers, local files, or websites) and decide what action to take to get the correct outcome. While you can technically build your own Agent, I suggest instead leveraging one of the Big 3 Command Line Interface tools or the IDE Agents. Choose one and get acquainted with it!
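To show the shape of that pattern, here is a hand-rolled, deliberately tiny agent loop. Every name here (TOOLS, pick_tool, answer_question) is hypothetical; a real Agent uses an LLM to choose the tool and iterates until it is satisfied with the result.

```python
# A toy "decide, act, respond" loop. Real agents replace pick_tool
# with an LLM call and keep looping until the answer is good enough.
TOOLS = {
    "semantic_layer": lambda q: f"rows answering: {q}",  # would run a governed query
    "web_search": lambda q: f"search results for: {q}",  # would call a search API
}

def pick_tool(question: str) -> str:
    # Stand-in for the LLM's tool-selection step.
    return "semantic_layer" if "revenue" in question.lower() else "web_search"

def answer_question(question: str) -> str:
    tool = pick_tool(question)           # decide which tool fits the question
    observation = TOOLS[tool](question)  # act: run the chosen tool
    return f"[{tool}] {observation}"     # respond with the observed result

print(answer_question("What was revenue last quarter?"))
```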
The CLI and IDE Agents currently out there:
Claude Code
Google Gemini CLI
OpenAI Codex
Cursor Agent
VSCode Copilot
MCP Servers
MCP (Model Context Protocol) probably has the most uncertain future in the SLAM stack. That is because it was specifically created and defined by Anthropic (creators of Claude Sonnet), and since its inception, both OpenAI and Google have come out with their own protocols. However, MCP has built a strong early-adopter user base, and quite frankly it just works.

What do MCP Servers actually do? They act as a bridge between AI models and external systems like databases, APIs, or other tools. They let AI models interact with these systems in a standardized way, enabling them to perform tasks beyond their training data. Think of MCP servers as roadmaps that Agents use to determine the best route to the right answer for the user. In terms of Data AI, MCP servers are fantastic at funneling the Agent down a specific route, usually to an API endpoint that calls a specific tool, which runs the query you need to get a near-deterministic answer. The MCP server is how we keep the Agent on the rails and acting with intent, instead of letting a probability machine run around trying anything it wants.
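To make that concrete, here is a minimal sketch of an MCP server exposing a single query tool, using the FastMCP class from the official MCP Python SDK. The server name, the tool, and the hard-coded return value are illustrative placeholders; a real server would call your semantic layer or warehouse.

```python
# A minimal MCP server with one tool (pip install "mcp[cli]").
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("data-tools")  # illustrative server name

@mcp.tool()
def monthly_revenue(month: str) -> str:
    """Return revenue for a given month (YYYY-MM)."""
    # Placeholder; a real server would run a semantic layer query here.
    return f"Revenue for {month}: 120,000 (illustrative placeholder)"

if __name__ == "__main__":
    mcp.run()  # serves over stdio so a local Agent can connect
```

You would then register this server with your Agent of choice, and the Agent sees monthly_revenue as a tool it can call: a narrow, governed route instead of freeform SQL.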
At the moment, anyone can create an MCP server, and there are many out there. They span multiple SaaS platforms, plus general-purpose ones like local file storage. For Data AI, there is a small but powerful list of options:
Rill MCP Server
Google MCP Toolbox
Anthropic Fast MCP
Without throwing the kitchen sink at you, I wanted to just touch on these four technologies so I can move forward with deeper dives, link to use cases, and share my own work with you so you can deepen your understanding and look for your own opportunities!
I have some other links of my own work in case you want to check out some use cases:
Connect Cursor to BigQuery MCP
Check out Rill’s MCP server in action
Mmmm... is Google Gemini CLI an Agent, or a tool that helps you build agents?
Got me thinking…