The Crucial Role of Semantic Layers in Data + AI
Giving LLMs the right rules to abide by when reading data
Today I want to give a high-level explanation of the semantic layer, a crucial part of the SLAM Data stack. What is it, where does it come from, and what problem does it solve for Data + AI? Let’s get going!
What are the Issues with an LLM Reading Data Directly?
Many LinkedIn memes mock the poor performance of LLMs on structured datasets, and they aren’t wrong. LLMs are trained on unstructured data like text and images, while structured data carries specific business context and rules of use. This creates a mismatch with how LLMs normally operate: they read a user’s natural language input and probabilistically return the tokens they predict the user will approve of. Early attempts, particularly the "Text to SQL" approach, were explored in various white papers but ultimately fell short. This isn't due to LLMs' inability to write SQL; given an existing query, they can analyze it and make updates. Where they struggle is taking an ambiguous business question ("What was revenue last quarter?" Gross or net? Fiscal or calendar quarter?), examining structured data, and producing correct SQL that answers it.
There are four key reasons an LLM struggles to write correct SQL when looking directly at a database:
1. Lack of Contextual Understanding: LLMs generate responses based on patterns learned from vast amounts of unstructured text. However, they lack the ability to understand the specific context and relationships within a structured database, which is crucial for writing accurate SQL queries.
2. Inability to Interpret Schema: A database schema defines the structure of the data, including tables, fields, and relationships. LLMs do not inherently grasp this schema, making it challenging for them to construct queries that align with the database's design and constraints.
3. Error-Prone Syntax Generation: While LLMs can produce text that resembles SQL, they often generate syntax that is incorrect or inefficient. This is due to their reliance on probabilistic patterns rather than a deep understanding of SQL syntax and best practices, leading to potential errors in query execution.
4. Misunderstanding of Table Joins: LLMs struggle to determine the most effective way to join tables, as they do not possess a true comprehension of the relationships between different data entities. This can result in suboptimal join conditions or the omission of necessary joins, ultimately affecting the accuracy and performance of the generated queries.
In the early stages of Data + AI, users were attempting to give the LLM data directly. That could result in something useful, but overall the limitations above held it back from showing much promise. What we needed was an unstructured explanation of the data: something that captures the dimensions, measurement logic, and join structures of the tables we want to pass to the LLM, expressed in a form an LLM can actually understand.
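To make that concrete, here is a minimal sketch of what such a description might look like, written as a plain Python structure. Everything in it is hypothetical (the table, column, and metric names, and the shape of the model itself); it illustrates the idea, not any particular product's format.

```python
# A minimal, hypothetical semantic model: friendly dimensions, governed
# measure logic, and declared join structure, all described up front
# rather than left for the LLM to infer from raw tables.
SEMANTIC_MODEL = {
    "orders": {
        "table": "analytics.fct_orders",
        "dimensions": {
            "order_date": "order_created_at::date",   # friendly name -> expression
            "customer_region": "region",
        },
        "measures": {
            "revenue": "SUM(order_total_usd)",        # one governed definition
            "order_count": "COUNT(DISTINCT order_id)",
        },
        "joins": {
            # how orders relates to customers, so the LLM never guesses keys
            "customers": {
                "table": "analytics.dim_customers",
                "type": "LEFT JOIN",
                "on": "orders.customer_id = customers.customer_id",
            },
        },
    },
}
```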
Enter the Semantic Layer
First, I think it is worth quickly explaining the history of semantic layers.
1990s: Early Data Warehousing
Tools like BusinessObjects, Cognos, and MicroStrategy introduced semantic layers to help business users navigate SQL-heavy data warehouses.
These layers mapped raw tables to friendly terms like “Revenue” or “Customer Region.”
They enabled drag-and-drop report building without needing to know SQL.
2000s: Rise of OLAP Cubes
Platforms like Microsoft SSAS and Hyperion Essbase used pre-modeled cubes with dimensions and measures.
The cube structure was essentially the semantic layer, powering fast queries and consistent business logic.
This introduced the concept of reusing metrics across dashboards and reports.
2010s: Modern BI Tools and Thin Semantic Layers
Tools like Tableau, Looker, and Power BI shifted semantic modeling closer to the visualization layer.
Looker introduced LookML, a version-controlled semantic modeling language.
However, definitions became fragmented across tools — the “semantic layer” was often siloed.
Cube.js emerged toward the end of this period as an open-source headless BI platform, decoupling the semantic layer from front-end tools.
It offered APIs to serve pre-modeled data to custom apps and dashboards.
This represented an early push toward “semantic layer as infrastructure.”
2020s: The Semantic Layer as a Service
Centralized semantic modeling platforms like dbt Semantic Layer, Transform, Cube, and MetricFlow emerged, focused on defining metrics and business logic once and serving them to many tools via APIs or SQL generation.
This marked a shift from visualization-centric semantics to platform-native, headless layers.
Rill Data introduced real-time semantic modeling and dashboards built on DuckDB.
It emphasized speed, simplicity, and local execution.
This integrated modeling and exploration in one tight loop for fast feedback.
2023–Present: LLMs + Semantic Layers
LLMs created demand for natural language interfaces to data.
Semantic layers now act as the ground truth for AI agents and copilots.
They help map vague human questions to precise, governed queries and metrics.
Semantic layers are becoming the interface between AI and structured data.
New open-source semantic layers like Boring Semantic Layer are built with Python and stand on their own.
How does this help an LLM talk with Data?
A semantic layer helps an LLM read a database effectively and consistently, reducing hallucinations and improper SQL syntax. It does this by acting as an intervening step between the LLM and the data. The LLM is the interface that lets a user ask a question about the data in natural language. From there, the LLM reads the semantic layer model and sees the requirements for querying the data. Rather than inventing its own SQL, the LLM generates a query using the semantic layer's attributes, which keeps the query it writes highly consistent (near deterministic). The semantic layer then rewrites that query into actual SQL and runs it against the data. The output is returned, and the LLM takes the resulting table and gives the user a natural language response.
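Here is a rough sketch of that flow in Python, building on the hypothetical SEMANTIC_MODEL from earlier. The structured request format and the compile_query function are assumptions for illustration only; real semantic layers such as dbt, Cube, or MetricFlow each expose their own APIs for this step.

```python
# Hypothetical compiler: the LLM emits a structured request naming
# semantic-layer attributes, and the layer (not the LLM) produces SQL.
def compile_query(model: dict, request: dict) -> str:
    entity = model[request["entity"]]
    dims = [f'{entity["dimensions"][d]} AS {d}' for d in request["dimensions"]]
    meas = [f'{entity["measures"][m]} AS {m}' for m in request["measures"]]
    sql = (
        f'SELECT {", ".join(dims + meas)}\n'
        f'FROM {entity["table"]} AS {request["entity"]}'
    )
    # Joins come from the model's declared relationships, never guessed.
    for name in request.get("joins", []):
        join = entity["joins"][name]
        sql += f'\n{join["type"]} {join["table"]} AS {name} ON {join["on"]}'
    if dims:
        sql += "\nGROUP BY " + ", ".join(str(i + 1) for i in range(len(dims)))
    return sql

# The LLM's output: semantic attributes only, no raw SQL.
request = {
    "entity": "orders",
    "dimensions": ["customer_region"],
    "measures": ["revenue"],
}
print(compile_query(SEMANTIC_MODEL, request))
# SELECT region AS customer_region, SUM(order_total_usd) AS revenue
# FROM analytics.fct_orders AS orders
# GROUP BY 1
```

Because the LLM only ever picks from named dimensions and measures, the same question compiles to the same governed SQL every time.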
Given the initial limitations of LLM -> Data we noted at the beginning of this article, here is how a semantic layer addresses them:
1. Enhanced Contextual Awareness: A semantic layer provides the necessary context for the LLM, allowing it to understand the relationships and hierarchies within the data as well as metric logic. This leads to more accurate query generation.
2. Schema Interpretation: By defining the database schema in a way that is understandable to the LLM, a semantic layer enables the model to construct queries that align with the database's structure, reducing the likelihood of errors.
3. Correct Syntax Generation: With a semantic layer guiding the LLM, the model can generate query syntax that adheres to best practices, minimizing the chances of producing incorrect or inefficient queries.
4. Improved Table Join Logic: A semantic layer clarifies the relationships between data entities, allowing the LLM to determine the most effective way to join tables. This results in more accurate and efficient query execution.
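Continuing the earlier sketch, point 4 is the easiest to see in code: the LLM only names the declared customers relationship, and the hypothetical compile_query supplies the join type and keys from the model.

```python
# The LLM names the relationship; the join type and keys come from the model.
request_with_join = {
    "entity": "orders",
    "dimensions": ["order_date"],
    "measures": ["order_count"],
    "joins": ["customers"],
}
print(compile_query(SEMANTIC_MODEL, request_with_join))
# SELECT order_created_at::date AS order_date, COUNT(DISTINCT order_id) AS order_count
# FROM analytics.fct_orders AS orders
# LEFT JOIN analytics.dim_customers AS customers ON orders.customer_id = customers.customer_id
# GROUP BY 1
```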
The final takeaway
The integration of a semantic layer with LLMs represents a significant advancement in how we interact with structured data. A semantic layer enhances contextual awareness, makes the schema interpretable, ensures correct syntax generation, and improves table join logic. As the real-world potential of Data + AI grows, the role of semantic layers will likely grow along with it, serving as a bridge between human language and the structured data that drives decision-making.