The Overview
You can read my initial overview of what a Last Mile Data Pipeline is on my LinkedIn HERE.
TL;DR: A Last Mile Data Pipeline is the final transfer stage of data into a stakeholder- or customer-facing analytics platform or UI of some sort. "Last mile" is a nod to fulfillment logistics, where it refers to the final stage of the delivery process: the leg where a product or package is transported from a warehouse or distribution center to the end customer's doorstep.
The Anatomy of a Last Mile Data Pipeline
A last mile data pipeline generally consists of three basic steps:
Read data from a source
Serialize and transfer the data across a network
Parse the data and render in some kind of UI
There are a number of ways you can optimize each stage, but first you need to understand the parties involved in last mile data pipelines: the publisher (the data storage layer) and the subscriber (the data consumer layer).
First, let's start with the subscriber. This layer is normally your data product. It can be an analytics platform, an enterprise data tool, or simply code that is calling an API. The main takeaway is that the subscriber can take a number of forms. However, the subscriber tends to have some criteria about the data it is consuming. Namely, it usually wants to filter the data first in some way, whether by date range or by some type of category filter. Ideally, the subscriber would like to pass filters to the publisher so that it gets only certain data back, and can then start working with that data in its UI. The subscriber may also want to call multiple different data sets so it can bring all the necessary domain data into one page and leverage client-side code to build dashboards, graphs, tools, etc. The data does not all need to be in one table, because the subscriber knows where all the domain data is stored and can therefore put together whatever logic is necessary to merge and munge it on the client side before rendering it however it wants.
The publisher, on the other hand, has far fewer criteria. The publisher is simply a storage layer that holds the domain data the subscriber needs. Now, to better control the data itself, it is best to have all your publishers living in the same place, normally on a single vendor (AWS, GCP, Azure) that can accommodate both data warehouse tables and cloud storage buckets. This is because production analytics models often sit either in a data warehouse or as a file of some sort in vendor cloud storage. There's more to the story in terms of how they end up in either situation, but that is out of scope for last mile data pipelines. What we care about most is what happens once that data has been sent to the publisher.
Why Do File Formats Matter?
Now that we understand the stages of a Last Mile Data Pipeline, and the parties involved, let’s discuss what is being transferred between each party. In last mile data pipelines, you are transferring data that the publisher is storing as some type of file or table format. File formats have different pros and cons depending on which one is used. When we are trying to optimize last mile data pipelines, there tends to be a clear North Star we are striving towards. That North Star is speed. Specifically, the speed of reading data on the publisher side, transferring data across a network and parsing data on the subscriber side. File formats play a huge role in the efficiency of all three of these stages. Generally speaking, there are a few file formats that are most leveraged in last mile data pipelines. They are as follows:
CSV
JSON
Parquet
Each of these file formats has its own optimal use cases, but many times they are used in suboptimal ones. When it comes to read speed, there is a clear distinction between them.
CSV and JSON
One of the big pros of CSV and JSON is that they are human readable, which means you can open the source file on the publisher side and read what's in it without needing to decode it. It also makes the subscriber happy when they can read the response from the file transfer quickly and easily.
However, these file formats can have issues with optimized analytic workloads because both CSV and JSON are “row based” in nature. That means that if you want to look at a single value of a column, you will actually need to read in the entire row of data for that value:
Notice what is happening above. You may have only wanted that specific value, however CSV and JSON are structured so that you have to read the entire row to get that value.
Now what would that mean if we wanted an entire column of data? Well, it would be something like this:
To read in the values of an entire column would require you to read the entire object. This creates a major read bottleneck when the column you want to read has thousands, or millions of rows of data. You are essentially waiting for the entire object to be read just so you can keep a small amount of it.
That is not optimal!
Now we need to quickly add a note here. CSV and JSON are very speed efficient for reads when the data set is smaller (a common rule of thumb is under a gigabyte). I will also add that libraries like pandas have multiple options for reading CSV more efficiently. However, there is still a speed difference between reading CSV and the other file formats.
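As a rough sketch of that trade-off in pandas (the file and column names here are made up for illustration), compare a full CSV read against one restricted to a single column:

```python
import pandas as pd

# Naive read: every row and every column is parsed just to use one column.
df_full = pd.read_csv("events.csv")
print(df_full["revenue"].sum())

# Narrower read: pandas skips parsing the columns you don't need.
# This helps, but the file still has to be scanned from top to bottom.
df_narrow = pd.read_csv("events.csv", usecols=["revenue"])
print(df_narrow["revenue"].sum())
```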
Apache Parquet
The Apache Parquet file format, on the other hand, is a columnar storage format optimized for analytical workloads, offering high efficiency in compression, encoding, and query performance. It is stored in binary (not human-readable), with metadata embedded for schema and statistics (e.g., min/max values).
When you read from a Parquet file, the compressed column data you touch needs to be decompressed before it can be used. However, decompression is not the same as reading, and it generally happens very, very quickly. Where Parquet excels in analytical workloads, like the example above with CSV, is that it is structured by columns (not rows). So if you need to SUM( ) up a single column of a Parquet file, you will ONLY read that column. All other columns will be left alone. It can be visualized this way:
It is also easy to filter the data from a single column when reading a Parquet file. This is possible because Parquet also keeps metadata about the data in the footer of the file. So when code reads a Parquet file, it first looks at the metadata to understand where the column lives and what values to look for, especially if there is a filter being used (i.e. SELECT * FROM Parquet WHERE X >= Y). This improves read speed drastically, especially on very large datasets.
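Here is a minimal sketch of both ideas using PyArrow (the file and column names are placeholders): column pruning via the columns argument, and predicate pushdown via filters, which uses the footer statistics to skip chunks that cannot match.

```python
import pyarrow.parquet as pq
import pyarrow.compute as pc

# Column pruning: only the "revenue" column chunks are decompressed and read.
table = pq.read_table("events.parquet", columns=["revenue"])
print(pc.sum(table["revenue"]).as_py())

# Predicate pushdown: footer min/max statistics let the reader skip
# row groups that cannot possibly satisfy the filter.
filtered = pq.read_table(
    "events.parquet",
    columns=["event_date", "revenue"],
    filters=[("revenue", ">=", 100)],
)
print(filtered.num_rows)
```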
The File Format Optimization Takeaway
Leveraging parquets as your default publisher file format is an optimization, even if your data is relatively small. That way, if your data suddenly explodes overnight, your read speed will still be fast at scale, as opposed to using CSV or JSON files directly.
Note again: there are some instances where Parquet might create more bottlenecks than a CSV or JSON. No file format is bulletproof, but this optimization should handle most use cases.
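If you do adopt that default, the conversion itself is small. A quick sketch with pandas (file names are placeholders; pandas writes Parquet through PyArrow or fastparquet under the hood):

```python
import pandas as pd

# One-time conversion of an existing CSV extract into Parquet,
# so the publisher stores the columnar, compressed version by default.
df = pd.read_csv("daily_extract.csv")
df.to_parquet("daily_extract.parquet", index=False)
```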
Why Does Serialization Matter?
via Gemini:
Data serialization is the process of converting a data structure or object into a format suitable for storage or transmission, allowing it to be reconstructed later. This is often done by flattening the data into a stream of bytes.
Let's briefly talk about serialization. The definition above says enough, but let's explain the differences in the types of objects that get serialized. When we serialize an object, we are normally confined to what the object allows. For example, a JSON object is a popular payload response from an API call because it is very human readable and widely used across multiple programming languages. This makes it easy for one service in a certain code base to send a message to another service in a different code base. However, serializing JSON has its drawbacks. Specifically, JSON does not get serialized into a binary format. Instead, the serialized output is typically a plain text string that represents the structured data in a lightweight format. A text representation uses more memory than a compact binary one, which creates much larger objects to push across a network at scale. JSON also requires you to serialize the entire object. This creates a huge bottleneck when you want to pass large datasets across a network from a publisher to a subscriber.
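You can see the text-based nature of JSON serialization directly in Python. A small sketch (the records are made up):

```python
import json

rows = [{"id": i, "revenue": i * 1.5} for i in range(1000)]

# json.dumps produces one plain text string: every number is spelled out
# as characters, and the whole object is serialized before any of it can be sent.
payload = json.dumps(rows)
print(type(payload))                                    # <class 'str'>
print(len(payload.encode("utf-8")), "bytes of text on the wire")
```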
On the other hand, file formats like Parquet are compressed formats that can hold large amounts of data in small file sizes. These files take up much less memory and can also be "chunked" when being sent over a network, meaning they can be sliced at the row level, across all columns, and sent through the network as chunks of arrays. This allows you to transfer even large files across the network and efficiently render them in the subscriber-side UI.
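One way to see that chunking in practice is PyArrow's batch iterator, which walks a Parquet file as Arrow record batches, row slices across all columns, that can each be sent independently. A sketch (the file name and the transport function are placeholders):

```python
import pyarrow.parquet as pq

def send_over_network(batch):
    # Placeholder for whatever transport you use (HTTP, gRPC, an Arrow IPC stream, ...).
    print(f"sending {batch.num_rows} rows, {batch.nbytes} bytes")

pf = pq.ParquetFile("events.parquet")

# Stream the file as row-level chunks across all columns instead of
# serializing one giant object up front.
for batch in pf.iter_batches(batch_size=64_000):
    send_over_network(batch)
```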
The Serialization Optimization Takeaway
Parquets are compressed binary objects that can be chunked and sent across a network more efficiently than a JSON or CSV.
Working Together
Now that we understand both parties involved, and their high level criteria, let’s talk about the ways they can work together:
Subscriber pulls data from publisher data warehouse table via an API, with a payload response:
This is a fast and easy way for the subscriber to get data from the publisher. Normally, this would be either a SQL query sent via an API or Python code that lets you pull data.
To optimize this, you need to confirm what the response payload format is. Many times, the API call response will return a JSON object, which will result in network bottlenecks and slow page load times on the UI side when trying to read large amounts of data from the publisher.
Look at the data warehouse API to see what the options are. For example, the BigQuery API has an option to read the data in the Apache Arrow format. This uses a chunking strategy, similar to what I described with Parquet files, except it works against proprietary columnar data warehouses like BigQuery. This is a much more efficient way to read large amounts of data and render it quickly on the subscriber UI.
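With the google-cloud-bigquery Python client, for example, a result set can be pulled down as an Arrow table rather than parsed JSON rows. A minimal sketch (the query and table are made up, and credentials setup is omitted):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical query against a publisher table.
sql = "SELECT event_date, revenue FROM analytics.daily_metrics"

# result().to_arrow() downloads the rows as Arrow data,
# which is far cheaper to transfer and parse than a JSON payload.
arrow_table = client.query(sql).result().to_arrow()
print(arrow_table.num_rows, arrow_table.schema)
```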
Subscriber reads a file format from cloud storage via a cloud storage API:
You have any type of file stored in S3, GCS, or Cloudflare R2 and can access it via a Python API.
From the above discussion around file formats and serialization, sticking with Parquet as your default file type is an optimization.
Given that, you have a couple options on how this can work.
You could serialize the entire Parquet file and transfer it over the network, then parse the data as needed on the subscriber side. This is also similar to how you would feed a local analytical database like DuckDB.
You leverage Apache Arrow and a Python library like Polars to do lazy evaluation on the Parquet files, which allows you to build an optimal plan to decompress and filter the Parquet data on the publisher side before you transfer it to the subscriber. This lets you transfer ONLY the data you need for the subscriber-side UI, which offers maximum flexibility on the subscriber side and, I believe, is the best optimization for this pattern (see the sketch below).
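Here is what that second option can look like with Polars scanning Parquet straight out of object storage (the bucket path and column names are placeholders):

```python
from datetime import date

import polars as pl

# Lazy scan: nothing is read yet, Polars just builds a query plan.
lazy = pl.scan_parquet("s3://my-bucket/metrics/*.parquet")

# Only the selected columns and the chunks that can match the filter
# are decompressed and pulled across the network when .collect() runs.
result = (
    lazy
    .filter(pl.col("event_date") >= date(2024, 1, 1))
    .select(["event_date", "revenue"])
    .collect()
)
print(result.shape)
```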
Subscriber pulls from a publisher transactional table like MySQL or Postgres:
MySQL and Postgres are row-based transactional databases. Historically, you would use an open source database connectivity standard like JDBC to grab data from these types of publishers. JDBC, specifically, returns rows of data, which is not optimal for our analytics workloads.
To optimize this, you want to look at leveraging newer technologies like ADBC from Apache Arrow. ADBC reads data from these tables more efficiently by returning results as Arrow data and streaming them over the network in chunks.
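A minimal sketch of that pattern with the ADBC PostgreSQL driver (the connection string and table are placeholders):

```python
import adbc_driver_postgresql.dbapi as pg_dbapi

# Placeholder connection string.
conn = pg_dbapi.connect("postgresql://user:password@host:5432/analytics")

cur = conn.cursor()
cur.execute("SELECT event_date, revenue FROM daily_metrics")

# Instead of row-by-row tuples, ADBC hands back the result as an Arrow table.
arrow_table = cur.fetch_arrow_table()
print(arrow_table.num_rows)

cur.close()
conn.close()
```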
Just the Beginning
This is just the beginning of the journey in designing, and optimizing, last mile data pipelines. The goal here was to get to the heart of what the term means, to see the main players in the pattern, and to look at a few scenarios where optimization comes into play. I'm very excited to continue writing about this subject!