The Apache Arrow Ecosystem
The foundation, transport layer, query protocol and client interface
Apache Arrow is like an onion. As you peel away one layer, another sits below, ready to be assessed. This is the reason I love it so much. As a layperson who works in data, Arrow gives me out-of-the-box tools to just get something going. But it also offers an entire ecosystem that I can use to build a service or application from the ground up.
We normally know Arrow as the library that all other libraries need to function (everything needs PyArrow). It has relegated itself to an “under the hood” technology for most of its lifespan. The dirty secret in data tech, though, is that Apache Arrow is actually a fully functional set of tools that you can use right now to handle most of your data service or application needs.
One of my all-time favorite images on this topic is from the legend Dipankar Muzamdar, Director of Data/AI at Cloudera. He put the Apache Arrow ecosystem subprojects together in a nice stack so we can conceptualize how it all works together.
Here is the image and high level mental models about each:
Arrow: The foundational format that all your system’s data inputs, in-memory storage, serialized to disk as Arrow IPC (also called Feather), and all data output. This allows your system to move data more efficiently intra-process or from an I/O perspective since it shares memory buffers between sources and destinations without the need to make a copy first.
Arrow Flight: This is a protocol for high-performance data transport between clients and servers using Arrow as the payload. Instead of shipping CSV or JSON over HTTP, Flight streams record batches over gRPC (more on that below), so both sides can keep data in Arrow’s columnar memory layout end-to-end. It allows you to stream a large query result from a data system to a Python client for analysis in near real-time without converting to CSV or JSON.
Arrow Flight SQL: This is an extension to Arrow Flight. It is a framework that allows you to use SQL queries to request and transfer data via Arrow Flight. Without Arrow Flight, you would be writing gRPC code directly to get data moving. Arrow Flight SQL gives you the ability to use SQL syntax to call Arrow Flight gRPC and return data to your client from the server.
ADBC (Arrow Database Connectivity): These are drivers and a driver manager that allow you to connect directly to a list of databases. It is meant to replace ODBC and JDBC, which are older driver APIs and not geared toward columnar data sources, making them inefficient with modern analytical workloads. ADBC reads columnar data systems and transfers data zero copy to another columnar destination. So think BigQuery data transferred to DuckDB locally. There are a number of drivers for many databases, including Postgres. ADBC will take row-based data like Postgres and convert it to Arrow to then send it to your columnar destination. It essentially allows you to connect to almost any database without needing to use the database’s native connection API. So much less clutter.
How they work together in an application
Just understand that at the heart of it all is the Arrow data format. When systems decide to adopt the Arrow specification, then it makes it all that much easier to move data between two systems. So the first thing you might try is to simply leverage the PyArrow library to create an Arrow table and test some basic operations on the Arrow table in-memory. For details on how to do that, you can read my previous Substack about Arrow tables, which will get you up and running.
Another suggestion is to go buy Matt Topol’s book “In-Memory Analytics with Apache Arrow”. This is the book that will be the single most important reference guide other than the official Apache Arrow documentation. But it specifically gives fantastic breakdowns on how memory works with Arrow as well as how it is optimized for memory usage. It also references all of the above ecosystem tools and offers a few examples in different languages.
Given that we know Arrow is the foundation we want to use for a data service, let’s build out some infrastructure using Flight, Flight SQL, and ADBC and tie them all together in a nice application.
First Create an Arrow Flight Server to Read Data
The first thing we will do in our application is create a data service that will allow us to read from an S3 bucket with Parquet data. Arrow Flight is perfect for this.
It’s important to understand that Arrow Flight is an RPC service, not REST. It doesn’t involve you creating a bunch of unique endpoints. Instead, it is a framework of endpoint names with criteria for how you use the endpoints to make RPCs via gRPC between a server and a client. So think of it less like URL routes like “/users” and more like methods that you add to a Python class that you call. They need to conform to both an endpoint naming convention from the Arrow Flight protocol and also need to meet the RPC specifications for each endpoint. Arrow Flight gives you a framework to create these calls.
I’ll be pulling literally from Matt’s Python example from In-Memory Analytics with Apache Arrow. This is a literal copy of what Matt shows, but why try to reinvent the wheel here. You’ll notice that we use a class here and add methods. As well, there is a walkthrough for making a server in Go, a programming language I am beginning to really love.
The code below reads as follows:
Put some Parquet files with data into a public S3 bucket.
Create a Python class called Server that inherits from flight.FlightServerBase. FlightServerBase is from PyArrow Flight and provides the server framework for implementing a Flight RPC service. By inheriting from it, Server can implement Flight methods like list_flights.
Create the list_flights method. This inherits PyArrow Flight’s list_flights RPC, which allows clients to discover available datasets. It searches S3 for Parquet files and returns metadata about each one.
Parameters:
self - the Server instance
context - Flight request context (contains metadata like authentication, deadlines, etc.)
criteria - bytes object that filters which files to list (e.g., b’2009’ to filter by year)
Set the path as the bucket name for your S3 bucket. I have a public bucket with data that doesn’t mean anything, so I don’t need to set up authentication.
Create a selector that recursively searches the path [fs.FileSelector(path, recursive=True)].
Query S3 and return a list of all files and directories found [get_file_info()].
Then, loop through all files in the path directory using a generator function. From each file, create a FlightInfo object that has the Arrow schema, file path, and the number of rows in each file.
Finally, start the Flight server locally, connect a client to it, list all available datasets, and print their paths and row counts.
server.py
from reprlib import recursive_repr
import pyarrow as pa
from pyarrow import fs
import pyarrow.parquet as pq
import pyarrow.flight as flight
class Server(flight.FlightServerBase):
def __init__(self, *args, **kwargs) -> None:
super().__init__(*args, **kwargs)
self._s3 = fs.S3FileSystem(region='us-east-1', anonymous=True)
def list_flights(self, context, criteria):
path = 's3-public-bucket'
if len(criteria) > 0:
path += '/' + criteria.decode('utf8')
flist = self._s3.get_file_info(fs.FileSelector(path, recursive=True))
for finfo in flist:
if finfo.type == fs.FileType.Directory:
continue
with self._s3.open_input_file(finfo.path) as f:
data = pq.ParquetFile(f)
yield flight.FlightInfo(
data.schema_arrow,
flight.FlightDescriptor.for_path(finfo.path),
[],
data.metadata.num_rows,
-1
)
if __name__ == '__main__':
with Server() as server:
client = flight.connect(('localhost', server.port))
for f in client.list_flights(b''):
print(f.descriptor.path, f.total_records)The output looks like this:
Arrow Flight uses RPC (remote procedure calls), which I won’t get into in this article. Essentially, it lets you call a function on another machine like it’s local. Think, “Call this function over the network and give me the result.” When you see gRPC, that refers to a specific RPC framework created by Google. It is spec’d a little differently from standard RPC. It uses HTTP/2 and protocol buffers, which means it’s fast, lightweight, and great for streaming.
The above lists the files in the directory, but Matt’s example in the book shows how we can get data from the Parquet files as record batches. We make a couple of updates to it, like so:
server.py
# Add the additional info to the list_flights generator function
def list_flights(self, context, criteria):
path = 's3-public-test-data'
if len(criteria) > 0:
path += '/' + criteria.decode('utf8')
flist = self._s3.get_file_info(fs.FileSelector(path, recursive=True))
for finfo in flist:
if finfo.type == fs.FileType.Directory:
continue
with self._s3.open_input_file(finfo.path) as f:
data = pq.ParquetFile(f)
yield flight.FlightInfo(
data.schema_arrow,
flight.FlightDescriptor.for_path(finfo.path),
[flight.FlightEndpoint(finfo.path, [])], # added endpoint info
data.metadata.num_rows
# Create a do get method to then read the parquet as record batches and return them
def do_get(self,context,ticket):
file = self._s3.open_input_file(
ticket.ticket.decode('utf8'))
pf = pq.ParquetFile(file, pre_buffer=True)
def gen():
try:
for batch in pf.iter_batches():
yield batch
finally:
file.close()
return flight.GeneratorStream(pf.schema_arrow, gen())
# Update the main function at the bottom of the script to read the record batches and return them in the terminal
if __name__ == '__main__':
with Server() as server:
client = flight.connect(('localhost', server.port))
for f in client.list_flights():
print(f.descriptor.path, f.total_records)
flights = list(client.list_flights())
data = client.do_get(flights[0].endpoints[0].ticket)
print(data.read_all())New output:
The above is just updates and additions to the original code. This allows us to actually return Arrow data in memory that we could then do even more with. In the end, what it does is:
Connects to a public S3 bucket anonymously
Exposes Parquet files as “flights” via list_flights
Streams Parquet data back to clients via do_get
The RPCs that Arrow Flight enacts are very useful when you need to retrieve a single table or some other unique response. But when we want to actually do aggregations on data BEFORE it is returned, we need something a little extra. That’s where Arrow Flight SQL comes into play.
If you want to see a deeper dive into this simple server, I have a Youtube video on it:
Using Flight SQL to Work with a Query Engine to Read Data
Arrow Flight SQL was one of the harder things to understand for me at first. But it clicked once I broke down what it is and what it isn’t.
Arrow Flight SQL is a transport protocol. It defines how a client and server communicate about SQL:
Here’s a SQL string to execute
Here’s a ticket representing the result
Here’s the schema of what I’ll return
Here are the catalog metadata RPCs (GetTables, GetSchemas, etc.)
Flight SQL isn’t a query engine. It doesn’t actually run the queries. The server you write has to wire those incoming SQL commands to a real engine. DuckDB is the most natural fit here because:
It speaks Arrow natively (zero-copy to/from Flight record batches)
It can read S3 Parquet files directly
It handles full SQL execution (parsing, planning, execution)
The Arrow Flight server we made above simply allows you to call RPCs to the S3 bucket where our data is and return something. So it acts simply as a function library. This works well when we have a specific thing we need, but incorporating DuckDB will allow us to get the benefits of a nice query engine to run optimized analytical workloads.
We will create a new script that will move from the pure RPC calls in the initial script into DuckDB integration. This is also where we will create both a server script and a client script. Below is how our updated application will work at a high level.
Note: I am testing this with a public S3 bucket I have, so I don’t need to fiddle with credentials.
We create a server script that we will run in one terminal, and then run a client script that we will run in a second terminal. When we run the client script, it will encode and send a query string to our Arrow Flight server via gRPC. The server will decode the query and pass it to DuckDB to execute as an Arrow table. The Flight server then streams the Arrow table returned by DuckDB back to the client and prints it in the terminal.
server_sql.py
import duckdb
import pyarrow.flight as flight
from pyarrow.flight import FlightServerBase, RecordBatchStream, FlightDescriptor
from typing import Iterator
import os
from dotenv import load_dotenv
# Load environment variables from .env file
load_dotenv()
class DuckDBSqlServer(FlightServerBase):
def __init__(self, location="grpc://0.0.0.0:8815"):
super().__init__(location=location)
# Create DuckDB connection
self._con = duckdb.connect()
self._setup_s3_access()
def _setup_s3_access(self):
"""Configure S3 access for DuckDB."""
try:
# Check for AWS credentials in environment
access_key = os.getenv('AWS_ACCESS_KEY_ID')
secret_key = os.getenv('AWS_SECRET_ACCESS_KEY')
region = os.getenv('AWS_REGION', 'us-east-1')
if access_key and secret_key:
# Create secret with explicit credentials for private buckets
self._con.execute(f"""
CREATE SECRET (
TYPE S3,
KEY_ID '{access_key}',
SECRET '{secret_key}',
REGION '{region}'
)
""")
print(f"S3 access configured with credentials from environment")
else:
# For public buckets, try credential chain first, then allow anonymous access
try:
self._con.execute("""
CREATE SECRET (TYPE S3, PROVIDER CREDENTIAL_CHAIN)
""")
print("S3 access configured with credential chain")
except Exception:
# Allow anonymous access for public buckets
print("S3 configured for public bucket access (anonymous)")
except Exception as e:
print(f"Warning during S3 setup: {e}")
def do_get(self, context, ticket):
"""
Execute a query and return results as Arrow batches.
"""
try:
query = ticket.ticket.decode('utf-8')
result = self._con.execute(query).to_arrow_table()
return RecordBatchStream(result)
except Exception as e:
raise flight.FlightServerError(f"Query failed: {e}")
def get_schema(self, context, descriptor):
"""
Return the schema for a query without executing it fully.
"""
try:
query = descriptor.command.decode('utf-8')
result = self._con.execute(query).to_arrow_table()
return flight.SchemaResult(result.schema)
except Exception as e:
raise flight.FlightServerError(f"Schema retrieval failed: {e}")
def get_flight_info(self, context, descriptor):
"""
Get info about a flight (query).
"""
try:
# Extract query from descriptor command
query = descriptor.command.decode('utf-8') if isinstance(descriptor.command, bytes) else descriptor.command
result = self._con.execute(query).to_arrow_table()
# Create endpoint with keyword arguments
ticket = flight.Ticket(descriptor.command)
location = flight.Location(f"grpc://localhost:8815")
endpoint = flight.FlightEndpoint(ticket=ticket, locations=[location])
return flight.FlightInfo(
schema=result.schema,
descriptor=descriptor,
endpoints=[endpoint],
total_records=len(result),
total_bytes=result.nbytes
)
except Exception as e:
raise flight.FlightServerError(f"Flight info retrieval failed: {e}")
if __name__ == "__main__":
server = DuckDBSqlServer()
print(f"Server running on localhost:{server.port}")
server.serve()The client script connects to localhost and passes the bucket name. It constructs a SQL query that reads Parquet files from the S3 bucket and wraps the query in a Flight descriptor using FlightDescriptor.for_command(). It then calls get_flight_info( ) to get flight metadata, uses do_get( ) to fetch the actual data using a ticket, and reads all rows into a PyArrow table.
It then displays:
Row count
Schema (column names and types)
Data previewclient.py
client.py
import pyarrow.flight as flight
import sys
import os
from dotenv import load_dotenv
# Load environment variables from .env file
load_dotenv()
def main():
# Connect to server
client = flight.connect(("localhost", 8815))
# Public S3 bucket - use environment variable or default
# Example public buckets:
# - s3://s3-public-test-data/streaming_data_001.parquet
# - s3://s3-public-test-data/nyc-taxi-parquet/year=2024/month=01/*.parquet
# - s3://nyc-tlc/trip-data/*.parquet
bucket = os.getenv('ARROW_FLIGHT_BUCKET', 's3://s3-public-data/*.parquet')
# Execute query reading from public S3 bucket
query = f"""
SELECT
*
FROM read_parquet('{bucket}')
LIMIT 100
"""
print(f"Executing query on public S3 bucket:")
print(f"Bucket: {bucket}\n")
# Create a descriptor with the query as command
descriptor = flight.FlightDescriptor.for_command(query.encode('utf-8'))
try:
# Get flight info for the query
info = client.get_flight_info(descriptor)
# Get the data using the ticket
reader = client.do_get(info.endpoints[0].ticket)
table = reader.read_all()
print(f"✓ Successfully retrieved {len(table)} rows")
print(f"\nSchema: {table.schema}")
print(f"\nData preview:")
print(table)
except Exception as e:
print(f"✗ Error executing query: {e}")
sys.exit(1)
if __name__ == "__main__":
main()The output looks like this:
That was an example of how we can make direct RPC calls to a Flight server using SQL query syntax with Flight SQL and execute them with DuckDB. As you saw, this involves a server-client relationship, but it is very scalable and can power distributed systems. But what if you just wanted to query a database or data warehouse and bring data in that way? The smoothest way to handle that is with ADBC.
Talk Directly to a Database with ADBC
While Arrow Flight allows for a way to make direct RPC calls via gRPC to a server, it is also a more “infrastructural” library. Albeit, one that can be the bedrock of many incredible data tools. But in day-to-day work (analytics, data science, BI) we tend to want to have a way to easily access different data and bring it into our local computer to work with. We aren’t trying to set up a client/server system and, in fact, may be using something like Power BI or an analytical frontend like Streamlit to call the database of our choice. This is great for ad-hoc work and more structured analytics.
Simply look at my Substack home page and you’ll see a few different articles I wrote about ADBC. I highly suggest you start with my intro to ADBC here. Without getting into what database drivers are, the big takeaway is that ADBC allows you to call any number of databases (both row- and columnar-oriented) and bring data into your local environment for use in an application. It has an extremely low barrier to entry and it is, in my opinion, the de facto way to read, write, and transfer data.
When we work with different databases and data warehouses, the current paradigm is to use the specific APIs for each. BigQuery has an API, Snowflake has an API, and there’s an API for Postgres. Historically, the tool used to centralize all those different connections was JDBC or ODBC. However, modern data needs have moved further away from single-row (transactional) workloads and more toward columnar (analytical) workloads. Both J/ODBC are not well suited to reading from columnar data systems like BigQuery and MotherDuck. This is where ADBC comes in and elegantly gives us the data connection tool of the future.
While ADBC can read from external data systems like Postgres and Databricks, it can also read from local DuckDB instances. It reads as Arrow and passes data along as Arrow, so it has “zero copy” transfer from one system to another. So you can easily read from Postgres (row-oriented data that gets turned into Arrow) and place it into a DuckDB instance locally to use in an application or for general ad-hoc purposes. I have specifically used it to read data from multiple sources into a Streamlit app and to read from a local DuckLake instance to train an XGBoost ML model.
For our current pattern of reading Parquet data on S3, you might find it embarrassingly easy to do that with ADBC. You will need a few things:
dbc installed, which is the best driver manager for ADBC drivers
uv tool install dbc
The DuckDB driver for our connection
dbc install duckdb
PyArrow and the ADBC library for the Python implementation
uv pip install adbc_driver_manager pyarrow
Something a bit different about ADBC vs. Arrow Flight is that it uses downloaded drivers specific to each data system you want to connect to, along with a library. This is why you need the above dbc tool installed as well as the Python libraries.
From here, we create an in-memory instance of DuckDB and connect to it. We then use DuckDB to read the S3 data and return the Arrow table like we did with the Arrow Flight server. This can all be done with a single script!
client_adbc.py
from adbc_driver_manager import dbapi
import os
from dotenv import load_dotenv
# Load environment variables from .env file
load_dotenv()
def main():
# Public S3 bucket - use environment variable or default
# Example public buckets:
# - s3://s3-public-test-data/streaming_data_001.parquet
# - s3://s3-public-test-data/nyc-taxi-parquet/year=2024/month=01/*.parquet
# - s3://nyc-tlc/trip-data/*.parquet
bucket = os.getenv('ARROW_FLIGHT_BUCKET', 's3://s3-public-data/*.parquet')
# Execute query reading from public S3 bucket
query = f"""
SELECT
*
FROM read_parquet('{bucket}')
LIMIT 100;
"""
print(f"Executing query via ADBC (Direct DuckDB):")
print(f"Bucket: {bucket}\n")
try:
# Connect to local ducklake instance
with dbapi.connect(
driver="duckdb"
) as con, con.cursor() as cursor:
# Now run your actual query
cursor.execute(query)
table = cursor.fetch_arrow_table()
print("\nQuery results:")
print(table)
except Exception as e:
print(f"✗ Error executing query: {e}")
raise
if __name__ == "__main__":
main()The output looks like this:
I’ve already linked a number of other examples above (definitely check out the Streamlit app repo I made), which should really help you see all the things ADBC can offer. If you go through these quick starts as well, you will see that you can even use ADBC to connect to a Flight SQL server when it is used for a data system like Starrocks. So you will even build upon a Flight SQL server with ADBC as well. It all shows how composable and wonderful the Apache Arrow ecosystem is.
This is Really Cool, So What’s Next?
The Apache Arrow ecosystem is a wealth of open source riches for us to experiment with. I’m in awe of what it offers us. That said, we will continue to dive deeper into what Apache Arrow can achieve and look at how we can push the limits of what we can do with it. We will look at compute operations over streaming data and start to dig into using Arrow with languages other than Python. I’m excited to dig even deeper into this world!
Hi, my name is Hoyt. I’ve spent different lives in Marketing, Data Science and Data Product Management. Other than this Substack, I am the founder of Early Signal. I help data tech startups build authentic connections with technical audiences through bespoke technical content and intentional distribution. Are you an early stage start up or solopreneur wanting to get creative with your technical content and distribution strategy? Let’s talk!










You're gonna make Matt's book jump up on the order of books I want to read!!