I have had the wonderful opportunity to design and use truly scalable Iceberg tables in production, so I’m going to attempt to write down some thoughts about it. Disclaimer: I am heavily biased towards time series data.
Before Iceberg, I saw a lot of tools, and also built a lot of tools, to enable users to query massive amounts of time series data at scale. The first tool I ever saw that did this well is called Sensorcloud. It allowed sub-second query times on billions and even trillions of data points, but it had some huge flaws when I attempted to use it at my current org. The first major flaw is that data had to be uploaded in order: if I uploaded data for a certain signal today, I simply could not upload any data for that signal from before today. That is a huge bummer, because it rules out backfilling entirely and forces you to upload data in order, which leads into the next issue, scale. I was primarily focused on making aircraft data visible, and there were ~20,000 unique signals being reported by the telemetry and instrumentation systems. Sensorcloud could only handle so many connections uploading data at a time, and an upload could only work with one signal at a time, so this led to a permanent snowballing problem: we could not continuously upload data as we got it, and when a test ended it would take a long time to upload the data, which left engineers unhappy!
This led to a period of about six months where I was focused on writing a system that could plot time series data extremely fast in a web browser. I took frontend inspiration from Plotly Resampler and built a backend API and metadata service with Litestar and DynamoDB. It took a SQL query from the frontend, asked DynamoDB for the list of files containing the data for a given trace, pushed the file scans down to polars for fast reads, and then used tsdownsample (the Plotly Resampler backend) to return a downsampled version of the traces to the frontend. All the data was partitioned by signal name in S3 with hive-style partitioning. This worked great in a little demo, but it didn’t scale well with multiple writers and high ingest volumes, nor did it optimally compact the parquet files or maintain a proper catalog over them. The project slipped away from me, and a year and a half later I pushed Apache Iceberg really hard internally and we got it set up. This system was called Hyperplot, and a slightly more robust demo exists in a codebase I cannot fully share quite yet. Note that the linked implementation does not use DynamoDB at all. Once my team discovered the power of Iceberg, we forgot about our home-baked implementations.
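For the curious, the read path looked roughly like the sketch below. This is a minimal reconstruction, not the actual Hyperplot code; the bucket layout, column names, and helper signature are all assumptions, and the real service resolved the file list through DynamoDB first rather than globbing a prefix.

```python
# Minimal sketch of a Hyperplot-style read path (paths and column names are
# hypothetical). The real service first asked DynamoDB which files to scan.
import polars as pl
from tsdownsample import MinMaxLTTBDownsampler

def read_trace(signal: str, start: int, end: int, n_out: int = 2_000):
    # Hive-style partitioning by signal name prunes the scan to one S3 prefix
    lf = pl.scan_parquet(f"s3://telemetry-bucket/traces/signal_name={signal}/*.parquet")
    df = (
        lf.filter(pl.col("timestamp").is_between(start, end))
          .select("timestamp", "value")
          .sort("timestamp")
          .collect()
    )
    ts, vals = df["timestamp"].to_numpy(), df["value"].to_numpy()
    # tsdownsample returns the indices of the points to keep for plotting
    idx = MinMaxLTTBDownsampler().downsample(ts, vals, n_out=n_out)
    return ts[idx], vals[idx]
```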
Iceberg is magic. It nails what BigQuery did, which is fully separating storage from compute, and it is extremely interoperable, so anyone who can use an existing implementation of the spec, or write their own (which could be a fun exercise one day), can take advantage of it. We basically implemented the same table design from the Hyperplot project, where every signal gets its own partition. This turned out to be incredible. Queries took between 4 and 10 seconds for a single trace with Athena, but for the first time ever, every signal that came off of the plane was visible to every person at the organization, which is a huge win for multiple reasons.
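To make that concrete, here is a minimal sketch of the kind of DDL involved, run through `spark.sql` as described later. The catalog, schema, and column names are assumptions, not the production table:

```python
# Hypothetical names throughout; the point is the identity partition on the
# signal name, so a single-trace query only ever touches one partition.
spark.sql("""
    CREATE TABLE glue.telemetry.signals (
        signal_name STRING,
        ts          TIMESTAMP,
        value       DOUBLE
    )
    USING iceberg
    PARTITIONED BY (signal_name)
""")

# Define a table-level sort order that later maintenance jobs can reuse
spark.sql("ALTER TABLE glue.telemetry.signals WRITE ORDERED BY signal_name, ts")
```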
Something to consider carefully is the tradeoff between parquet row filters and scanning files. If you partition your table by `hour(timestamp)` and many of your queries span more than an hour of data, or even just overlap an hour boundary, you will end up with more small parquet files and more file scan tasks. Contrast that with partitioning by `month(timestamp)`, which lets Iceberg compact data files optimally and rely on predicate pushdown with parquet row filters. I choose to believe that fewer files is the faster option, as maintaining a bunch of small parquet files is definitely a nightmare.
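For reference, both granularities are just different partition transforms on the same timestamp column; a purely illustrative pair of tables might look like:

```python
# Illustrative table names only; the two statements differ only in the
# partition transform applied to the timestamp column.
spark.sql("CREATE TABLE glue.demo.readings_hourly  (ts TIMESTAMP, value DOUBLE) USING iceberg PARTITIONED BY (hours(ts))")
spark.sql("CREATE TABLE glue.demo.readings_monthly (ts TIMESTAMP, value DOUBLE) USING iceberg PARTITIONED BY (months(ts))")
```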
The number one thing I think about when creating a table, whether it is in Redshift, Postgres, or Iceberg, is access patterns. Are people querying this database? Are dashboards querying this data? How are those people and machines querying it? In the case of Redshift this is a much more challenging question, as only people who understand how the table is distributed and sorted should be writing queries, but with Iceberg you get to pick your query engine, and speed varies greatly across engines – a luxury that most databases simply don’t have. In my case, most people wanted to see a single trace or multiple traces on a plot, so it becomes simple: every trace/field gets its own partition.
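A typical single-trace query in that layout looks something like the following (the signal name and time range are made up); the equality predicate on the partition column is what keeps the scan small:

```python
# Hypothetical signal name and time range; partition pruning on signal_name
# means only one partition's data files are ever scanned.
df = spark.sql("""
    SELECT ts, value
    FROM glue.telemetry.signals
    WHERE signal_name = 'engine_1_egt'
      AND ts BETWEEN TIMESTAMP '2024-06-01 00:00:00' AND TIMESTAMP '2024-06-02 00:00:00'
    ORDER BY ts
""")
```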
Maintenance plays an extremely large role in Iceberg tables. Keeping your data files compacted ensures that queries perform relatively similarly across partitions, and even within partitions. Compaction is not the only thing that matters, though: the other arguments to `RewriteDataFiles` are just as important. You can specify how data is sorted using a variety of strategies. In my case, just calling `sort()` with no arguments works great, since the sort order is defined as a table property, which `RewriteDataFiles` uses when you specify `sort()` with no additional arguments. Another very important maintenance action is snapshot expiration. In my case there is no real benefit to keeping older snapshot versions, as nobody cares about older snapshots. Everyone wants the most up-to-date and relevant data.
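In Spark SQL these maintenance actions are exposed as stored procedures; a sketch of both, against the same hypothetical table as above, might look like this:

```python
# Compact small files and sort within them; with strategy 'sort' and no
# explicit sort_order, the table's own sort order is used.
spark.sql("""
    CALL glue.system.rewrite_data_files(
        table => 'telemetry.signals',
        strategy => 'sort'
    )
""")

# Expire old snapshots (and the files only they reference); the timestamp and
# retention count here are placeholders, not a recommendation.
spark.sql("""
    CALL glue.system.expire_snapshots(
        table => 'telemetry.signals',
        older_than => TIMESTAMP '2024-06-01 00:00:00',
        retain_last => 2
    )
""")
```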
That has been my experience. I have not had a huge issue with orphaned files quite yet, as the main tables I work with as of this writing use a single writer. I’m sure there are a host of other optimizations I could make, but for those, see the official docs.
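If orphaned files ever do become a problem, there is a procedure for that too; a hedged example, using the same assumed names, would be:

```python
# Removes files under the table location that no snapshot references; by
# default only files older than a safety window are deleted.
spark.sql("CALL glue.system.remove_orphan_files(table => 'telemetry.signals')")
```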
As mentioned above, when using AWS Athena on the tables I owned, query times ran anywhere from 4 to 10 seconds for a single trace. Generally this was fine, given that most of this data previously had to be custom exported for any consumer who wanted to see it, and a subset was always made available in a traditional data warehouse. I got a little bored one day and decided to try different query engines against the tables, and discovered that Trino, even running locally on my Mac, performed better than every other engine I tried, which was remarkable. After a while spent fighting with EKS, Helm charts, and observability, Trino was deployed on EKS and query times went from 4-10 seconds to sub-second responses. This is huge, because not only is it faster for the hundreds of people loading dashboards and exporting data, it also means every signal in that table is available with zero extra work to persist it to a “more performant” solution like a data warehouse. To put it simply: more data is available faster, with less maintenance and supervision required from the data engineers who would otherwise own the ETL process from data lake -> data warehouse. The Trino discovery also opens the door to removing data warehouses from the workflow entirely for other datasets, which would reduce costs overall since we would no longer pay for dedicated compute. Trino on EKS scales up and down extremely reliably.
As it currently stands, Iceberg is extremely interoperable. I can define a table in Spark, Trino, Python, Rust (and others I am forgetting) and query those tables in even more tools. To me, this raises the question: “what is the best way to interoperate with Iceberg tables?” The obvious answer is “it depends”. Personally, I am of the opinion that DDL should be treated as IaC. One can sort of mimic this with Alembic, but it doesn’t really support Iceberg in the ways one would need it to. Until something fully supports that, EMR jobs controlled through the CDK work pretty well. This means that all DDL and maintenance generally happens in EMR with PySpark via `spark.sql` calls, where Iceberg is best supported, since the Java implementation is considered “upstream”. The same goes for writing to a table: I am generally using EMR + PySpark to write to Iceberg tables, with very few exceptions where it is just easier to use PyIceberg on relatively small data. As for querying, anything goes as long as it can read the Iceberg spec correctly.
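For those small-data exceptions, a PyIceberg append is about this much code. The catalog and table names are assumptions, and the Arrow schema has to match what the table already defines:

```python
# Hypothetical catalog/table; PyIceberg reads connection details from its own
# configuration (e.g. ~/.pyiceberg.yaml or environment variables).
from datetime import datetime

import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog("glue")
table = catalog.load_table("telemetry.signals")

batch = pa.table({
    "signal_name": ["engine_1_egt"],
    "ts": [datetime(2024, 6, 1, 12, 0, 0)],
    "value": [412.5],
})
table.append(batch)
```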