
I have spent a lot of time recently optimizing my organization’s data warehouse on AWS Redshift, and I thought it would be a worthwhile exercise to concisely summarize what I learned and touch on how Redshift compares to BigQuery, which is what my org was using before our migration.

First off, the only blanket statement I can make is that BigQuery is the more managed solution. The compute scaling just magically works, which is awesome: it is one less thing to actively think about, beyond setting cost limits so someone can’t write a query that bankrupts your business. But at some point, if you are actually managing 1+ trillion data points in a data warehouse, you should probably know what you are doing. I never envisioned myself as a DBA (and most of the work I do isn’t), but the past month or two I have sunk into getting Redshift right has been worth it for me, my team, and my organization.

My team made the choice to migrate from GCP to AWS, which meant going from BigQuery to Redshift. At first this was great: it gave us a fresh start in a lot of cases and let us slowly wean off of GCP as we migrated to equivalent AWS services, leaving behind the technical debt we had accrued from moving fast. Redshift worked great for us across several databases, until we backfilled our largest database from GCP into Redshift with an entirely new, more scalable schema. That database quickly blew up in size, from a few billion rows to almost 180 billion over the course of a few weeks. As the migration progressed, we noticed that the database’s performance was extremely poor, which led a couple of us to figure out how to optimize it.

I will be totally honest: because BigQuery just worked with our 2,600-column table, we expected Redshift to do the same with our new and improved schemas (which weren’t 2,600 columns!). A lot of digging eventually led us to a blog/guide that worked wonders. Over the next few weeks we iterated through a few versions of more optimal DDL for our tables, and the major performance boosts came from setting appropriate distribution styles, sort keys, and column compression encodings.
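To make that concrete, here is a rough sketch of the kind of DDL we converged on. The table and column names below are hypothetical stand-ins rather than our actual schema, and the specific key and encoding choices are purely illustrative; the right ones depend entirely on your data and query patterns.

```sql
-- Hypothetical example of Redshift DDL with explicit distribution, sort, and
-- compression settings. Names and choices are illustrative, not our real schema.
CREATE TABLE readings (
    device_id    BIGINT           ENCODE az64,   -- az64 works well for integers and timestamps
    metric_name  VARCHAR(64)      ENCODE zstd,   -- zstd is a solid default for text
    recorded_at  TIMESTAMP        ENCODE az64,
    value        DOUBLE PRECISION ENCODE raw     -- floats often compress poorly; raw avoids the overhead
)
DISTSTYLE KEY
DISTKEY (device_id)                              -- co-locate rows that are joined/filtered together
COMPOUND SORTKEY (device_id, recorded_at);       -- match the dominant filter/scan pattern
```

In general, the distribution key should be a high-cardinality column used in joins, and the compound sort key should lead with the columns your queries filter on most, since that is what lets Redshift skip blocks during scans.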

While AWS does say that Redshift analyzes the queries running on your cluster to apply optimal distribution styles, sort keys, and compression encodings, we found these automatic suggestions and applications to be largely non-existent or incorrect for the extremely large tables where they mattered most. This is likely because we almost exclusively wrote to Redshift before exposing databases to the larger organization, which could have thrown off whatever analysis Redshift was doing.
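If you want to see what Redshift has actually applied to a table, the system views make it easy to check. A quick sketch, again with a placeholder table name:

```sql
-- Inspect the distribution style, first sort key, row skew across slices, and
-- how much of the table is unsorted. Table name is a placeholder.
SELECT "table", diststyle, sortkey1, skew_rows, unsorted, size
FROM svv_table_info
WHERE "table" = 'readings';

-- Sample the table and report suggested compression encodings per column.
ANALYZE COMPRESSION readings;
```

Checks like these are how you would confirm whether the automatic choices on a given table are actually sensible.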

After we applied all of our DDL changes and rewrote the tables by performing a deep copy into the new ones (sketched at the end of this post), I am happy to say that our warehouse went from essentially unusable to serving queries over hundreds of billions of rows with millisecond response times. This leads me to conclude that while BigQuery “just worked”, that came at the cost of pretty poorly managed databases, bad schemas, and not really knowing what was going on under the hood. While we may have to manage more on Redshift, we can now be confident in our understanding of the warehouse as a whole, and we no longer have to answer every performance problem with more compute.
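For completeness, here is a hedged sketch of the deep-copy pattern mentioned above, with placeholder names, assuming the replacement table has already been created with the revised DDL from the earlier example:

```sql
-- Assumes readings_new already exists with the new distribution style, sort key,
-- and column encodings. Names are placeholders.
INSERT INTO readings_new SELECT * FROM readings;   -- the deep copy itself

ANALYZE readings_new;                              -- refresh planner statistics on the new table

BEGIN;                                             -- swap the tables in one transaction
ALTER TABLE readings RENAME TO readings_old;
ALTER TABLE readings_new RENAME TO readings;
COMMIT;

DROP TABLE readings_old;                           -- reclaim space once everything checks out
```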
