Heroku’s postgres service is quickly emerging as a scalable, worry-free DaaS for tech companies of any size, and its synergy with Heroku is a very strong reason why lots of startups are able to remain on the platform for much longer before considering any AWS/colo migrations, and the expensive ops hires that may come with them.
At Pixlee, we’re strong believers in the Pareto Principle (aka the 80-20 rule), which states that for many situations, 80% of the results come from 20% of the work. It’s a brilliantly simple principle (which has inexplicably generated a best selling book that basically paraphrases the concept for 300 pages) that has proven time and time again for us that if you work hard at tackling low-hanging fruit, you will be greatly rewarded.
Relational database optimization is a typical exhibit for the pareto principle, since there are countless categories and subcategories of optimization strategy, and it’s very easy to become mired in detail (if you have time to kill, read the classic Selinger paper on System R, the grandfather of SQL path optimization). Thankfully, modern day RDBMSs have gotten very good at handling the ugly nitty gritties by default, leaving us only a few big picture optimizations to worry about.
Most of the talking points in this article apply as good advice for any RDBMS, but are particularly pain-free when using the Heroku Postgres service. I highly recommend it to startups considering their database options.
An Overview of Heroku Postgres
Essentially heroku postgres is a managed cluster of Amazon EBS instances that abstracts away all of the pain points of RAID EBS maintenance (see Heroku cofounder Orion Henry’s 2009 blogpost to give you an idea). What you get is a set of cli tools, automated backups, 99.9% uptime, and easy follower db setup. The downsides: no superuser perms, and the high RAM tiers are pricey.
Heroku postgres’s documentation pages are well written and easy to set up. So we’ll assume that you’re in a situation where you have your heroku app set up and happily hooked into postgres. You know that your database is far from optimized, but where do you start?
Use NewRelic to spot slow queries
Heroku’s NewRelic addon is great at identifying ill performing database queries. After registering the addon and waiting a few minutes, you’ll get a graph that looks something like this:
It’s quite self-explanatory – each color represents a spot of blame for request time. In this case, you can see that the bulk of the execution time lies in the database.
Going to the database tab, you can identify problem spots, get the SQL trace and the EXPLAIN plan, which looks something like this:
This is your starting point for any optimization you want to make. The most important part of this page is the EXPLAIN plan, which is critical in showing you what parts of a database query is costing the most time.
Explain and Analyze are the knife and fork of relational database performance optimization, so it’s staggering how many developers neglect to test their nonperformant queries with it.
Essentially, Explain displays the execution plan that PostgreSQL will use to execute a query, and Explain Analyze actually executes the query. Analyze of course increases runtime, but gives accurate costs (in milliseconds).
Let’s take a look at a multiple join analyze:
What we’re looking at is a simple mutiple join query with a result set of about 3000 rows.
Here is the result, with no indexes applied aside from the primary keys:
Definitely at first glance, a handful to process! But it’s actually quite simple – essentially, SQL takes your query and the information it gathers about the database (running VACUUM ANALYZE helps keep it up to date) to guess at a best plan for the query.
In this case, we see that the query took 283 ms to execute, which doesn’t seem too bad. But what exactly is going on inside the query plan?
Most of the work involved with this query is simply finding a way to join together the separate tables. Join algorithms are a separate blog post in and of itself, but for a TL;DR, they range from the most brute force (a nested loop that compares the columns of the 2 tables) to much more efficient algorithms like hash joins and bitmap heap scans, which take advantage of precomputed indexes on a table column to intelligently search a subset of the disk pages.
The key numbers are shown in the “actual time” text, which shows “<startup time>..<total time>” per operation in milliseconds. We see in this query that a huge portion of the time on that sequential scan through album_photos. This is a hint that we are able to construct an index on album_photos.photo_id, that will improve the album_photo-photo join.
PostgreSQL allows several kinds of indexes to be built (again, a separate blog post) – B-Trees, Hash indexes, GiST indexes and GIN indexes. In an overwhelming majority of cases, btrees are optimal (this has to do with the fact that btrees by nature are very effective in improving hard disk seek efficiency). So let’s create an index on album_photos.photo_id and run the EXPLAIN again.
In this case, because there are key indexes on the join relations, it doesn’t have to do any expensive sequential scans and is able to use a very cheap bitmap heap scan. This query therefore is pretty much as optimized as it’ll get, with a total time of 32.9 ms – 1/8th the time of the original query!
Index aggressively, then remove the unused
Explain Analyze tells you where in your database you may be able to optimize by creating an index. Of course, it’s possible to over-index your database, taking up needless memory. Luckily, heroku has developed a neat little CLI for database management: pg extras
pg extras allows you to check index usage, index size, and most importantly, unused indexes. So if you’re unsure of whether an index will help, you can always create it, wait for some regular processing, and run pg:unused_indexes to see whether it has been hit.
pg extras can also tell you your cache rate, which tables have suffered the most sequential scans, what your long running queries look like, and whether any queries are write blocking.
Intelligent query optimization
Once you have vacuum analyzed your database with the proper set of indexes, your bottleneck will likely be how you write your queries. Most of the principles of query optimization are painfully obvious, yet easy to neglect. Here are some pointers to get you started:
- Nested queries are slow. In lots of instances, complex nested queries can either be devolved into joins, or split into two simple queries with some trivial calculation in between. A lot of this is case-by-case, but again, EXPLAIN is your friend.
- Commits are expensive. If you can pipeline some statements together before commit, do so.
- Write explicit joins at all times, instead of relying on WHERE clauses. It lets you be more sure of how Postgres will handle the query plan, and Postgres is deprecating nonexplicit joins soon.
- Stop selecting with .*. Text fields in particular are extremely expensive to grab in large amounts, and that runtime doesn’t show up in EXPLAIN. Don’t be lazy and write out exactly the columns you want to grab. Ex gr. in the EXPLAIN ANALYZE example above, adding a photos.description (a text field) to the select query adds nearly half a second to the overall execution time (all suffered in row retrieval).
- Limit and offset aggressively. When grabbing lots of rows, an expensive DB call with many text/large varchar columns can block frontend execution. If possible, paginate, limit and offset.
This about covers most of the low hanging fruit that will help you drastically improve db throughput, with very little time expensed. Good luck!
EDIT1: Craig from Heroku recommends datascope for live performance monitoring. It takes advantage of pg_stat_statements extension released in PostgreSQL 9.2 and above to give live updates on actual operations on the database, and is very easy to set up (I got it running in 5 minutes).
Dec 4, 2013 Edit: Pixlee’s databases are now hosted on Amazon RDS Postgres – a decision based on value for money, as well as the superb advantages in Provisioned IOPS and Multi-AZ failover. That being said, much of the advice in the article applies to Postgres as a whole (additionally, RDS PG does not currently support pg_stat_statements).