Today, the main sessions of OSCON kicked off and after the keynotes I attended Josh Berkus’ talk entitled “Performance Whack-a-mole“. Since MusicBrainz relies on Postgres for its database back-end and I’ve had to deal with database performance issues, I’m keenly interested to learn the tricks of the trade. And since Josh is one of the key guys behind Postgres, this talk was getting the tips straight from the horse’s mouth.
Josh started off by giving a little more background of database performance problems (”Moles”) and how to play the game to whack the moles. Josh mentioned that you can only see the largest Mole and that the other smaller Moles are obscured by the largest mole. For instance, if the database server has poor I/O because of some hardware/driver issue, then you will not see problems caused by other less severe database problems like missing indexes or poor query construction.
Josh went on to outline common database applications:
- Web Application: DB smaller than RAM, mostly database reads.
- Online transaction processing (OLTP): DB slightly larger than ram to 1TB, 20%-40% small transactions and some long transactions.
- Data warehousing: 100GB - 100TB databases, large and complex queries, lots of bulk data reads
Each of these applications tend to exhibit common performance problems:
- Web Application: CPU bound. Moles: caching, pooling, connection time
- OLTP: CPU or I/O bound. Moles: locks, cache, transactions, write speed, logs
- Data warehousing: IO or RAM bound. Moles: sequential scans, resources, bad queries.
Of course the above cases are not always the case, but Josh has observed these cases to be the most common problems. Next, Josh outlined the rules for whack-a-mole:
- Most database performance problems are not actually “database” performance problems. Most are application/middleware performance issues.
- Less than 10% performance problems cause 90% of performance degradation. Corollary: We don’t care about 90% of the other moles
- At any given time, it is usually possible to observe and troubleshoot only the “largest” mole. This gives rise to the whole whack-a-mole situation.
- Different application types usually have different troubleshooting needs.
Keeping these rules in mind, Josh proceeded to outline the first step in actual mole whacking: Determine the baseline. The fundamental idea in establishing a baseline is to understand the underlying system and to identify possible problems in that system. Sometimes outdated hardware or outdated kernels/drivers present the root cause for slow database performance and they actually have nothing to do with the database itself. Even if the underlying system checks out, understanding the underlying system is crucial for finding performance problems.
A baseline should establish the hardware setup and gather data about the server, the storage system and the network setup. The baseline should include the OS, its version, its patch level and any local modifications. Most of the time the OS should be brought up to the latest patch level and hardware drivers to eliminate underlying OS troubles. The filesystem should be examined next to ensure that the basic setup and partitions make sense. Next the database itself needs to be examined to ensure that its properly configured for the hardware its running on. Then the middleware needs to be examined next to ensure that the database connection methods/pooling make sense. Finally the application needs to be examined to understand the nature of the application and how the application makes use of the database.
If after establishing the baseline the problem has not been identified, its time to start whacking moles. Josh outlined the following Postgres specifc tools:
- pg_stat_* views, dtrace: Taking a look at the internal postgres tables that give a glimpse into the inner workings of the database gives a minimally invasive approach to find schema, query and lock problems.
- postgres log and pg_fouine: Logging the activity of the database is slightly more invasive, but it allows for introspection of specific types of database activity and can be used to compute the overall statistics on query and database load.
- explain analyze: Useful for troubleshooting specific queries in an effort to optimize the queries.
Use the following OS level tools to establish the nature of your performance problems:
- ps (dbstat): See which processes are running and identify instances of the database taking up too much memory/CPU.
- mpstat: See activity for your CPUs. Determine if your database is CPU bound or if all processors are being utilized.
- vmstat, free: See if RAM is saturated and if the system is swapping.
- iostat: See if IO is saturated or if one resource is causing bottlenecks.
Finally, you can use micro benchmarks to examine if the system is performing as expected:
- bonnie++: Test filesystem throughput and check seek and random write speeds.
- pgbench: test IO and connection processing.
These “micro benchmarks” do no require the database system to be taken offline and thus are much less invasive than serious benchmarks. Serious benchmarks include DBT2, DBT3, but these require the database to be offline, which is much more invasive than the micro benchmarks.
Josh covered a lot of ground in a short period of time and its hard to convey all of the information he provided in the session. In particular I glossed over the details for establishing a baseline, since Joshn provided a whole slew of slides that I can’t all cover here. If database performance tuning appeals to you (voluntarily or by force) I would strongly recommend that you download and take a close look a the slides that Josh provided for his talk.
Personally, I’m looking forward to having more tools at my disposal the next time the MusicBrainz database server starts hiccuping. Thanks for all the details Josh!.





