Beyond that, my main concern is how the updates perform on the rank column, because if this isn't efficient, the rank column isn't a good strategy. As you can see, the update queries cause some performance degradation. This reinforces how important it is to optimize the updates.
Updates to the
rank_in_game column caused an irregular workload, which is a bit smoothed out in the average. As I explained earlier, some updates will be cheap, but others may cause large parts of the
rank_in_game index to be updated, and this will happen randomly in the queries my benchmark script issues. That explains the irregularities.
While it looks like designs #2 and #3 suffer about the same or slightly greater performance penalty from the updates as design #1, I don't think that's really the case. The problem is that my benchmarks aren't realistic. The updates to tables with ranks aren't optimized in my benchmarks, and the updates to the tables without ranks are trivially cheap. In real life I think design #1 would take a greater hit, and smarter updates to the ranked tables would help designs #2 and #3 significantly. However, I cannot prove this without redesigning my benchmarks and running them again.
I also ran some benchmarks against a table with a million rows, but I don't show the results because they took too long to run for design #1. I didn't get a full set of results. However, the results I got show design #1 performs badly as the table grows. At 5 concurrent readers, it could only generate 0.1 leaderboards per second, whereas design #2 achieved about 2.5 and design #3 produced 420.
If you have questions or suggestions, please contact me.
Taking It to the Next Step
The benchmarks show the rank column is a significant improvement over calculating ranks and offsets every time a leaderboard is generated. This naturally suggests that you can do more of the same and get further improvement. Though I haven't designed and run benchmarks for any of the following, here are some ideas.
The next logical step is to add a
pos_in_game column. This makes it more efficient to align random leaderboards on boundaries of rank module 20. The downside is it's harder to maintain. There are fewer "cheap update" queries; most updates cause more work. While this might be a good idea in some workloads, I think it's reaching the point of diminishing returns.
A better strategy is to keep the score tables small and lean, and create separate rank tables. This should result in less data overall, partly because InnoDB's design makes extra indexes on the score table contain the columns from the primary key. While that's sometimes desirable, in this case it makes the indexes larger than I want. Adding the
rank_in_game column increased the data and index size from about 78MiB to around 102.
The rank table should have a primary key on score. Without the redundant data, it will be smaller than the index on the ranking column in the examples above. This lets you update less data. It also makes it easier to do some calculations; you don't need to code special cases for rows that have the same score.
You could also maintain a count column that records how many times a particular score is duplicated in the score table. This column can help you avoid
COUNT() queries on the score and rank tables; they become
SUM() queries on the rank table instead. This might be a significant efficiency if you have many duplicated scores. In fact, you can even do better: you can maintain those sums in the table, too, so for any given score you can see its rank, how many times it occurs in the score table, and how many scores come before it.
I would optimize queries and maintenance on this table with many of the same techniques I explained.
What About the Query Cache?
Smart caching can solve many problems, but MySQL's query cache won't help much for this workload. Updating a row invalidates every cached query that uses the table. This is likely to cause more work than it saves, and use memory that could otherwise cache real data. I have seen a significant performance increase from disabling the query cache in such situations.
In this article I've shown some ways to optimize ranks in MySQL so you can efficiently generate ranked, paginated views of a dataset. This generally a very hard problem to solve. The de-normalizations I showed you have some maintenance overhead, but if you have a high ratio of reads to writes, they may be worth it. Indeed, my benchmarks showed that even simple designs produce constant-time queries that out-perform traditional linear-time queries even with concurrent updates. The benefits become even greater as the data size increases. The alternative strategies I mentioned will probably be even more efficient than the designs I benchmarked.
This approach might help a lot on some workloads, perhaps not at all on others, but I hope I've given you some good ideas on how to optimize rank data.
Please leave your thoughts in the comments. I'm looking forward to learning how you handle this kind of problem yourself. You can also download the example code for this article.
Baron Schwartz is a software engineer who lives in Charlottesville, Virginia and goes by the online handle of "Xaprb," which is his first name typed in QWERTY on a Dvorak keyboard. He blogs about software engineering at http://www.xaprb.com/blog/.
Return to MySQL.
Showing messages 1 through 5 of 5.
2009-11-01 07:08:34 sguerrera [View]
2007-10-18 17:17:59 mwadhera [View]
2007-03-03 09:29:51 nygard [View]