This doesn’t look good, right?
Most open source monitoring tools check filesystem health by comparing the current percentage of used space against a set threshold. If the filesystem is 90% full, send out a warning page; if it's 89%, send the all clear.
Notice that I said filesystem, and not actual disk. A single disk that’s 90% full can be a bad thing, because there are fewer free blocks available for writing, which leads to longer write times and file fragmentation. Not all filesystems are restricted to a single disk: there may be a back-end RAID solution, or the filesystem may be a shared filesystem served over NFS.
Unfortunately, you can end up on the receiving end of flapping alert pages when a filesystem hovers between 89% and 90% while still performing fine. Unlike a broken Ethernet cable, the resolution for a filesystem threshold may not be so easy. Sometimes there are files that can't be deleted, or there may not be any additional storage to allocate. You may have a filesystem that sits at 91% full for months simply because a new disk shelf won't arrive until the next budget cycle.
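The flapping problem falls straight out of the naive check. Here is a minimal sketch of that logic; the function name and the 90% threshold are illustrative, not taken from any particular tool:

```python
def alert_state(percent_used, warn_at=90):
    """Naive check used by many monitors: warn at or above the
    threshold, clear below it. No hysteresis, no trend awareness."""
    return "WARNING" if percent_used >= warn_at else "OK"

# A filesystem hovering around the threshold flaps between states,
# paging on every transition even though nothing has really changed:
samples = [89, 90, 89, 91, 89, 90]
states = [alert_state(p) for p in samples]
# OK, WARNING, OK, WARNING, OK, WARNING -- five pages, zero new information
```

Every one of those state transitions is a page, and none of them tells you anything you didn't already know.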
Everything comes down to disk blocks, even SAN and NAS solutions. That brings back the concern regarding fragmentation and performance. But what if your filesystem is a read-only OS image? Or what if it turns out 10% equates to 500 gigabytes on a huge disk appliance? If the filesystem is never being written to, or if the amount of writes equates to 0.001% of the entire filesystem, then where’s the fire?
What about the inverse? What if your filesystem never reaches 90% full? Can there still be problems?
In the above graph, nobody would have been paged by Nagios or other tools, because the filesystem never reached 90%. For the past few months it averaged 40% full, shot up to 75%, and then went back down. A newly released application was behaving incorrectly, and the issue was caught by the programmer. The next morning he stealthily re-released the application and corrected the issue. Nobody in systems administration noticed until the graph was checked in relation to another issue. If the programming error had never been discovered, the filesystem would have filled up, probably at the most inconvenient time possible for a systems administrator.
I would like to recommend that people developing filesystem or disk monitoring solutions change their way of thinking about filesystem health. Hard limits on allocated space may still be required, but those warnings should be optional. Measuring fullness makes assumptions about block structure that may not be correct.
At the same time, the monitoring system should watch the standard deviation of the filesystem's usage. More precisely: take the past 24 hours of samples, compute the standard deviation over the first 23 hours, and compare it to the standard deviation of the last hour.
If those two deviations aren't close, then there could be radical changes happening on your filesystem that need to be addressed. Maybe files are being added or deleted; either way, it may warrant an investigation. For large filesystems in the terabyte/petabyte range, the percentage value may not be granular enough, so you will need to work with the actual count of free kilobytes or blocks.
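The deviation comparison described above can be sketched in a few lines. Everything here is an assumption on my part: the sampling interval, the 2x ratio that counts as "not close", and the use of raw free kilobytes rather than percentages for granularity.

```python
import statistics

def usage_anomaly(samples, ratio_limit=2.0):
    """samples: filesystem usage readings (free KB, blocks, or percent),
    oldest first, covering the past 24 hours at a regular interval.

    Splits the window into the first 23 hours (baseline) and the last
    hour (recent), then flags the filesystem if the recent standard
    deviation dwarfs the baseline. ratio_limit is an illustrative guess
    at what "not close" means; tune it against real data.
    """
    split = len(samples) * 23 // 24
    baseline = statistics.pstdev(samples[:split])
    recent = statistics.pstdev(samples[split:])
    if baseline == 0:
        # A perfectly flat baseline (e.g. a read-only OS image):
        # any movement at all in the last hour is worth a look.
        return recent > 0
    return recent / baseline > ratio_limit

# 96 samples at 15-minute intervals: months of ~40% usage,
# then a sudden climb in the final hour, like the graph above.
readings = [40.0] * 92 + [40.0, 55.0, 70.0, 75.0]
usage_anomaly(readings)   # -> True: the last hour broke the pattern
usage_anomaly([40.0] * 96)  # -> False: boring is healthy
```

Note that this catches both directions of trouble: a runaway writer filling the disk and a mass deletion both blow out the recent deviation, while a filesystem parked at 91% for months stays quiet.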
I take it back. This isn’t a recommendation to monitoring developers, this is a challenge. The first major open source monitoring guy that puts this solution together will have my undivided attention.