An anonymous reader writes: Algolia is a buzzword-compliant ("Hosted Search API that delivers instant and relevant results") start-up that uses a lot of open-source software (including various strains of Linux) and a lot of solid-state disk, and as such sometimes runs into problems with each of these. Their blog this week features a fascinating look at troubles that they faced with ext4 filesystems mysteriously flipping to read-only mode: not such a good thing for machines processing a search index, not just dishing it out. The NGINX daemon serving all the HTTP(S) communication of our API was up and ready to serve the search queries but the indexing process crashed. Since the indexing process is guarded by supervise, crashing in a loop would have been understandable but a complete crash was not. As it turned out the filesystem was in a read-only mode. All right, let’s assume it was a cosmic ray
:) The filesystem got fixed, files were restored from another healthy server and everything looked fine again.
The next day another server ended with filesystem in read-only, two hours after another one and then next hour another one. Something was going on. After restoring the filesystem and the files, it was time for serious analysis since this was not a one time thing.
The rest of the story explains how they isolated the problem and worked around it; it turns out that the culprit was TRIM, or rather TRIM's interaction with certain SSDs: The system was issuing a TRIM to erase empty blocks, the command got misinterpreted by the drive and the controller erased blocks it was not supposed to. Therefore our files ended-up with 512 bytes of zeroes, files smaller than 512 bytes were completely zeroed. When we were lucky enough, the misbehaving TRIM hit the super-block of the filesystem and caused a corruption.
Since SSDs are becoming the norm outside the data center as well as within, some of the problems that their analysis exposed for one company probably would be good to test for elsewhere. One upshot: As a result, we informed our server provider about the affected SSDs and they informed the manufacturer. Our new deployments were switched to different SSD drives and we don’t recommend anyone to use any SSD that is anyhow mentioned in a bad way by the Linux kernel.