In a corporate blog post on Sept. 19, Facebook application operations engineer Sean Lynch revealed the development of a tool, “Claspin,” which generates a heat map of the company’s numerous racks and servers, making it easier to determine which are “bad” and in need of repair.
According to Lynch, Facebook originally set out to manage the health of its computing resources via two tools: Memcache and TAO, a caching graph database that performs its own MySQL queries. TAO generates reams of data from servers and clients, all of it collected into dashboards showing various latency and error-rate statistics, but that approach began to present Facebook engineers with scalability problems.
In response, Lynch turned to creating a tool that could generate lists of hosts, each ranked by a metric such as the number of timeouts or TCP retransmits. The resulting tool listed each server as a tuple, or an ordered list of elements. But the solution was also text-heavy, and interpreting it required a somewhat-trained operator: in that case, Lynch himself. So Lynch settled on a heat map, with each “pixel” representing a host.
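To make the progression from ranked host lists to a heat map concrete, here is a minimal Python sketch. It is not Claspin's code: the names `HostStats`, `rank_hosts`, `heat_level` and `heatmap_rows`, the three-color scale and the `warn`/`crit` thresholds are all assumptions chosen for illustration.

```python
from typing import List, NamedTuple


class HostStats(NamedTuple):
    """One host as a tuple of metrics (hypothetical fields)."""
    host: str
    timeouts: int
    tcp_retransmits: int


def rank_hosts(stats: List[HostStats]) -> List[HostStats]:
    """Text-heavy view: hosts sorted worst-first by timeouts, then retransmits."""
    return sorted(stats, key=lambda s: (s.timeouts, s.tcp_retransmits), reverse=True)


def heat_level(s: HostStats, warn: int = 10, crit: int = 100) -> str:
    """Map a host's worst metric to a color; thresholds are made up."""
    worst = max(s.timeouts, s.tcp_retransmits)
    if worst >= crit:
        return "red"
    if worst >= warn:
        return "yellow"
    return "green"


def heatmap_rows(stats: List[HostStats], width: int) -> List[List[str]]:
    """Heat-map view: one colored cell ("pixel") per host, wrapped into rows."""
    cells = [heat_level(s) for s in stats]
    return [cells[i:i + width] for i in range(0, len(cells), width)]


if __name__ == "__main__":
    sample = [
        HostStats("host-a", 0, 3),
        HostStats("host-b", 120, 5),
        HostStats("host-c", 4, 40),
    ]
    for row in heatmap_rows(sample, width=2):
        print(" ".join(row))
```

The ranked list still has to be read line by line, while the grid lets an operator pick out a red cell at a glance, which is the trade-off the article describes.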