Each Google product has a service level agreement (SLA) that dictates how much downtime the product can have in a given month or year. Take 99.9% uptime, for example: that allows for about 43 minutes of downtime in a 30-day month, or about 8 hours and 45 minutes per year. That allowance is what Google refers to as an “error budget.”
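The arithmetic behind those figures can be sketched in a few lines of Python. This is only an illustration, not Google's tooling: the function name `error_budget` and the choice of a 30-day month and a 365-day year are assumptions made for the example.

```python
from datetime import timedelta


def error_budget(availability_pct: float, period: timedelta) -> timedelta:
    """Return the downtime allowed in `period` at the given availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    # The budget is the fraction of the period that is allowed to be downtime.
    allowed_fraction = (100 - availability_pct) / 100
    return period * allowed_fraction


# Assumed period lengths: a 30-day month and a 365-day year.
monthly = error_budget(99.9, timedelta(days=30))
yearly = error_budget(99.9, timedelta(days=365))

print(f"monthly budget: {monthly.total_seconds() / 60:.1f} minutes")  # 43.2 minutes
print(f"yearly budget: {yearly.total_seconds() / 3600:.2f} hours")    # 8.76 hours
```

Under these assumptions, a 99.9% target leaves roughly 43 minutes of downtime per month and a little under 9 hours per year.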
Google product managers don’t have to be perfect — they just have to be better than their SLA guarantee. So each product team at Google has a “budget” of errors it can make.
If the product adheres to the SLA’s uptime promise, then the product team is allowed to launch new features. If the product is outside of its SLA, then no new features are allowed to be rolled out until the reliability improves.
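As a rough sketch, the launch rule amounts to comparing the downtime used so far against the allowance. The `ErrorBudget` class, its field names, and the sample numbers below are hypothetical and chosen only to illustrate the rule; they do not describe how Google implements it.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    allowed_minutes: float  # downtime the SLA permits in the period
    used_minutes: float     # downtime observed so far in the period

    @property
    def remaining_minutes(self) -> float:
        return self.allowed_minutes - self.used_minutes

    def launches_allowed(self) -> bool:
        # Within the SLA: downtime so far has not exceeded the allowance.
        return self.used_minutes <= self.allowed_minutes


# Hypothetical products measured against a 43.2-minute monthly allowance.
healthy = ErrorBudget(allowed_minutes=43.2, used_minutes=12.0)
exhausted = ErrorBudget(allowed_minutes=43.2, used_minutes=50.0)

print(healthy.launches_allowed(), exhausted.launches_allowed())  # True False
```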
In a traditional site reliability model there is a fundamental disconnect between site reliability engineers (SREs) and product managers. Product managers want to keep adding services to their offerings, but SREs resist changes because each change opens the door to more potential problems.
This “error budget” model addresses that issue by uniting the priorities of the SREs and product teams. Because product developers want to add more features, they have an incentive to architect reliable systems. It seems to work; according to tracking company CloudHarmony, Google had one of the most reliable IaaS clouds among the major vendors in 2014.