We had webservers, a database (master/slave), and other services split across usa-east and usa-west.
When usa-east started showing problems, we:
*) Took the usa-east webservers out of round robin DNS (TTL 1hr)
*) Verified the slave (in usa-west) was up to date, shut down the master (usa-east), and converted the slave to master (see the sketch after this list)
*) Updated all webservers to point to the new master.
*) Cranked up new usa-west webservers / updated round robin DNS
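For the slave-promotion step, this is roughly the check-and-promote sequence involved - a minimal sketch, assuming a MySQL master/slave pair and the pymysql client library; the hostname and credentials are placeholders, not our actual setup:

    # Minimal sketch of "verify the slave, then promote it".
    # Hostname and credentials below are hypothetical placeholders.
    import pymysql

    SLAVE_HOST = "db.usa-west.internal"   # hypothetical slave hostname
    DB_USER, DB_PASS = "admin", "secret"  # placeholders

    conn = pymysql.connect(host=SLAVE_HOST, user=DB_USER, password=DB_PASS,
                           cursorclass=pymysql.cursors.DictCursor)
    with conn.cursor() as cur:
        # 1. Confirm replication is healthy and fully caught up.
        cur.execute("SHOW SLAVE STATUS")
        status = cur.fetchone()
        assert status["Slave_IO_Running"] == "Yes"
        assert status["Slave_SQL_Running"] == "Yes"
        assert status["Seconds_Behind_Master"] == 0

        # (The old master in usa-east gets shut down at this point, so no
        #  further writes can land on it.)

        # 2. Stop replication and clear the slave config on the usa-west box.
        cur.execute("STOP SLAVE")
        cur.execute("RESET SLAVE ALL")   # plain RESET SLAVE on older MySQL

        # 3. Open the promoted box up for writes.
        cur.execute("SET GLOBAL read_only = OFF")
    conn.close()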
I believe Amazon offers mechanisms to do this automatically, and we could always have written our own failover scripts, but this is the tradeoff we made. We were willing to accept some service degradation by switching over manually in exchange for avoiding the pitfalls of false-positive detection. Very much an application-specific tradeoff, not for everyone, but it worked for what we were doing.
The key was to avoid putting all our eggs in the usa-east basket by splitting things across usa-west, even though we incur additional bandwidth fees, i.e., master/slave replication traffic between regions is billed at the full transfer rate.
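For anyone curious how that fee pencils out, a rough back-of-the-envelope - both numbers below are purely hypothetical, not actual AWS pricing, so check the current inter-region rates:

    # Back-of-the-envelope cost of cross-region replication traffic.
    # Both inputs are hypothetical illustrations, not real pricing or volume.
    replication_gb_per_day = 5.0   # hypothetical binlog volume shipped to usa-west
    rate_per_gb = 0.02             # hypothetical $/GB inter-region transfer rate

    monthly_cost = replication_gb_per_day * 30 * rate_per_gb
    print(f"~${monthly_cost:.2f}/month extra for the geographic split")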
We were never concerned about cascading failures affecting multiple availability zones in a given region, nor did it matter for us - our redundancy requirement was geographic diversity, not partitions within a datacenter. We were thinking natural disaster, but the architecture covered us in this case as well.
The coolest thing to me is just how quickly we were able to shuffle around these resources to avoid a problem area - a couple of hours. There's no way we could have done it so quickly with what we had before - a combination of our own colocated servers and VPSes.