The Theory and Motive Behind Active/Active Multi-Region Architectures

The date was 24th December 2012, Christmas Eve. Netflix, the world’s largest video streaming service, experienced one of the worst incidents in its history: an outage of video playback on TV devices for customers in Canada, the United States, and the LATAM region. The cause was a disruption of the Amazon Elastic Load Balancer service at AWS. Fortunately, the sustained efforts of responders at both Netflix and AWS restored services just in time for Christmas. The events that unfolded at Netflix and AWS that day are comparable to all those movies about saving Christmas that we love to watch around that time of year.

Incident management starts from the ubiquitous fact that incidents will happen. This is well known, and it was best immortalized by Amazon VP and CTO Werner Vogels when he said, “Everything fails all the time”. It is therefore understood that things will break. The question that persists is whether we can do anything to mitigate the impact of these inevitable incidents. The answer is, of course, yes.