availability | The Blog Pros

December 15, 2021

What Is CAP Theorem?

In short, the CAP theorem describes how a distributed system will behave in the event of a network partition. It is one of the most important results in distributed systems design. Through the course of this text, I will share more information on this theorem and why it is important. By the time you’re done reading, you’ll also know why CAP may not be enough for modern-day systems.

Before we start, because the CAP theorem is inseparably related to distributed systems, I would like to add a quick word about them.

March 26, 2021

Ebbs and Flows of DevOps Debugging

Introduction

Ever since Patrick Debois coined the word DevOps back in 2009, teams and organizations have been clamoring to adopt relevant practices, tools, and a sense of culture in a bid to increase velocity while maintaining stability. However, this race to incorporate “DevOps” in software development practices has resulted in a perversion of the concept. This does not mean that no teams have adopted DevOps successfully, but the word overall has become a buzzword. As per the DORA 2019 State of DevOps report, team managers are more likely than frontline engineers and developers to proclaim that their teams are practicing DevOps.

Therefore, this piece aims to realign the meaning of DevOps and to highlight the need to treat debugging as a core element of the practices and cultures that enable DevOps for teams. The argument for debugging as a core component of the DevOps pipeline follows from the evident need for a shift-left in the way we build and release software, empowering developers to adhere to the intrinsic principle of “you build it, you run it.”

February 17, 2021

The Theory and Motive Behind Active/Active Multi-Region Architectures

The date was 24th December 2012, Christmas Eve. The world’s largest video streaming service, Netflix, experienced one of the worst incidents in its history: an outage of video playback on TV devices for customers in Canada, the United States, and the LATAM region. Fortunately, the sustained efforts of responders at Netflix and at AWS, whose Amazon Elastic Load Balancer service experienced the disruptions that caused the incident, restored services just in time for Christmas. If one were to think about the events that ensued at Netflix and AWS that day, they would be comparable to all those movies about saving Christmas that we love to watch around that time of year.

This idea of incident management comes from the ubiquitous fact that incidents will happen. This is well known, and was best immortalized by Amazon VP and CTO Werner Vogels when he said, “Everything fails all the time.” It is, therefore, understood that things will break, but the question that persists is: can we do anything to mitigate the impact of these inevitable incidents? The answer is, of course, yes.

December 2, 2020 / December 23, 2020

Lessons Learned from the November AWS Outage

Context, Analysis, and Impact

Amazon’s internet infrastructure service experienced a multi-hour outage on Wednesday, November 25th, that affected a large portion of the internet.
More than 50 companies were impacted, including Roku, Adobe, Flickr, Twilio, Tribune Publishing, and Amazon’s smart security division, Ring. The outage occurred in the AWS region covering the eastern U.S.
Business impacts, as reported by The Washington Post, included:
- New account activation and the mobile app for streaming media service Roku were hampered.
- Target-owned Shipt delivery service could receive and process some orders, though it stated that it was taking steps to manage capacity because of the outage.
- Photo storage service Flickr tweeted that customers couldn’t log in or create an account because of the AWS outage.

Root Cause Analysis by AWS: The outage started with Amazon Kinesis and then spread to a long list of services that depend on it. AWS has published an RCA document with the details.

Lessons Learned

#1: Don't Put All Your Eggs in One Basket

Relying on a single cloud service provider can leave you exposed in these scenarios.
Consider a hybrid-cloud, private-cloud, or multi-cloud strategy, particularly during peak season.

#2: Hope for the Best and Plan for the Worst

Don't just rely on a cloud provider's availability and multi-region fail-over strategy; build your own resiliency and disaster recovery approach.
Practice disaster recovery in production or production-like systems, for example with an active-active setup across multi-cloud or hybrid-cloud environments.

#3: Monitoring and Observability Are Not Static

Be innovative in exploring monitoring and observability patterns. For example, if AWS is reporting an outage on its status page, your monitoring system should spring into action and notify the incident resolution team so that it can start analyzing the impact.
Keep the service dependency graph ready. Although tools mostly maintain it for you, keep it up to date so that, when an outage happens, you can assess the impact, map it to business functionality, and report it accurately to your business team.
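As a rough illustration of the dependency-graph idea, the sketch below walks a graph to find every service affected by a failure and maps the result to business functions. The service names, the `DEPENDS_ON` and `BUSINESS_FUNCTION` tables, and both helper functions are hypothetical; in practice the graph would come from your own tooling.

```python
from collections import deque

# Hypothetical graph: each service maps to the services it depends on.
DEPENDS_ON = {
    "checkout": ["payments", "event-stream"],
    "payments": ["auth"],
    "notifications": ["event-stream"],
    "auth": [],
    "event-stream": [],
}

# Hypothetical mapping from services to the business functions they support.
BUSINESS_FUNCTION = {
    "checkout": "Order placement",
    "payments": "Payment processing",
    "notifications": "Customer alerts",
    "auth": "Login and sign-up",
}


def impacted_services(failed, depends_on):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the edges so we can walk from a dependency to its dependents.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    # Breadth-first walk outward from the failed service.
    impacted = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in impacted:
                impacted.add(service)
                queue.append(service)
    return impacted


def business_impact(failed, depends_on, business_function):
    """List the business functions touched by a failure, sorted by name."""
    services = impacted_services(failed, depends_on) | {failed}
    return sorted(business_function[s] for s in services if s in business_function)


if __name__ == "__main__":
    print(business_impact("event-stream", DEPENDS_ON, BUSINESS_FUNCTION))
```

With this sample data, a failure of the hypothetical `event-stream` service reaches `checkout` and `notifications`, so the report to the business team would cover order placement and customer alerts.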

#4: Invest in Emerging Techniques, like Chaos Engineering

This failure indicates that even internet giants like AWS are still maturing in implementing practices like chaos engineering. So, start putting chaos engineering practices into your roadmap.
For example, had a bulkhead pattern been applied in the AWS outage scenario, the outage might have been limited to Kinesis services only.
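For readers unfamiliar with the bulkhead pattern, here is a minimal sketch of the general idea, not a description of how AWS builds its services. Each dependency gets its own small pool of concurrency slots, so a slow or failing dependency can exhaust only its own slots. The `Bulkhead` class and `BulkheadFullError` are illustrative names, not part of any library.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a dependency's concurrency slots are all in use."""


class Bulkhead:
    """Caps concurrent calls to one dependency (illustrative helper)."""

    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Fail fast instead of queueing, so callers are not tied up
        # waiting on a dependency that is already saturated.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} bulkhead is full")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


# One bulkhead per dependency: saturating one leaves the other untouched.
stream_bulkhead = Bulkhead("stream", max_concurrent=2)
auth_bulkhead = Bulkhead("auth", max_concurrent=2)
```

The design choice here is to reject excess calls immediately rather than block; a caller can catch `BulkheadFullError` and degrade gracefully, for example by serving cached data.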

To conclude, being proactive when outages occur, having a response team equipped for unplanned outages, and improving continuously from lessons learned along the way are essential to keeping the impact limited. A multi-cloud or hybrid-cloud strategy is also worth considering to keep the business running.

September 24, 2020

Availability, Maintainability, Reliability: What’s the Difference?

We live in an era of reliability where users depend on having consistent access to services. When choosing between competing services, no feature is more important to users than reliability. But what does reliability mean?

To answer this question, we’ll break down reliability in terms of other metrics within reliability engineering: availability and maintainability. Distinguishing these terms isn’t a matter of semantics. Understanding the differences can help you better prioritize development efforts towards customer happiness.
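One common way to relate these metrics, offered here as a general reliability-engineering convention rather than this article's own definition, is steady-state availability = MTBF / (MTBF + MTTR), where MTBF (mean time between failures) reflects reliability and MTTR (mean time to repair) reflects maintainability. The function names and sample numbers below are illustrative.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability as the fraction of time the service is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def yearly_downtime_hours(avail, hours_per_year=8760):
    """Expected downtime per year implied by an availability fraction."""
    return (1 - avail) * hours_per_year


if __name__ == "__main__":
    # Illustrative numbers: a failure every 999 hours, repaired in 1 hour.
    a = availability(mtbf_hours=999, mttr_hours=1)
    print(f"availability: {a:.3%}")
    print(f"downtime per year: {yearly_downtime_hours(a):.2f} hours")
```

The same availability can come from rare failures that take long to fix or from frequent failures that are fixed quickly, which is why the terms are worth keeping apart.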

January 14, 2020

How to Migrate From Elasticsearch 1.7 to 6.8 With Zero Downtime

December 20, 2019

How Cloud Services Modernize ERP

Combining cloud and ERP

Today’s business environments are comparable to the old spinning plate trick on the Ed Sullivan Show from the 1960s. We all remember the famous scene: Erich Brenn running back and forth, frantically keeping the plates spinning on each pole as the “Saber Dance” music plays in the background to enhance the drama. In today’s parlance, these plates are the vast amounts of data that a business creates every day, and the act is keeping the entire engine running smoothly in terms of data analysis and security. ERP (Enterprise Resource Planning) tools and methods were devised to tackle the running of day-to-day operations in the office in terms of the data, hardware, and software used by team members.

May 15, 2019

The Fundamentals of Cybersecurity

Adoption of the internet by businesses and enterprises has made mobile banking, online shopping, and social networking possible. While it has opened up a lot of opportunities for us, it’s not altogether a safe place, because its anonymity also harbors cybercriminals. So, to protect yourself against the cyber threats of today, you must have a solid understanding of cybersecurity. This article will help you get a grip on cybersecurity fundamentals.

Let’s take a look at the topics covered in this cybersecurity fundamentals article:

May 9, 2019

Antipattern of the Month: Too Busy

Often, the very people whose involvement is most critical to an initiative are those who are least "available." Senior managers, for example, can have a deep level of domain experience which they have built up over years, and they can exert authority over important decisions. Enterprise architects and Product Owners, in particular, may have accumulated responsibilities over sweeping areas of organizational concern. Such people are notoriously time-poor, and can be unable or unwilling to focus on a single product or team. This means that they often fail to make the appropriate commitment to an Agile role, and do not demonstrate the quality of involvement expected of them.

One symptom is that they might see themselves as being "too busy" to fulfill the role and its responsibilities. A Product Owner, for example, may be "too busy" to attend Product Backlog refinement sessions, or perhaps even Sprint Planning, Sprint Reviews, and Sprint Retrospectives. They may allege that they "trust" the Development Team to make an appropriate delivery without their participation. This is unsatisfactory, as it means abdicating their collaborative responsibilities, and the inspection and adaptation of progress are compromised.