While the end-users of corporate IT services often carry the stigma of being fickle, they can also be quite forgiving. There is a general understanding that technology, for all its glitz and glam, is not perfect, so users can overlook the very rare bump in the road. As a service provider, the last thing you want to do is leave your users unable to access the service or, worse, unable to trust it. However, it’s not enough to just work to prevent failures; an equal amount of effort needs to go into properly recovering from them as well.

Part 3 of this series discussed how crucial it is to understand where and how latency can be introduced into your system, given how distributed modern applications are. For end-users, the lines have blurred between services being completely offline and not being responsive enough. As the catchphrase goes, “slow is the new down.” Now that you have leveraged the proper resources to ensure an efficient system, it is crucial to make sure that when unavoidable hiccups do happen, they are remediated as quickly as possible.

Properly remediating hiccups plays a significant role in instilling stakeholders’ confidence in the system. That role falls under one of the tenets of a standard architecture framework supported by major cloud platform providers. Each of those tenets helps to form the complete picture in determining the proper direction to take on your cloud journey. As discussed previously, the five tenets are: Cost Optimization, Operational Excellence, Performance Efficiency, Reliability, and Security.

In this fourth article of the series, focus will be placed on Reliability. If the experience of your service is perpetually degraded, due to its constant offline state or severe data fragmentation, you will lose the confidence of your stakeholders. This article will examine the factors that can lead to such an undesired situation.

While the concepts presented are universal, the specific examples laid out here will be centered around Microsoft's Azure cloud platform.

Technology Should “Just Work”

Reliability has always been one of the make-or-break attributes for any business. For an eCommerce company, any amount of downtime could lead to a direct loss in revenue due to missed sales. An inability to return the system to the correct state would lead to even more problems. Even in the earliest days of the World Wide Web, users expected to go to a website and get a response, even if they needed to be a little patient (think 28.8k and 56k modems).

The 2 Levers of Reliability

In today’s world of modern applications, reliability concerns are virtually the same as they have always been for web-based systems:

  • Availability – The ability of your users to access the system when needed. The system doesn’t even need to be offline to be considered unavailable; it could be throwing unrecoverable errors or moving so slowly that users cannot accomplish their tasks.

  • Resiliency – The ability of the system to recover from failures and continue to function. While this may sound very similar to availability, the nuance is whether the state of the system is properly recovered.

Continuing with the example of an eCommerce company, let’s say you have a blow-up with the backend order processing system, which is the heart of the entire operation. The system goes completely offline and, even though it is brought back online within a few minutes, any work performed during that downtime is lost. This means new orders are missed completely, and orders that were being processed run a significant risk of being restored to an invalid state. That can lead to issues like customers being overcharged or undercharged, or orders being fulfilled incorrectly because they are missing data.

A system with the proper level of resilience in that situation would have the facilities in place to make sure that when availability becomes a problem, issues don’t cascade to a point where the state of the whole system is compromised. There is not a single entity on this planet that is immune to availability issues. Those that seem immune have simply sunk a lot of resources into reliability and built their systems with no single point of failure. So, it is quite possible that a global service you consume has had availability issues, but they only happened in a limited geography and impacted users other than you.

Reliability, like many other things, is a balancing act. It is possible to over-extend or waste your resources trying to create the most stable system in existence. The most important point here is to understand the availability requirements of your business and take the tactical mitigations that align with the strategy for meeting those requirements.
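To make an availability requirement concrete, it helps to translate a percentage target into the amount of downtime it actually permits. The following is a minimal sketch of that arithmetic; the function name and the 30-day period are assumptions chosen for illustration, not part of any framework. Under those assumptions, a 99.9% target leaves roughly 43 minutes of downtime per 30 days.

```python
def downtime_budget_minutes(availability_pct: float, period_days: float = 30) -> float:
    """Return the minutes of downtime a given availability target allows.

    availability_pct is a percentage such as 99.9; period_days is the length
    of the measurement window (30 days is an assumption for illustration).
    """
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


# Compare a few common targets to see how quickly the budget shrinks.
for target in (99.0, 99.9, 99.99):
    budget = downtime_budget_minutes(target)
    print(f"{target}% availability -> {budget:.1f} minutes of downtime per 30 days")
```

Each additional "nine" cuts the budget by a factor of ten, which is why the target should come from business requirements rather than from a desire for the highest number possible.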

Food for Thought

Here are a few questions to get you thinking about how to handle things in your current environment or how you can set things up in a desirable way from the start (if you are considering the move):

  1. What reliability targets have you defined for your application? Availability targets, such as Service Level Agreements (SLAs) and Service Level Objectives (SLOs), and recovery targets, such as Recovery Time Objectives (RTOs) and Recovery Point Objectives (RPOs), should be defined and tested to ensure application reliability aligns with business requirements.
  2. How are you handling disaster recovery for this workload? Disaster recovery is the process of restoring system functionality in the wake of a catastrophic failure. It might be acceptable for some systems to be unavailable or partially available with reduced functionality for a period, while other systems may not be able to tolerate reduced functionality.
  3. How does your application logic handle exceptions and errors? Resilient applications should be able to automatically recover from errors by leveraging modern cloud application code patterns including, but not limited to, request timeouts for managing inter-component calls and retry logic to handle transient failures with the appropriate back-off strategies.

These three questions are far from exhaustive, but the hope is that they inspire you to approach your reliability standards more thoroughly and with more confidence.
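The retry logic mentioned in the third question can be sketched as follows. This is a minimal illustration, not a production implementation: `TransientError`, `call_with_retry`, and the default delays are hypothetical names and values, and the wrapped operation is assumed to enforce its own request timeout and raise `TransientError` when it hits a failure worth retrying.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a timed-out call)."""


def call_with_retry(operation, max_attempts=4, base_delay=0.5, max_delay=8.0,
                    sleep=time.sleep):
    """Call operation(), retrying transient failures with exponential back-off.

    The wait doubles after each failed attempt, is capped at max_delay, and is
    randomized (jitter) so many clients do not retry at the same moment.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))


# Usage: a stand-in operation that fails twice before succeeding.
attempts = {"count": 0}


def flaky_order_lookup():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("backend busy")
    return "order found"


print(call_with_retry(flaky_order_lookup, base_delay=0.01))
```

Only errors known to be transient are retried here; retrying every exception would hide genuine bugs and could repeat operations that are not safe to repeat.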

Quick Wins

Interested in looking like a hero? Here are a few areas you can explore to increase the reliability of your systems through consistent availability and resiliency:

Local and Geo Redundancy have never been easier thanks to the rise of cloud providers like Azure, AWS, and GCP. Even if you never leased data center space for your own hardware and instead opted to leverage services from a managed hosting provider, provisioning more hardware took time. Cloud providers allow you to spin up local redundancy (multiple instances in the same data center) within seconds and geographic redundancy (multiple instances across regions) within minutes. With the right architecture, your system could be running on servers on the East Coast and West Coast without anyone leaving their desk.
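A rough way to see why redundancy pays off is to estimate the availability of several instances running side by side. The sketch below assumes instances fail independently of one another and that the system is up as long as at least one instance is up; real deployments rarely meet those assumptions exactly, so treat the result as an optimistic upper bound. The function name is hypothetical.

```python
def combined_availability(instance_availability: float, instances: int) -> float:
    """Estimate availability of redundant instances that fail independently.

    instance_availability is a fraction between 0 and 1 (0.99 means 99%).
    The system is considered down only when every instance is down at once.
    """
    if not 0 <= instance_availability <= 1:
        raise ValueError("instance_availability must be between 0 and 1")
    if instances < 1:
        raise ValueError("instances must be at least 1")
    return 1 - (1 - instance_availability) ** instances


# Compare one, two, and three instances that are each available 99% of the time.
for count in (1, 2, 3):
    print(f"{count} instance(s): {combined_availability(0.99, count):.6f}")
```

Under these assumptions, two 99% instances together reach about 99.99%, which is the kind of gain that local and geographic redundancy aim for.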

Health Monitoring goes hand-in-hand with operational excellence but goes a step further when it comes to reliability. Having proper health monitoring in place will not only increase your system’s observability from an operational perspective, but it will also open the door for additional facilities like auto-healing and process recovery when an unhealthy state is detected. Leveraging tools like Azure Service Health events and Azure Resource Health events, as well as your own system tooling, can go a long way in minimizing the need for human intervention when certain issues arise.
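The idea of detecting an unhealthy state and reacting without human intervention can be sketched in a few lines. This example does not call any Azure API; `run_health_checks`, `heal_unhealthy`, and the component names are hypothetical, and the checks and recovery actions are stand-ins for whatever probes and restart hooks your own tooling provides.

```python
from typing import Callable, Dict, List


def run_health_checks(checks: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run each named check; a check that raises is treated as unhealthy."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


def heal_unhealthy(results: Dict[str, bool],
                   recover_actions: Dict[str, Callable[[], None]]) -> List[str]:
    """Invoke the recovery action for every unhealthy component that has one."""
    recovered = []
    for name, healthy in results.items():
        if not healthy and name in recover_actions:
            recover_actions[name]()
            recovered.append(name)
    return recovered


# Usage with stand-in checks: the cache is reported unhealthy and gets "restarted".
status = run_health_checks({
    "database": lambda: True,
    "cache": lambda: False,
})
restarted = heal_unhealthy(status, {"cache": lambda: print("restarting cache")})
print(status, restarted)
```

Components without a registered recovery action are left alone, which is where alerting a human would come in.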

Leverage Platform Services (PaaS) instead of infrastructure services (IaaS), as high availability and similar concepts come out of the box with this level of service. Having the correct PaaS configuration in place, which is much simpler than configuring IaaS, will significantly reduce your concerns over availability. Similarly, configuring basic backups will go a long way in establishing resiliency, and leveraging services like Azure App Service and/or Azure SQL Database makes establishing and managing those backups much simpler as well. With less to configure and manage, there is also a potential for cost savings in going from IaaS to PaaS.

Queue-based Load Leveling is one of the best things you can do for a workload, especially one that deals with a high level of unpredictability in its usage patterns. Use a queue to act as a buffer between a task and a service it invokes to smooth out intermittent heavy loads that can cause the service to fail or the task to time out. This can help to minimize the impact of peaks in demand on availability and responsiveness for both the task and the service. Introducing the queue will require some refactoring of the workload, so it is up to you to weigh the level of effort against the benefit.
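The buffering idea can be illustrated in-process with Python's standard library. This is only a sketch: in a real deployment the queue would be a durable messaging service rather than an in-memory `queue.Queue`, and `start_worker`, the order IDs, and the handler are hypothetical. Error handling is omitted for brevity, so the handler is assumed not to raise.

```python
import queue
import threading


def start_worker(work_queue, handle, results):
    """Start a thread that drains work_queue at its own pace.

    A None item is used as a sentinel telling the worker to stop.
    """
    def run():
        while True:
            item = work_queue.get()
            if item is None:
                work_queue.task_done()
                break
            results.append(handle(item))
            work_queue.task_done()

    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread


# The bounded queue is the buffer: producers block when it is full instead of
# overwhelming the service behind it.
orders = queue.Queue(maxsize=100)
processed = []
worker = start_worker(orders, lambda order_id: f"processed order {order_id}", processed)

# Simulate a burst of requests arriving faster than they are handled.
for order_id in range(10):
    orders.put(order_id)

orders.put(None)   # signal the worker to stop once the burst is drained
orders.join()      # wait until every queued item has been handled
worker.join()
print(len(processed), "orders processed")
```

The task that accepts orders returns as soon as the item is queued, while the service consuming the queue works through the backlog at a rate it can sustain.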

Are You Fort Knox or Swiss Cheese?

You now see how availability and resiliency go hand-in-hand and why it’s not enough to work solely towards preventing failures; you need to properly recover from them as well. Failures will happen, no matter how much you try to prevent them. But just because they happen doesn’t mean your end-users need to be affected by them. Proper remediation will effectively cover this gap.

At this point we have our costs under control, we are excelling in our operations, and our services are performing efficiently and reliably. But before we reach the state of technical nirvana, there is one last question we need to ask ourselves: are our services built like Fort Knox, or are they as porous as Swiss cheese? In the fifth and final part of this series, we will be discussing defense in depth and how proper security caps everything we have discussed to date.

Reflecting on the questions posed and quick wins provided in this article, how much of this have you experienced already? If none of this was new to you, congratulations: you are well on your way to a highly reliable system! However, if any of this was new, you’ve got yourself a challenge: revisit your approach to reliability, start digging into your high-risk areas, and take steps to mitigate them.

 
Author Jawann Brady