Building High-Quality Cloud Solutions - Part 2: Operational Excellence

Understanding the health of your systems is a moving target. Dynamic factors such as ageing technology and shifting business priorities require IT organizations to be as agile and responsive as possible, often at a superhuman level. 

But we are just mortals, so how do we keep up? Believe it or not, part of the answer includes the use of robots…well, not literally, but you’ll see what I mean. 

In Part 1 of this series, we discussed proper cloud utilization by examining the environment through the lens of cost management. If you don’t know how to get and keep your costs under control, everything that comes next is largely irrelevant because it won’t be sustainable.  

But that’s not you. Now that you are confident you can keep your costs in line, other doors will open. The “doors” I’m referring to are the tenets represented within a standard architecture framework supported by major cloud platform providers. Each of these tenets helps form the complete picture in determining the proper direction for your cloud journey. As discussed in Part 1, the five tenets are:

  • Cost Optimization 
  • Operational Excellence 
  • Performance Efficiency 
  • Reliability 
  • Security 

In this second part of the series, we will focus on Operational Excellence. My goal is to highlight why proper monitoring, instrumentation, and, most importantly, automation are invaluable.

While the concepts presented are universal, the specific examples will be centered around Microsoft's Azure cloud platform, mainly because that aligns with the breadth of my experience and recent assessments. 

Dev versus Ops, No One Wins

In the old days, development and operations were two separate, siloed departments that would take every opportunity to shift responsibility for bad business outcomes to the other. They had priorities that were often in direct conflict. Since then, these silos have given way to more seamless, integrated teams, powering organizational adoption of new practices that improve operational agility.

Observability and automation are key pillars of operational excellence. A workload’s operational considerations are just as important as the work being performed and should be treated as such from the outset. A solid monitoring architecture is a requirement for turning observability into operational insights. Knowing what, when, where, and how to monitor the various pieces of the application and environment is key to keeping tabs on the overall health of the workload. Monitoring should have a surface area that touches everything from your core application (tracing, exception logging) and immediate environment (App Service, Kubernetes Service, or Virtual Machines) to the foundation of the platform (Azure Service Health). Without that visibility, you will constantly be reactive (firefighting) and focused on the wrong goals.

With adequate visibility, you can figure out the appropriate ways to ingest and respond to abnormalities, which is a key opportunity for automation. While having on-call staff is still necessary in some circumstances, many incidents can now be remedied without human intervention. The most extreme issues would still trigger an alert to a person, but many of the others should trigger automation, with properly configured runbooks to kill rogue processes, auto-scale overloaded systems, and “auto-heal” unhealthy systems.
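To make that split between "page a human" and "run a runbook" more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the alert kinds, the resource names, and the runbook functions are hypothetical stand-ins that only return a description of what they would do. In a real environment they would call whatever automation tooling you have configured.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Callable, Dict, List


class Severity(IntEnum):
    INFO = 0
    WARNING = 1
    CRITICAL = 2


@dataclass(frozen=True)
class Alert:
    kind: str          # hypothetical alert type, e.g. "overloaded"
    resource: str      # hypothetical resource name, e.g. "web-01"
    severity: Severity


# Hypothetical runbooks: each one only describes the action it would take.
def kill_rogue_process(alert: Alert) -> str:
    return f"killed rogue process on {alert.resource}"


def scale_out(alert: Alert) -> str:
    return f"scaled out {alert.resource}"


def restart_instance(alert: Alert) -> str:
    return f"restarted {alert.resource}"


RUNBOOKS: Dict[str, Callable[[Alert], str]] = {
    "rogue_process": kill_rogue_process,
    "overloaded": scale_out,
    "unhealthy": restart_instance,
}


def handle_alert(alert: Alert, pages: List[str]) -> str:
    """Run the matching runbook; page on-call for critical or unknown alerts."""
    runbook = RUNBOOKS.get(alert.kind)
    if alert.severity == Severity.CRITICAL or runbook is None:
        # No safe automated remedy, so a human needs to look at it.
        pages.append(f"{alert.kind} on {alert.resource}")
        return "paged on-call"
    return runbook(alert)
```

The design choice worth noticing is the fallback: anything critical, or anything the automation does not recognize, goes to a person rather than being silently dropped.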

Go on the Offensive 

Having proper automation in place is great for being reactive, but it really shines when leveraged to be proactive. Managing the infrastructure for any type of workload involves configuration tasks, some of which are labor-intensive, prone to error, or just downright inefficient. Infrastructure as Code (IaC) has emerged as a popular process for managing infrastructure (such as networks, virtual machines, and load balancers) in a declarative model. That declarative model takes the form of “code” that is executed against a cloud platform (public or private) to establish the declared environment. One of the major problems within IT Operations that IaC evolved to solve is environment drift. That’s a substantial topic on its own, but it can be summed up this way: the configuration of servers and other environmental resources often begins to change subtly because of operational tasks like installing patches and deploying applications. The configurations drift away from how they were initially declared.
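Drift is easier to picture with a small example. The Python sketch below compares a declared configuration against what is actually running and reports every setting that differs. It is a simplified illustration, not how any particular IaC tool works internally, and the setting names and values are hypothetical.

```python
from typing import Any, Dict, List, Tuple

Config = Dict[str, Any]
MISSING = "<missing>"  # marker for a setting present on only one side


def detect_drift(declared: Config, actual: Config) -> List[Tuple[str, Any, Any]]:
    """Return (setting, declared_value, actual_value) for each difference."""
    drift = []
    for key in sorted(declared.keys() | actual.keys()):
        want = declared.get(key, MISSING)
        have = actual.get(key, MISSING)
        if want != have:
            drift.append((key, want, have))
    return drift


# Hypothetical example: a server that was patched and debugged by hand.
declared = {"vm_size": "Standard_B2s", "patch_level": "2024-01", "open_ports": [443]}
actual = {
    "vm_size": "Standard_B2s",
    "patch_level": "2024-03",
    "open_ports": [443, 8080],
    "debug_agent": True,
}

for setting, want, have in detect_drift(declared, actual):
    print(f"{setting}: declared={want!r} actual={have!r}")
```

Here three settings have drifted (an extra debug agent, an extra open port, and a newer patch level) while the VM size still matches. Re-applying the declared configuration is what pulls the environment back in line.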

Once your infrastructure is configured through code, it is significantly easier to automate and slot into your source control system and deployment pipelines alongside your application code. This provides benefits your applications have been leveraging for years, such as an audit trail of changes, version history, continuous builds and testing, as well as automated releases and rollbacks.

Operational excellence is not a goal to be accomplished but rather something toward which you should continually strive. Industry and technology practices will continue to evolve. Keeping abreast of trends will enable you to better adapt to dynamic factors outside and within your business.

Food for Thought 

Here are a few questions to get you thinking about how to handle things in your current environment or how you can set things up in a desirable way from the start (if you are considering the move): 

  1. Are availability targets (such as SLAs) or recovery targets (such as RTO/RPO) defined for the application and key scenarios? To assess overall operational effectiveness and application utilization, it is critical to understand system flow. Implementing a health model for the application enables proper adjustment of utilization and improves your ability to meet business needs and cost goals.
  2. How are you monitoring your resources? It's not enough to console.log or _logger.LogError in seemingly random areas of your applications. A well-thought-out logging framework should be implemented with the appropriate structured format and severity levels to prevent turning app logs into white noise. And logging should not begin and end with the core application; you should be collecting and correlating logs from the various environments and dependencies. 
  3. How do you surface workload data and respond to issues when they occur? Dashboarding tools should be used to visualize metrics and events collected at the application and resource levels to illustrate the health model and the current operational state of the workload. Understanding what makes the state of a system healthy versus unhealthy is paramount in determining how to respond when that state changes.

These three things are far from exhaustive, but the hope is that they inspire you to approach your cloud operations a lot more thoroughly and with more confidence. 
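To make the second question a bit more concrete, here is a minimal sketch of structured logging using Python's standard logging module. Each record is emitted as one JSON object per line, with an explicit severity level and a correlation ID that lets you tie entries from different components back to one request. The `correlation_id` field name, the logger name, and the `build_logger` helper are assumptions for illustration, not a prescribed schema.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Hypothetical field, supplied by the caller through `extra=`.
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(payload)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Create a logger that writes JSON lines at INFO level and above."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream)  # defaults to stderr
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("orders")
    log.debug("cache warm-up details")  # below INFO, so it is dropped
    log.error("payment failed", extra={"correlation_id": "req-123"})
```

Because every entry has the same shape, a central log store can filter by level or group by correlation ID instead of searching free text.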

Quick Wins 

Interested in looking like a hero? Here are a few areas you can explore to increase the visibility of your systems and properly leverage automation:

Configure resource-level monitoring. Nothing in the world of technology is infallible, and the Azure cloud is no exception. Making sure that Service Health and Activity logging are properly configured on essential resources will allow for an appropriate response to things like service outages that might affect your workloads. 

Create at least one Log Analytics Workspace. Think of this as a centralized repository where all your logging gets stored, irrespective of the source. Any workload or supporting service that generates logging can be piped into a workspace for broader analysis and appropriate alerting. 

Configure CI/CD Pipelines for deployments. This will improve consistency in those deployments by removing the potential for human error and reducing the operational overhead involved with manual deployments. 

Take a declarative approach to infrastructure management. This can be accomplished through tools such as HashiCorp’s Terraform, Azure’s ARM Templates, or Azure’s Bicep. These tools will significantly reduce the operational costs involved in provisioning resources and help prevent environment drift.

Build suites of automated tests. These should range from unit and integration tests to security tests and performance tests. The more of the human component you remove from testing, the faster and (more importantly) the more consistently it can be done.

Are demands being met efficiently? 

Having the right amount of visibility combined with the right amount of automation will put your workloads in an excellent operational place. You’ll be ready to hit your agreed-upon SLAs, RTOs, and RPOs. The visibility and automation will significantly increase confidence in your deployments, leading to more rapid deployments and improved responsiveness to the market. Sounds great, right? Performance Efficiency is where we will be taking operations to the next level, ensuring the workload is able to efficiently meet end users’ demands.
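If you have never translated an availability target into actual minutes, a quick calculation makes the stakes clearer. The sketch below assumes a 30-day period and a hypothetical helper name. Your own agreements may measure availability over a different window or exclude planned maintenance.

```python
def allowed_downtime_minutes(sla_percent: float, period_days: int = 30) -> float:
    """Minutes of downtime permitted by an availability target over a period."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


for target in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% over 30 days allows about {minutes:.2f} minutes of downtime")
```

At 99.9% over 30 days, the budget is roughly 43 minutes. That is not much time for a person to be paged, wake up, and diagnose a problem, which is exactly why visibility and automated responses matter.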

Reflecting on the questions posed and quick wins provided in this article, how much of this have you experienced already? If none of this was new to you – congratulations because you are well on your way to operational excellence! However, if any of this was new, I challenge you to revisit your operational approach and start to dig in to increase visibility and offload tasks to automation. 
