Problem Manager 2.0: Resilience Engineering Advocate

The modern DevOps landscape requires an elevated approach to problem management that centers on reliability, partnership, and customer needs

Laurel Frazier
Jan 18th, 2022

Almost every organization of a certain size and scale that practices ITSM has at least one problem management role in its ranks. In most cases, that role is to facilitate the problem management process, which typically focuses on uncovering the root cause of incidents and implementing changes to prevent recurrence.

In today’s complex cloud environments, problem managers must look beyond root cause analysis, problem prevention, and service improvements to be successful. Incidents rarely have a single cause; they have many contributing factors, and each one is an opportunity to drive improvement. To cultivate an operations engine that is reliable and resilient, problem managers must themselves be embedded in the entirety of that system.

How might the tasks and responsibilities of a modern problem manager differ from those we’ve seen for years? Many of them will be the same. However, three significant changes will enable the shift from reactive to proactive (or even self-healing) operations. These changes will ultimately reduce incidents, increase reliability and availability, and deliver a fundamentally better product or service that exceeds customer needs and expectations.

1. Operates as an objective representative for all

An advocate offers independent, even-handed support to everyone and is not beholden to any one group or leader; instead, they adopt the mindset that reliability is the cornerstone of success for any product, service, or company. By working cross-functionally, a resilience engineering advocate operates as an unbiased facilitator who reduces friction and enables cross-team partnership and understanding. This ensures that problems are visible across teams and that the needs of one team are not ignored or unnecessarily deprioritized by another.

A vital component of a continuous feedback culture within DevOps is a set of defined, consistent avenues through which feedback is collected, vetted, and implemented. This is where a resilience engineering advocate comes in: they own the feedback process, ensuring that the feedback ‘loop’ truly exists. While not all feedback may be implemented, everyone understands that their feedback will receive consideration and a response. Feedback no longer falls into a bottomless pit, never to be seen again; people’s voices are heard, and an independent, dedicated person or team manages the process end to end.

2. Serves as the voice of the customer, always

Resilience engineering advocates are the authentic voice of the customer and understand that the customer’s experience is paramount to a product, service, or company’s success. The customer is their north star: customer feedback directs where to focus efforts and decisions. Their partnerships across the business ensure that customer feedback collected by all teams, including those outside TechOps, is known and addressed as needed. Advocates actively engage in launch planning to ensure a flawless delivery of a new feature, product, or service. In addition, advocates partner with engineering teams to implement chaos engineering practices that regularly test systems and identify and address vulnerabilities before they become a customer’s problem.

3. Views crisis and failure as growth opportunities

Resilience engineering advocates run toward a crisis, taking the opportunity to drive improvements that result from major incidents. Major incidents should never be seen as failures but rather as opportunities that will ultimately drive meaningful improvements across the ecosystem. Advocates transform post-incident reviews into retrospectives that promote growth through reflection, education, and opportunity in a blame-free environment that fosters engagement rather than confrontation.

A modern, elevated approach to problem management

When you put this all together, the responsibilities and areas of ownership for a resilience engineering advocate may look something like this:

  • Owns post-incident reviews and follow-up action plans end to end
  • Enables cross-team feedback implementation and transparency as an unbiased facilitator
  • Tracks and drives learning opportunities and improvements across the full incident lifecycle (avoidance, detection, mitigation, and restoration) in partnership with teams
  • Addresses immediate exposures across all areas
  • Serves as the voice of the customer, advocating to improve processes, products, and experiences
  • Tracks and analyzes threat responses to spot trends and implement improvements
  • Partners cross-functionally with product, business, and technology teams to ensure prevention and preparedness
  • Identifies patterns and escalates bugs worthy of review
  • Practices chaos engineering and frequent testing to expose vulnerabilities and enable better handling of incidents
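As one illustration of the trend-spotting responsibility in the list above, here is a minimal Python sketch that counts how often each contributing factor recurs across post-incident reviews. Everything in it is an assumption made for illustration: the `IncidentReview` record, its field names, the `recurring_factors` helper, and the sample data are all hypothetical and not taken from any particular tool or process.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class IncidentReview:
    """One post-incident review (hypothetical record; field names are assumed)."""
    incident_id: str
    contributing_factors: List[str] = field(default_factory=list)


def recurring_factors(
    reviews: List[IncidentReview], min_incidents: int = 2
) -> List[Tuple[str, int]]:
    """Return (factor, incident_count) pairs seen in at least min_incidents reviews."""
    counts: Counter = Counter()
    for review in reviews:
        # Count each factor once per incident so one noisy review cannot dominate.
        counts.update(set(review.contributing_factors))
    return [(factor, n) for factor, n in counts.most_common() if n >= min_incidents]


# Made-up sample data, for illustration only.
reviews = [
    IncidentReview("INC-1", ["missing alert", "manual deploy step"]),
    IncidentReview("INC-2", ["missing alert", "expired certificate"]),
    IncidentReview("INC-3", ["manual deploy step", "missing alert"]),
]

for factor, n in recurring_factors(reviews):
    print(f"{factor}: {n} incidents")
```

Counting contributing factors rather than a single root cause mirrors the article's point that incidents have many causes; a factor that shows up across several reviews is a natural candidate for a cross-team improvement.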

The TechOps landscape continues to change quickly, adding a great deal of complexity to deploying, operating, and maintaining software and its infrastructure. At the same time, customers expect a seamless experience 24/7. Establishing and maintaining reliability for your product or service must begin and end with resilience.