Change and Incident Management
We've discussed how Building an Agile Culture has become a necessity for most organizations, and the importance of assessing Operational Readiness for your solution. Change Management and Incident Management are the cornerstones of IT Operations Management, and in an agile environment where changes occur frequently, they are key components of your organization.
Change Management and Incident Management are well-defined IT Service Management (ITSM) practices that are included in the ITIL framework.
While these two topics may not be intrinsically related, I believe that they are so interconnected that they should be rooted in the same premise.
Incident Management is all about having a structured and well-defined process when responding to an event or outage, and Change Management is about having a process to evaluate the risk of making a change in the environment that could potentially cause an incident. If you don't have the means of properly evaluating the impact of an outage on your environment, how can you properly assess the risk of a given change?
To that end, I propose the following framework/workflow for reference. While it may seem a bit complex at first, the goal of this model is to make the process of assessing impact less of an art and more of a science, removing some of the subjectivity that may get in the way when an outage is occurring and judgments are clouded by the urgency of the moment.
No one should ever fault you for having an established and well-publicized process and for following it. They may not agree with some of the details, but you can always invite them to discuss it and let them influence future changes to it.
Simplicity is a prerequisite for reliability.
The worksheets displayed below are available as a shared Google Spreadsheet.
Outage Impact Assessment
Evaluating the consequence of an outage is generally very subjective. If, for instance, the impact is a complete outage of a computer system that has a single user, I am sorry to say that the response of your IT staff will be very different depending on whether that single user is a new intern or the CEO of the company. A Development system may very well be considered a Production system by your developers if they are unable to do their job without it. But maybe they can develop on their local system? It truly depends on your situation, and you should keep in mind that what is true today may no longer be true tomorrow: you must re-evaluate the importance of that system on a regular basis.
Once again, clarity and transparency are key. You should have these discussions with the service's users and stakeholders and come to a consensus on the outcome, maybe recorded in a well-defined Service Level Objective (SLO) document. If necessary, incentivize users to provide and maintain accurate information about the value of the system to their team; for example, a production system may get more frequent backups but cost more per month.
The importance rating of each service must be performed at the enterprise level. Everyone thinks that their Division is more important than the others, but the rating is relative and should remain fairly objective.
Understanding and documenting the dependencies in your environment is also key, especially with microservices. Maybe your PoC service is an upstream dependency of a production system? The C4 model (see Create Diagrams from Text) is quite useful here. Documentation is critical, and you don't need a systems architect to do Visio diagrams all day long: I will take a crude but accurate diagram over a "pretty" one any day.
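To make the "PoC upstream of production" question concrete, here is a minimal Python sketch that walks a documented dependency map and lists every non-production service that a production system ultimately relies on. The service names, the `DEPENDS_ON` and `TIER` inventories, and both helper functions are hypothetical and only illustrate the idea; a real inventory would come from your own documentation.

```python
# Hypothetical inventory: each service maps to the services it depends on
# (its upstream dependencies). None of these names come from a real system.
DEPENDS_ON = {
    "checkout": ["payments", "catalog"],
    "payments": ["fraud-poc"],
    "catalog": [],
    "fraud-poc": [],
    "sandbox": ["catalog"],
}

# Hypothetical tier assigned to each service.
TIER = {
    "checkout": "production",
    "payments": "production",
    "catalog": "production",
    "fraud-poc": "poc",
    "sandbox": "development",
}


def upstream_of(service, depends_on):
    """Return every direct and transitive dependency of `service`."""
    seen = set()
    stack = list(depends_on.get(service, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:  # the `seen` set also protects against cycles
            seen.add(dep)
            stack.extend(depends_on.get(dep, []))
    return seen


def hidden_production_dependencies(depends_on, tier):
    """Non-production services that at least one production service relies on."""
    hidden = set()
    for service, service_tier in tier.items():
        if service_tier == "production":
            for dep in upstream_of(service, depends_on):
                if tier.get(dep) != "production":
                    hidden.add(dep)
    return hidden


if __name__ == "__main__":
    print(sorted(hidden_production_dependencies(DEPENDS_ON, TIER)))
```

Any service this reports is a candidate for a higher importance rating than its label suggests.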
Availability
To be available, a system has to be accessible and usable. If a user cannot access the system, or if the system is slow or unresponsive, it is – from the user's point of view – unavailable. Generally, the term downtime is used to refer to periods when a system is unavailable.
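As a small worked example, availability over a reporting period can be expressed as the share of time that was not downtime. The function name and the choice of minutes as the unit are assumptions made for this sketch; following the definition above, periods when the system was too slow to use should be counted as downtime as well.

```python
def availability_percent(period_minutes, downtime_minutes):
    """Percentage of the period during which the system was accessible and usable."""
    if period_minutes <= 0:
        raise ValueError("period must be positive")
    if not 0 <= downtime_minutes <= period_minutes:
        raise ValueError("downtime must be between 0 and the period length")
    return 100.0 * (period_minutes - downtime_minutes) / period_minutes


if __name__ == "__main__":
    # A 30-day month has 30 * 24 * 60 = 43,200 minutes.
    print(availability_percent(43_200, 432))
```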
Incident Severity
The Severity is the amount of damage or harm an issue has caused to a system's availability, as viewed from the user's experience, measured in terms of success or failure.
To determine the Incident's severity, choose the highest relevant category:
Incident Impact Type
The Type of Impact of the Incident factors in the frequency of occurrence, measured by how often the issue occurs, as well as the persistence of the issue, measured by observing whether users who experience a problem can learn to avoid or overcome the issue going forward.
Types are:
- Sustained - the issue is constant for all users for an undetermined amount of time
- Temporary - the issue is constant for all users but only for a short and predetermined duration
- Sporadic - the issue only manifests itself “once in a while” or only for some users
The Incident Severity, combined with the Impact Type, is used to select the Impact Type multiplier.
Choose the highest relevant category:
Impacted Systems weights and Incident Category Rating
Systems are assigned an arbitrary Systems Weight based on their relative level of criticality to the Organization.
This weight, combined with the Impact Type multiplier, is used to determine the Impact Score.
Choose the impacted system with the highest score:
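The scoring chain described in this section can be sketched in a few lines of Python. Everything numeric here is a placeholder: the severity names (`outage`, `degraded`, `minor`), the multiplier values, the system names and their weights are all hypothetical, and should be replaced with the values from your own worksheets. The sketch also assumes that "combined" means multiplying the Systems Weight by the Impact Type multiplier.

```python
# Placeholder lookup: (severity, impact type) -> Impact Type multiplier.
# The severity names and every number are illustrative assumptions.
MULTIPLIER = {
    ("outage", "sustained"): 5,
    ("outage", "temporary"): 4,
    ("outage", "sporadic"): 3,
    ("degraded", "sustained"): 4,
    ("degraded", "temporary"): 3,
    ("degraded", "sporadic"): 2,
    ("minor", "sustained"): 3,
    ("minor", "temporary"): 2,
    ("minor", "sporadic"): 1,
}

# Placeholder Systems Weights: higher means more critical to the Organization.
SYSTEM_WEIGHT = {"payroll": 10, "email": 8, "wiki": 3}


def impact_score(system, severity, impact_type):
    """Systems Weight multiplied by the Impact Type multiplier (assumed)."""
    return SYSTEM_WEIGHT[system] * MULTIPLIER[(severity, impact_type)]


def worst_impacted(impacts):
    """Return (score, system) for the impacted system with the highest score.

    `impacts` is an iterable of (system, severity, impact_type) tuples.
    """
    return max((impact_score(system, sev, kind), system)
               for system, sev, kind in impacts)


if __name__ == "__main__":
    incident = [
        ("wiki", "outage", "sustained"),     # 3 * 5 = 15
        ("payroll", "minor", "sporadic"),    # 10 * 1 = 10
        ("email", "degraded", "temporary"),  # 8 * 3 = 24
    ]
    print(worst_impacted(incident))  # (24, 'email')
```

Note how a degraded but critical system outranks a complete outage of a low-weight one, which is exactly the kind of judgment the weights are meant to take out of the heat of the moment.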
The mark of a great ship handler is never getting into situations that require great ship handling.
Incident Management
Once you have a clear understanding of the impact of a full outage on a particular system, and a way of quantifying it, you can start defining your Incident Management process.
Scope
The Incident Management process must be followed regardless of the actual cause of the reported outage, whether it is caused by an issue in your internal data center, or with a cloud provider or an ISP.
The Incident Management process is NOT appropriate for user requests or for tracking issues with users' systems or local configuration; it is meant to track issues with shared institutional systems. It is also not appropriate for users to request changes or new applications; a Service Request process should be implemented for that purpose.
Goal
The goal of Incident Management is the rapid restoration of the impacted service or services. Depending on the Severity and Impact of the outage, a Problem Management process may be necessary to determine the root cause of one or a series of related incidents. The two processes serve related but distinct goals.
Steps
- Detect
- Respond
- Recover
- Learn
- Improve
The Category Rating of any given Incident or potential Incident is calculated as a factor of the Incident Severity and the Type of Impact that it has on the Impacted Systems. This rating is used in Incident Management to determine the extent of the outage, as well as in Change Management to determine the potential risk of the actions being performed.
The Impact Score of the most critical impacted system is used to determine the Incident Category Rating, which includes the most appropriate action and communication plan for the situation.
Incident Category Rating and Communication Plan
The Impact Score Rating determines the Incident Category and the steps that need to be followed in order to properly communicate and record the event:
Note that these actions are provided as an example; I'm sure you'll have additional steps to include.
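Mapping an Impact Score to a category and its communication steps is a simple threshold lookup, sketched below in Python. The score thresholds, the category names and the listed actions are all hypothetical placeholders, not the values from the worksheets; only the shape of the lookup is the point.

```python
# Placeholder bands, highest first: (minimum score, category, example actions).
# Thresholds, names and actions are illustrative assumptions only.
CATEGORY_BANDS = [
    (40, "Category 1", ["page on-call", "notify leadership", "post-incident review"]),
    (25, "Category 2", ["page on-call", "post status update", "post-incident review"]),
    (10, "Category 3", ["open ticket", "post status update"]),
    (0, "Category 4", ["open ticket"]),
]


def incident_category(score):
    """Return (category, actions) for the first band whose minimum the score meets."""
    for minimum, category, actions in CATEGORY_BANDS:
        if score >= minimum:
            return category, actions
    raise ValueError("impact score must not be negative")


if __name__ == "__main__":
    category, actions = incident_category(24)
    print(category, actions)  # Category 3 ['open ticket', 'post status update']
```

Keeping the bands in one table means that when stakeholders renegotiate a threshold, a single line changes and the process stays transparent.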
Change Management
The Change Management process should use the exact same Impacted Systems weights and Incident Category Rating that we discussed above. The measure of an outage to your organization hasn't changed; the only difference is that we should include the likelihood of that outage occurring during the change. More on that below; let's first define our terminology.
Scope
The Change Management process applies to all changes made to any of the company's IT systems, on-premises or in the cloud.
Goal
The Change Management process uses the same impact rating as the Incident Management process. The goal of Change Management is to properly assess the level of risk of any change made to the Organization's services so that the proper communication channels and approval can be followed. Additional goals are to:
- Support timely and effective implementation of IT-required changes
- Appropriately manage risk to the Organization
- Minimize the negative impact of changes to/for the Organization
- Ensure that changes achieve desired business outcomes
- Ensure that governance and compliance expectations are met
The Category Rating of any given Incident or potential Incident is calculated as a factor of the Incident Severity and the Type of Impact that it has on the Impacted Systems.
Likelihood
The likelihood is the probability of the hazard occurring and it is often ranked on a five-point scale:
- Certain - 80 to 100% - An issue will very likely manifest itself during or after the change
- Likely - 60 to 80% - It is probable that an issue will occur during or after the change
- Possible - 40 to 60% - An issue can but will most likely not occur during or after the change
- Unlikely - 20 to 40% - There is a remote possibility that an issue may occur during or after the change
- Rare - 0 to 20% - An issue is very unlikely to occur during or after the change
The risk factor of the proposed change is measured by multiplying the highest possible Incident Category Rating by the highest Likelihood of this hazard occurring:
Risk = Likelihood x Severity
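A minimal Python sketch of this calculation follows. It makes three assumptions that are not stated in the text: each likelihood band is scored from 1 (Rare) to 5 (Certain), a probability that falls exactly on a boundary such as 80% belongs to the higher band, and the Incident Category Rating is available as a number that plays the Severity role in the formula above.

```python
# Likelihood bands from the five-point scale, highest first:
# (lower bound as a fraction, assumed rank, label).
LIKELIHOOD_BANDS = [
    (0.80, 5, "Certain"),
    (0.60, 4, "Likely"),
    (0.40, 3, "Possible"),
    (0.20, 2, "Unlikely"),
    (0.00, 1, "Rare"),
]


def likelihood_rank(probability):
    """Map a probability in [0, 1] to (rank, label); boundaries go to the higher band."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    for lower_bound, rank, label in LIKELIHOOD_BANDS:
        if probability >= lower_bound:
            return rank, label


def change_risk(probability, category_rating):
    """Risk = Likelihood x Severity, using a numeric category rating (assumed)."""
    rank, _label = likelihood_rank(probability)
    return rank * category_rating


if __name__ == "__main__":
    print(likelihood_rank(0.65))  # (4, 'Likely')
    print(change_risk(0.65, 3))   # 4 * 3 = 12
```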
When applying changes to the Organization's systems and supporting infrastructure, the IT Team will evaluate the potential impact of the change based on the following Risk Assessment Matrix:
Change Risk Rating and Communication Plan
Depending on the Risk Factor, as well as the Urgency of the change, the IT team will take the following steps to inform users of the change: