Automate Incident Management with Code

Celestin Ntemngwa
Feb 3, 2021

An incident in a production environment is never pleasant to deal with. Incidents are uncommon, but they do happen, and when one does, you have to be ready for it. Good planning is key to handling incidents. What if your support model let the affected system automatically handle tasks such as triage and incident response? In that case, I suppose your collaborators would work more peacefully, with more resilient systems, and would subsequently add more value to the company, probably through better customer service and higher-quality products or services.

Response-as-code

Response-as-Code (RaC) is a term that I came across, and it is part of the evolution of software development, which has brought about new techniques and “products.” One of them is Infrastructure-as-Code, where you describe the infrastructure in machine-readable definition files and store these files in a code repository together with your source code. This gives you a single source for your application and the infrastructure needed for its deployment.

Similarly, with Response-as-Code, you check in tools and solutions together with your code. The code provides the foundation to automatically identify incidents and resolve problems without involving people such as engineers.

RaC components include generic components, used across all projects to identify and resolve common problems, and project-specific components for issues specific to an individual project. These components are then connected to form a comprehensive system.

The RaC process starts with a playbook. From the playbook, identify common production problems for your team. Determine whether each problem is unique to one system or service or more generic across different services, for example, high network traffic that slows performance. Then design an automated, programmatic response. The response should have an escalation path for cases where it fails or does not work in all scenarios, or where the problem reaches a certain threshold and needs a human to intervene. You could leverage application performance monitoring (APM) to alert based on specific criteria. You can use webhooks or API triggers to activate a script that fixes a problem. For example, AWS Lambda can be used to automatically resolve infrastructure problems (Mackrory, 2020).
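To make the idea of a programmatic response with an escalation path more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the Alert fields, the respond function, the retry count, and the threshold value are all hypothetical names and numbers chosen for this example, not part of any particular monitoring product.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Alert:
    """A simplified alert, as an APM tool might send it (hypothetical shape)."""
    service: str
    metric: str
    value: float


def respond(
    alert: Alert,
    remediate: Callable[[Alert], bool],
    escalate: Callable[[Alert, str], None],
    max_attempts: int = 2,
    escalation_threshold: float = 95.0,
) -> str:
    """Try the automated fix; hand over to a human if it is not enough."""
    # Problems above the threshold go straight to a human.
    if alert.value >= escalation_threshold:
        escalate(alert, "threshold exceeded")
        return "escalated"

    # Otherwise, retry the automated fix a limited number of times.
    for _ in range(max_attempts):
        if remediate(alert):
            return "resolved"

    # The automated response failed, so follow the escalation path.
    escalate(alert, "automated remediation failed")
    return "escalated"
```

In practice, remediate could wrap a restart or scale-out script, and escalate could page the on-call engineer. Both are passed in as plain functions so that the same response logic can be reused across projects.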

Identify and design patterns to detect issues

Identify the critical technology stacks across the organization and compile a library of code solutions that automatically detect common problems. Referring to past issues will help you identify similar issues programmatically in the future.
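One way to picture such a library is as a small registry of detection functions, each encoding a pattern learned from a past issue. The Python sketch below is an assumption-laden illustration: the metric names (network_mbps, p95_latency_ms) and the thresholds are invented for the example and would come from your own monitoring data.

```python
from typing import Callable, Dict, List

Metrics = Dict[str, float]
Detector = Callable[[Metrics], bool]

# The shared library of detection patterns, keyed by problem name.
DETECTORS: Dict[str, Detector] = {}


def detector(name: str) -> Callable[[Detector], Detector]:
    """Decorator that adds a detection function to the library."""
    def register(fn: Detector) -> Detector:
        DETECTORS[name] = fn
        return fn
    return register


@detector("high_network_traffic")
def high_network_traffic(metrics: Metrics) -> bool:
    # Hypothetical threshold; tune it from past incidents.
    return metrics.get("network_mbps", 0.0) > 800.0


@detector("slow_responses")
def slow_responses(metrics: Metrics) -> bool:
    # Hypothetical threshold on 95th-percentile latency.
    return metrics.get("p95_latency_ms", 0.0) > 1500.0


def detect(metrics: Metrics) -> List[str]:
    """Return the names of all known problems that match the metrics."""
    return sorted(name for name, check in DETECTORS.items() if check(metrics))
```

Generic detectors like these can live in a shared package, while a project adds its own project-specific ones with the same decorator.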

Build your solutions

By now, you have a collection of problem identification and troubleshooting tools and a library of possible solutions. For example, an AWS Lambda solution could be used to resolve many AWS-related issues. The Lambda logic could also be used in other clouds and for on-premises solutions.
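As a rough sketch of what such a Lambda solution might look like, the Python handler below receives a webhook call and maps the reported problem to a fix. Several things here are assumptions made for illustration: that the function is invoked over HTTP with a JSON body, that the body carries problem and service fields, and that restart_service stands in for a real fix, which in a real deployment would call a cloud SDK or an on-premises script.

```python
import json


def restart_service(service: str) -> bool:
    """Placeholder fix (hypothetical). Replace with a real SDK call or script."""
    print(f"restarting {service}")
    return True


# Maps a detected problem to its automated fix. Keeping this table separate
# from the handler lets the same logic run outside AWS as well.
REMEDIATIONS = {"service_unresponsive": restart_service}


def lambda_handler(event, context):
    """Entry point in the usual (event, context) form of a Python Lambda."""
    body = json.loads(event.get("body") or "{}")
    problem = body.get("problem")
    service = body.get("service")

    fix = REMEDIATIONS.get(problem)
    if fix is None or not service:
        # Unknown problem or incomplete alert: leave it to a human.
        return {"statusCode": 422, "body": json.dumps({"action": "escalate"})}

    ok = fix(service)
    return {
        "statusCode": 200 if ok else 500,
        "body": json.dumps({"action": "resolved" if ok else "escalate"}),
    }
```

Because the handler only looks things up in REMEDIATIONS, porting it to another cloud or to an on-premises job mostly means swapping the entry point, not the fixes themselves.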

Continuously Improve

The RaC process is a continuous one. Share your solutions and keep improving them by adding new features, maintaining the code, and so on.

Also, there are many free and paid tools to automate the incident process. Some of the free tools include, but are not limited to, TheHive, AlienVault, GRR Rapid Response, Cyphon, SANS Investigative Forensics Toolkit (SIFT), Volatility, CrowdStrike CrowdResponse, and Cyber Triage.

Reference

Mackrory, M. (2020). How to Automate Incident Management with Code and Get Better Results. Retrieved from https://thenewstack.io/how-to-automate-incident-management-with-code-and-get-better-results/
