Automating Incident Resolution: The Journey of SingleStore's Slackbot

At SingleStore, we are constantly searching for ways to improve our processes, especially when it comes to incident resolution.

Previously, we touched on how SingleStore Jobs helped us enhance our incident resolution practices. In this blog, we’ll explore how we developed and evolved our incident bot agent and the transformative impact it has had on our Helios® platform. We’ll also touch on ongoing improvements and the exciting future we envision.

Incident resolution in a distributed system

The nature of our Helios platform involves running complex distributed systems at scale. Like any sophisticated system, incidents occur — often stemming from issues like cluster scaling, rebalance failures or unexpected customer workloads. These incidents posed significant challenges to both our engineering and Site Reliability Engineering (SRE) teams.

For example, a common scenario involved customers initiating scaling operations while actively running read/write workloads. These operations, although necessary, created strain on our distributed system, resulting in failures such as rebalances getting stuck or workloads timing out. Debugging these issues was a time-consuming process that lacked a streamlined approach.

Handling an incident

When an incident occurred, the on-call engineer would need to:

- Identify the root cause of the issue
- Access various internal dashboards and check process lists
- Investigate customer workflows and operational events by running several terminal query commands
- Manually collect logs, cluster information and troubleshooting details

While this approach worked, it was inefficient, risky and error-prone. Engineers often had root access to critical systems to perform these operations, introducing potential security risks for production environments. Moreover, the manual nature of incident handling meant it could take anywhere from 15–30 minutes (or longer) to gather the necessary information before even starting to solve the issue.

The birth of the incident bot

Recognizing these inefficiencies, we decided to build a Slack-based incident bot to streamline the entire process. The idea was simple but powerful: when an incident is reported, the bot fetches all relevant information, automates initial troubleshooting steps and surfaces everything directly in Slack for the responder. This solution significantly reduced the time and effort required to diagnose and resolve issues.

How the incident bot works

The incident bot integrates multiple components to achieve its functionality. Here’s an overview of the architecture and tools involved:

OpsAPI. We developed a set of internal APIs, collectively called OpsAPI, to enable safe, audited interactions with cluster operations. It consists of:
- Read APIs. For retrieving system/process status, cluster logs, and global states.
- Mitigation APIs. For performing operations such as triggering a rebalance or restarting a process.
  
  Unlike manual access by engineers, OpsAPI ensures all actions are audited, routed through a secure control plane and tailored for production environments.
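
To make the read/mitigation split more concrete, here is a minimal sketch of what an audited wrapper around such operations could look like. It is not the actual OpsAPI: the names `OpsApiClient`, `AuditRecord`, `read` and `mitigate` are hypothetical, the operations are plain Python callables standing in for control-plane calls, and rejecting unregistered operations is an assumption about how "safe" might be enforced.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Callable, Dict, List


@dataclass
class AuditRecord:
    """One audited call: who did what, to which cluster, and when."""
    actor: str
    kind: str  # "read" or "mitigation"
    operation: str
    cluster_id: str
    timestamp: str


@dataclass
class OpsApiClient:
    """Hypothetical client: only registered operations run, and each is logged."""
    actor: str
    read_ops: Dict[str, Callable[[str], Any]]
    mitigation_ops: Dict[str, Callable[[str], Any]]
    audit_log: List[AuditRecord] = field(default_factory=list)

    def _run(self, kind: str, ops: Dict[str, Callable[[str], Any]],
             operation: str, cluster_id: str) -> Any:
        # Assumption: anything outside the allow-list is refused outright.
        if operation not in ops:
            raise PermissionError(f"{kind} operation not allowed: {operation}")
        # Record the call before executing it, so failed calls are audited too.
        self.audit_log.append(AuditRecord(
            actor=self.actor,
            kind=kind,
            operation=operation,
            cluster_id=cluster_id,
            timestamp=datetime.now(timezone.utc).isoformat(),
        ))
        return ops[operation](cluster_id)

    def read(self, operation: str, cluster_id: str) -> Any:
        """Read API: status, logs, global state."""
        return self._run("read", self.read_ops, operation, cluster_id)

    def mitigate(self, operation: str, cluster_id: str) -> Any:
        """Mitigation API: e.g. trigger a rebalance or restart a process."""
        return self._run("mitigation", self.mitigation_ops, operation, cluster_id)
```

In this sketch the caller never touches a cluster directly. It can only name an operation that was registered up front, and every call leaves an entry in `audit_log`.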
Slackbot logic. The bot listens for incidents generated by systems (e.g., Zendesk tickets). Upon detection, it:
- Fetches information from dashboards, logs and the OpsAPI.
- Gathers real-time metrics like CPU/memory utilization, database status and cluster health.
- Automatically runs pre-defined runbooks (e.g., debug scripts or custom workflows).
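
The flow above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the bot's real code: `Incident`, `build_incident_report` and the fetcher/runbook callables are hypothetical names, the data sources are stubs supplied by the caller, and the function only builds the message text instead of posting it to Slack.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A check takes a cluster ID and returns a short, human-readable finding.
Check = Callable[[str], str]


@dataclass
class Incident:
    incident_id: str
    cluster_id: str
    summary: str


def _collect(section: str, checks: Dict[str, Check], cluster_id: str) -> List[str]:
    """Run each check; a failing source is reported instead of aborting the rest."""
    lines = [f"{section}:"]
    for name, check in checks.items():
        try:
            lines.append(f"- {name}: {check(cluster_id)}")
        except Exception as exc:
            lines.append(f"- {name}: unavailable ({exc})")
    return lines


def build_incident_report(incident: Incident,
                          fetchers: Dict[str, Check],
                          runbooks: Dict[str, Check]) -> str:
    """Gather diagnostics and runbook output into one message for the responder."""
    lines = [f"Incident {incident.incident_id} on {incident.cluster_id}: {incident.summary}"]
    lines += _collect("Diagnostics", fetchers, incident.cluster_id)
    lines += _collect("Runbooks", runbooks, incident.cluster_id)
    return "\n".join(lines)
```

Catching errors per source is a deliberate choice in this sketch: a dashboard that is down during an incident should not prevent the responder from seeing everything else.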
Dashboards and internal APIs. The incident bot leverages insights from existing dashboards and internal APIs to provide a comprehensive view of the incident.

With the incident bot in place, the process has transformed:
- Time savings. Engineers now receive actionable insights immediately, cutting initial response times by 15–20 minutes.
- Efficiency. The bot automates redundant tasks like log collection, reducing cognitive overhead.
- Accuracy. Standardizing information gathering and troubleshooting steps minimizes human error.
- Security. All actions are executed via audited APIs, eliminating the need for risky root-level access.

Currently in progress: Continuous improvements

Building the incident bot was just the beginning. Over the last two months, we’ve begun incorporating AI to make the system even smarter. Here are some of the improvements we’re actively working on:

Semantic search. Using AI-powered semantic search, the bot can better match incidents to historical cases and provide context-specific runbooks or troubleshooting recommendations.
Enhanced diagnostics. By analyzing patterns in incidents, the bot can surface deeper insights into systemic issues — helping us identify root causes faster.
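
As a rough illustration of matching a new incident to historical cases, the sketch below ranks past incidents by cosine similarity. A real semantic search would use embeddings produced by a model; here a simple word-count vector stands in so the example runs with the standard library alone. The function names and the shape of the history (description mapped to a runbook name) are assumptions made for the example.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Counter:
    """Stand-in embedding: lowercase word counts (a real system would use a model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query: str, history: Dict[str, str],
                 top_k: int = 1) -> List[Tuple[str, float]]:
    """Return the runbooks of the top_k most similar past incidents.

    `history` maps a past incident description to the runbook that resolved it.
    """
    query_vec = embed(query)
    scored = [(runbook, cosine(query_vec, embed(description)))
              for description, runbook in history.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]
```

Swapping `embed` for a model-backed embedding function would leave the ranking logic unchanged, which is the main reason for separating the two steps here.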

Future thoughts: Natural language troubleshooting

Looking ahead, we’re exploring ways to further simplify the incident response process. One of the most exciting areas is enabling the bot to handle incidents using natural language commands. Imagine an engineer being able to type, “Check rebalance processes for cluster X” directly in Slack, and the bot instantly fetching and analyzing the required data.
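
One very simple way to picture this is a rule-based parser that maps a chat message to an intent and a cluster name. This is only a sketch under assumptions: a production version would more likely rely on a language model than on regular expressions, and the intent names and patterns below are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Command:
    intent: str
    cluster_id: str


# Hypothetical phrase-to-intent rules; each pattern captures the cluster name.
_RULES = [
    ("check_rebalance",
     re.compile(r"check rebalance\w* .*cluster (\S+)", re.IGNORECASE)),
    ("show_process_list",
     re.compile(r"(?:show|check) process ?list .*cluster (\S+)", re.IGNORECASE)),
]


def parse_command(message: str) -> Optional[Command]:
    """Return the first matching command, or None if the message is not understood."""
    for intent, pattern in _RULES:
        match = pattern.search(message)
        if match:
            # Drop trailing punctuation so "cluster X." still yields "X".
            return Command(intent, match.group(1).rstrip(".,!?"))
    return None
```

Returning `None` for anything unrecognized keeps the bot from guessing at what the engineer meant, which matters once commands can trigger real operations.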

In addition to this, we aim to:

- Use machine learning to predict incidents before they occur.
- Build richer integrations with other tools like Zendesk and PagerDuty to support logical workflows.
- Fully automate certain types of incident resolution (where safe to do so).

At SingleStore, the journey to improve incident resolution has been both challenging and rewarding. From manual processes to a fully integrated incident bot agent, we’ve come a long way. However, we’re not stopping here — our goal is to continue pushing the boundaries of what’s possible with automation.