In the race to accelerate digital transformation initiatives, organizations are encountering more incidents, more downtime, and longer resolution times. In fact, 90.4% of organizations saw an increase in incidents since the pandemic began, according to a recent Transposit report.
In a completely digital economy, service downtime takes a higher toll. For ITOps teams, working remotely also requires new strategies to organize a team’s response to issues. Amid rising incidents, 94.1% of organizations have increased their focus on site reliability engineering (SRE) practices.
Transposit just released their 2021 State of DevOps Automation report, which surveyed over 500 IT, DevOps, and SRE professionals across the U.S. The report sheds light on the state of remote work as it relates to incident response processes, reveals increased demand for SREs and highlights the automation they’re using to narrow the resolution time gap. Below, we’ll explore the report’s key takeaways and discover what sort of DevOps automation is trending.
Remote Work Slows Incident Response
Try as they might, videoconferencing and group chat solutions like Zoom and Slack just can’t compete with in-person collaboration. According to the survey, 93.6% of ITOps and software teams reported it takes longer to resolve incidents while working remotely.
Since the pandemic began, 97% of teams have adopted remote work, and 90.4% have experienced increased service incidents. It doesn’t bode well that incidents are increasing as incident response time lags.
The report findings frame remote work as a top hurdle to resolving issues: 46.1% of respondents said that not being physically in the same room to collaborate on fixing issues was a top concern. But remote work is only one element; others cite changing processes and a lack of visibility into who’s responsible for maintaining code as additional challenges.
Incident Response Hangups
To address these concerns, organizations are investing in incident response strategies. The report found that 56.4% of companies are adopting communication and collaboration tools, with automation tools at 52.3% and integration tools at 43.9%.
Though the report showed greater investment in those tools, it also identified a persistent communication gap. When conducting postmortems, it can be challenging to correlate human action to incident responses. More than half (57.4%) of respondents felt it’s hard to piece together human actions and team communication during incidents. This context is essential to make sense of mean-time-to-response (MTTR) metrics.
Interestingly, the report found 96.4% of respondents believe mining insights from human data, like Slack channels or group emails, could help improve incident responses. It appears that having a clearer way to harvest and query incident-related data throughout the entire response process is a universal concern.
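To make the idea of querying incident-related data concrete, here is a minimal Python sketch that merges chat messages and monitoring alerts into one chronological timeline and searches it by keyword. It is an illustration only and is not taken from the report: the `TimelineEntry` type, the `build_timeline` and `search` helpers, and all sample data are hypothetical, and a real setup would pull entries from exported chat logs or a monitoring tool rather than hard-coded lists.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimelineEntry:
    timestamp: datetime
    source: str  # hypothetical labels, e.g. "chat" or "alert"
    author: str
    text: str


def build_timeline(*streams):
    """Combine any number of entry lists into one list ordered by time."""
    merged = [entry for stream in streams for entry in stream]
    return sorted(merged, key=lambda entry: entry.timestamp)


def search(timeline, keyword):
    """Return entries whose text mentions the keyword, ignoring case."""
    needle = keyword.lower()
    return [entry for entry in timeline if needle in entry.text.lower()]


# Made-up sample data for illustration only.
alerts = [
    TimelineEntry(datetime(2021, 3, 1, 9, 0), "alert", "monitor",
                  "Checkout latency above threshold"),
    TimelineEntry(datetime(2021, 3, 1, 9, 40), "alert", "monitor",
                  "Checkout latency back to normal"),
]
chat = [
    TimelineEntry(datetime(2021, 3, 1, 9, 5), "chat", "dana",
                  "Looking at the checkout service now"),
    TimelineEntry(datetime(2021, 3, 1, 9, 30), "chat", "lee",
                  "Rolled back the last deploy"),
]

timeline = build_timeline(alerts, chat)
for entry in timeline:
    print(entry.timestamp.isoformat(), entry.source, entry.author, entry.text)
```

Seeing human messages interleaved with system events in a single ordered view is one simple way to reconstruct who did what, and when, during a postmortem.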
SRE Goes Mainstream
As I previously reported, the SRE role continues to evolve and gain favor. The report similarly hypothesizes that, based on the survey results, the SRE title is going mainstream. Out of 295 respondents surveyed, 98% said they increased SRE practices in the last 12 months, and 62.4% said they would be expanding SRE efforts in 2021.
In 2021, SREs will likely remain a hot commodity in hiring. However, these individuals will face mounting difficulty in addressing real-time incidents across complex distributed systems. They will also likely be tasked with automating manual processes, as 51.7% of respondents reported that a lack of automation is the top cause of slow incident resolution.
Automation Goals and Barriers
Forty percent of companies surveyed now have at least one dedicated full-time staff member working to create in-house tools or bots to automate incident responses. So, what sort of automation is common in the SRE field? The report shows SREs constructing custom scripts to automate things like:
- Customer communication
- Recording MTTR metrics
- Automating infrastructure
- Corporate communication
- Auto-generating runbooks
- Creating or updating tickets
But automating these processes is not all that easy. DevOps professionals collectively cite a lack of documentation as a top barrier to automation. By replacing tribal knowledge with up-to-date, shared documentation, teams could likely unlock some quick gains. So, make documentation accessible, and provide sample scripts that can be copied and pasted.
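As an example of the kind of short, shareable script such documentation might include, the Python sketch below records an MTTR figure from a list of incidents. It rests on stated assumptions: MTTR is treated here as the average time between an incident being opened and being resolved, which may differ from the definition a given team or the report uses, and the `mean_time_to_resolution` function and the sample incidents are hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average the opened-to-resolved duration over (opened, resolved) pairs."""
    durations = [resolved - opened for opened, resolved in incidents]
    if not durations:
        raise ValueError("at least one resolved incident is required")
    # Start the sum from a zero timedelta so durations add up correctly.
    return sum(durations, timedelta()) / len(durations)


# Made-up incidents for illustration: one took 30 minutes, one took 90.
incidents = [
    (datetime(2021, 3, 1, 9, 0), datetime(2021, 3, 1, 9, 30)),
    (datetime(2021, 3, 2, 14, 0), datetime(2021, 3, 2, 15, 30)),
]

mttr = mean_time_to_resolution(incidents)
print(f"MTTR: {mttr.total_seconds() / 60:.0f} minutes")
```

In practice the timestamps would come from a ticketing or paging system instead of a hard-coded list, but the calculation itself stays this small.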
SRE and Automation to the Rescue
The study demonstrates how the remote economy has increased both the number of incidents and the mean response times to address them. To fill this gap, organizations increasingly turn to site reliability engineering practices to automate incident responses. By leveling up DevOps with SRE, teams could reduce errors and improve overall availability as development teams progressively deliver new features.