Bootstrapping DevOps for the Internet Computer

All startups face unique sets of challenges and issues at each stage of their existence. The DFINITY Foundation, a Swiss-based not-for-profit with the remit to extend the functionality of the internet, is no exception. Like any new organization, we have had to evolve and adapt our DevOps and other practices in response to different obstacles.

After learning to clearly communicate our vision for an open development platform where software runs directly on the internet, we created a road map to translate this vision into a decentralized architecture with distinct features that could be built and shipped on a periodic cadence. Consisting of five milestones, the road map culminates in a public launch in Q4 of 2020—and we haven’t missed a single deadline despite the disruptions of a global pandemic.

Ahead of this year's release of our developer network, called Tungsten, a major challenge was adding a focus on operational responsibility to our existing focus on shipping features, without losing speed. It was imperative to bootstrap operational responsibility sustainably, for this release and for future ones.

We’ve all heard horror stories about operational teams that run through a checklist in a perfunctory fashion without identifying the root cause of a problem. Even worse, they are quick to give up and are resistant to launching changes, increasing the complexity of their workload.

Pushing Forward With DevOps

We were determined that this wasn’t going to happen to DevOps at DFINITY, so we embedded these core cultural touchstones into the organization.

#1: Create SLOs and SLIs for cross-functional clarity

At the end of 2019, well before there was a service to operate, we started thinking about service level objectives (SLOs) and service level indicators (SLIs). These are critical for providing a common, quantifiable vernacular for R&D and Operations teams to measure how a service should be performing.

An SLI is a clearly defined indicator for one aspect of a service’s performance, ideally focusing on user-visible service behaviors. The SLO sets the target for the indicator, as well as a window of time over which it is measured.
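To make the distinction concrete, here is a minimal Python sketch of an availability SLI checked against an SLO. It is an illustration only, not a description of DFINITY's tooling: the `Request`, `Slo`, `availability_sli` and `meets_slo` names are hypothetical, and the sketch assumes the service's behavior can be summarized as a list of timestamped requests that either succeeded or failed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Sequence


@dataclass(frozen=True)
class Request:
    """One user-visible request (hypothetical record type)."""
    timestamp: datetime
    succeeded: bool


@dataclass(frozen=True)
class Slo:
    """A target for the indicator plus the window it is measured over."""
    target: float      # e.g. 0.99 means 99% of requests must succeed
    window: timedelta  # e.g. the trailing 28 days


def availability_sli(
    requests: Sequence[Request], now: datetime, window: timedelta
) -> Optional[float]:
    """Fraction of requests in (now - window, now] that succeeded.

    Returns None when the window contains no requests.
    """
    start = now - window
    in_window = [r for r in requests if start < r.timestamp <= now]
    if not in_window:
        return None
    return sum(r.succeeded for r in in_window) / len(in_window)


def meets_slo(requests: Sequence[Request], now: datetime, slo: Slo) -> bool:
    """True if the SLI over the SLO window is at or above the target.

    Assumption: a window with no traffic counts as meeting the SLO.
    """
    sli = availability_sli(requests, now, slo.window)
    return sli is None or sli >= slo.target
```

The SLI here is only the measurement; the SLO is the pair of target and window that the measurement is judged against. Treating an empty window as compliant is a design choice made for this sketch, and a real service might prefer to flag it instead.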

To communicate the SLIs and SLOs, we presented training materials across the organization: we spoke with various engineering teams, discussed best practices at management meetings and gave an overview at an organizationwide meeting. In addition, the leadership of the R&D teams signed off on the approach.

To provide clear operational expectations as well as a forcing function, we established an early version of the SLOs as a high-priority OKR.

As an additional consideration, it is crucial that your SLOs reflect aspects of your service that are visible to the user. As Charity Majors, CTO of Honeycomb.io, noted, “Nines don’t matter if your users aren’t happy.” SLOs are not a one-and-done deal; they need to be updated as the service evolves and as your ability to capture the UX improves.

Having SLOs in place gave staffers with operational responsibility clear expectations to fulfill. If the service is performing within the SLOs (and if the SLOs are representative of user happiness), then things are running smoothly.

#2: Engineers developing the software should also help run the software in production

To bootstrap the team with operational and support capabilities, we needed to ensure that we didn’t create a siloed team lacking engineering skills. Operations in this context is a software engineering problem, which meant we needed to involve developers.

The same engineers developing the software should have skin in the game in terms of how the software is running in production. If there are problems, these engineers are in the best position to diagnose them and propose and implement the necessary fixes. Getting organizational buy-in for this took some time.

We emphasized the need for senior engineers to be able to operate the software being built. We didn’t want to fall into the trap of throwing a handful of junior engineers into the deep end without proper supervision or training, so we also established an expectation that all engineers should be able to operate the software irrespective of seniority.

We built two teams of between five and 10 developers in each of our office locations in Palo Alto, California; San Francisco; and Zurich, which were initially staffed with volunteers across the organization. Each week, one team member at each site managed primary operational responsibility for the service instead of their regular project work, and they knew that the entire staff would help if there were any problems. We plan to rotate new people into these ad hoc teams every six to nine months.
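A weekly rotation like the one described above can be sketched in a few lines of Python. This is a hypothetical illustration rather than the scheduling tool we actually used: the `primary_on_call` function, the roster and the start date are all assumptions, and the sketch ignores vacations, swaps and holidays.

```python
from datetime import date
from typing import Sequence


def primary_on_call(team: Sequence[str], rotation_start: date, day: date) -> str:
    """Return who holds primary operational responsibility on `day`.

    Assumes each person takes one full week in roster order, starting
    with team[0] in the week that begins on `rotation_start`.
    """
    if not team:
        raise ValueError("team must not be empty")
    if day < rotation_start:
        raise ValueError("day is before the rotation started")
    weeks_elapsed = (day - rotation_start).days // 7
    return team[weeks_elapsed % len(team)]


# Hypothetical roster for one site; the rotation starts on a Monday.
zurich_team = ["ana", "ben", "chloe", "dev", "eli"]
start = date(2020, 6, 1)
print(primary_on_call(zurich_team, start, date(2020, 6, 10)))  # second week: "ben"
```

Running one such rotation per site gives each location its own primary for the week, which matches the one-person-per-site arrangement described above.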

#3: Create optimally sized, communicative teams

Operational team size is also important. If the team is too small, people are on call too frequently. If the team is too large, they will be on call so infrequently that they will either forget how to operate the production stack or their knowledge will become outdated.
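The trade-off is simple arithmetic, sketched below in Python. The `on_call_load` helper is hypothetical, and it assumes an even rotation with one person on call at a time and no swaps or absences.

```python
def on_call_load(team_size: int, shift_weeks: int = 1) -> dict:
    """Estimate per-person on-call load for an evenly shared rotation.

    Assumes one person is on call at a time and everyone takes
    equal-length shifts in turn.
    """
    if team_size < 1 or shift_weeks < 1:
        raise ValueError("team_size and shift_weeks must be at least 1")
    cycle_weeks = team_size * shift_weeks  # weeks until the same person is up again
    return {
        "weeks_between_shift_starts": cycle_weeks,
        "shifts_per_year": 52 / cycle_weeks,
        "share_of_time_on_call": 1 / team_size,
    }


for size in (5, 10):
    print(size, on_call_load(size))
```

With weekly shifts, a five-person team puts each member on call every fifth week, or roughly 10 times a year, while a 10-person team halves that to about five times a year. Where "too frequent" ends and "too rare to stay current" begins is a judgment call for each team.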

We share status updates through a daily log and informal shared communication channels such as Slack. There is also a weekly meeting to relay responsibility from one person to another.

This has helped us identify capability gaps. With team members based across the world, we’ve come to acknowledge that if a problem requires a team’s expertise outside of its working hours, then rolling back first and investigating later is absolutely fine.

The DevOps team decided early on that we weren’t going to start with 24/7 support, instead providing support during business hours in Europe and the U.S., where the majority of our early-access customers are located.

We’ve been transparent to the wider organization throughout this process, calling out what is and what is not working well, and how we’re working to improve our processes.

Conclusion

We were fortunate that unprecedented global events like COVID-19 didn’t significantly affect our operations. That said, shaping our processes in person and being physically present for people when they first go on call would have been valuable, and we definitely felt the impact of losing that. But we’ve successfully bootstrapped our DevOps teams nevertheless, and we look forward to continuing to evolve as a team.