Constantly testing the cloud in systematic and chaotic ways will help companies avoid cloud failures in a time when downtime means disconnect
Professionals fortunate enough to shelter in place and work from home can thank the cloud for their ability to shift over so seamlessly during the COVID-19 pandemic.
Imagine a lockdown without the cloud. Availability and uptime of grocery order systems would depend on a company’s on-prem data centers. Streaming services would stutter and stall during peak hours, such as after dinnertime and other high-demand periods. Critical personal and professional interactions would be limited to voice calls instead of rich collaboration apps, video conferencing or dynamic social media. Many companies would simply be closed for business, unable to respond to market changes and consumers.
Now imagine a COVID-19-induced work-from-home scenario with downtime in the cloud. Teams working remotely would fall apart. Deliveries would be scrambled and lost. More importantly, emergency services would see bigger surges with fewer tools to manage them. Medical research teams around the globe currently sharing coronavirus research and data would lose contact with one another, slowing efforts to treat and cure COVID-19.
It may seem that the cloud was built just in time to help with our current predicament. But the concept and foundational systems of cloud computing go back more than 50 years. Popular, practical cloud applications developed more rapidly in the past 20 years, starting with the early iterations of Salesforce.com and more recently with web-based enterprise applications such as Google’s G Suite and Microsoft’s Office 365. It’s the simplicity of these relatively new tools that has led to their broad adoption. But this surface layer masks a mind-bogglingly complex undergirding that must be fully understood and tested to be maintained.
Clouds are strained right now. IBM endured a cloud outage of roughly two hours on June 9. Network monitoring company ThousandEyes told tech news site Fierce Telecom that the global nature of the outage suggested a control-plane issue rather than a physical failure such as a fiber cut or router failure. Microsoft’s Azure Cloud Data Center capacity in Europe has shown signs of stress as companies shift to remote work and usage of its communication and collaboration platforms, including Teams, balloons. To relieve that stress, Microsoft restricted access to free and trial accounts. Google Cloud’s Kubernetes platform and networking services on parts of the U.S. East Coast went dark for a few hours on June 29.
To avoid downtime due to cloud failure, companies must build resilience into their cloud structure by testing it in continuous and chaotic ways.
Conventional testing fundamentally ensures that key applications have been migrated or set up in the cloud and are working properly. It is a proven way to confirm that cloud systems meet design requirements and that application performance holds steady. However, conventional testing does not probe for unexpected situations. To do that, advanced operations teams turn to chaos testing, a technique pioneered by Netflix during its own migration to the cloud starting in 2011. Chaos testing creates turbulent situations that bring points of failure to light and inform design. Modern cloud systems bring hardware and software together in such complex and fluid ways that standard testing alone will never suffice. By engaging in systematic and chaotic testing, companies can develop greater resilience in the cloud and throughout their entire technology systems.
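To make the chaos-testing idea concrete, here is a minimal sketch in the spirit of Netflix's approach: periodically pick a random instance from a pool and terminate it, then watch whether the system survives. This is an illustrative toy, not Netflix's actual tooling; the instance names and the `kill` callable are hypothetical stand-ins for real orchestration calls.

```python
import random

def chaos_round(services, kill, seed=None):
    """Randomly terminate one instance from a pool, chaos-monkey style.

    `services` is a list of instance identifiers; `kill` is a callable that
    takes an instance id and terminates (or simulates terminating) it.
    Returns the chosen victim so the run can be logged and audited.
    """
    rng = random.Random(seed)          # seedable for repeatable experiments
    victim = rng.choice(services)
    kill(victim)                       # the system under test should survive this
    return victim
```

In a real setup, `kill` would call into the cloud provider's API to stop an instance, and a monitoring check afterward would verify that traffic was rerouted and no user-facing errors occurred.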
To achieve resilience, IT must find vulnerabilities not revealed in normal “happy path” testing scenarios. Resilience testing provides a glimpse into a system’s reliability and recoverability by executing a carefully planned set of disasters. Such a disaster sequence could include:
- Pulling down the network connection for 10 seconds.
- Taking a service or server offline.
- Choking middleware and watching for anomalies.
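The steps above can be sketched as a small fault-injection runner: each planned disaster is a named inject/restore pair, and after each one a health check records whether the system recovered. The fault functions here are hypothetical stubs; in practice they would drop a network interface, stop a service, or throttle middleware.

```python
import time

def run_disaster_sequence(faults, health_check, settle_seconds=1.0):
    """Execute a planned sequence of fault injections.

    `faults` is a list of (name, inject, restore) tuples; `health_check` is a
    callable returning True when the system looks healthy. Returns a report
    mapping each fault name to whether the system recovered afterward.
    """
    report = {}
    for name, inject, restore in faults:
        inject()                       # e.g. pull the network, take a server offline
        time.sleep(settle_seconds)     # let the failure propagate through the system
        restore()                      # bring the component back
        report[name] = health_check()  # did the system come back healthy?
    return report
```

A run over the three disasters listed above would produce a report such as `{"network-drop": True, "service-offline": True, "middleware-choke": False}`, immediately pointing the team at the weak spot to redesign.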
Enterprises just embarking on their migration to cloud must first study and choose a service model that fits best. Firms looking to migrate a whole workload to the cloud should choose the infrastructure-as-a-service (IaaS) option, which allows teams to rapidly orchestrate their test environments on the cloud and quickly set up systems for storage, backup and recovery.
Organizations already in the cloud or partly migrated must take time to understand their cloud service models. This could prompt the question: Is this model right for my organization in the years ahead?
While operating in the cloud, these companies can start planning testing scenarios to gauge their systems’ resiliency. In doing so, companies should set an objective and a time window for running the tests. The results of these tests then help inform the requirements design phase of their cloud setup.
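One way to make the objective and time window explicit, so fault injection never runs outside the agreed slot, is a small test-plan record. This is a hypothetical sketch of that planning step, not a prescribed format:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class ResilienceTestPlan:
    """A resilience test with an explicit objective and time window."""
    objective: str        # what the test is meant to demonstrate
    start: datetime       # when fault injection may begin
    duration: timedelta   # how long the window stays open

    def is_open(self, now: datetime) -> bool:
        """True only while `now` falls inside the agreed test window."""
        return self.start <= now < self.start + self.duration
```

A runner would consult `is_open` before each injection, and the `objective` field gives reviewers a plain-language statement of what a pass looks like when the results feed back into requirements design.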
With all systems and applications in a virtual cloud context, future digital transformations will become more fluid. Properly tested clouds also handle demand spikes more readily, whether from the viral adoption of an application or a rapid relocation or expansion of the workforce.
To become resilient, companies must create resilient IT systems. These systems will rely partly or wholly on cloud infrastructure. Cloud infrastructure grows more resilient through constant, systematic and chaotic testing. Consistent, creative testing delivers value of its own, allowing businesses themselves to become more flexible and resilient enterprises.