Automation has become a necessity for managing systems in today’s dynamic, agile cloud-based environments. Operators aim to optimize for efficiency and minimalism: to do more with less. The same line of thinking should be applied to managing the network or system infrastructure.
While network operators face the same challenges seen in compute environments, they appear to have the advantage of fewer pieces to manage. However, with the trend toward building more densely utilized CLOS fabrics, each network node now has 10 to 20 times the number of servers under it. Even though there are fewer network devices than servers to manage, this multiplier means each network node has a higher potential impact. Because an outage on a single node carries such high risk, we can’t accept the notion that fewer devices justify less interest in automation. Automating these systems should be prioritized, spearheading the move toward fully autonomous operation and lowering the risks of day-to-day operations.
What’s the Holdup?
In the past decade, the transition to mobile has required applications to become flexible enough to move resources between sites and locations on demand. Led by hyperscale operators, this strategy has split application monoliths into microservices, driving virtualization and containerization to maximize portability. East-west bandwidth requirements grow as applications become more distributed. The network must therefore deliver higher throughput and reliability, which makes CLOS networks an easy choice. These operators use off-the-shelf components, with reliable (and cheaper) merchant silicon at the core. However, this strains legacy network management methods by increasing the number of devices from two to four distribution/core devices to 48-plus CLOS nodes. Hyperscalers addressed this issue with homegrown network operating systems (NOS), built with network automation to make the network easier to manage at scale.
But when small to midsized organizations try to imitate the hyperscale operators, they are constrained by legacy NOS architectures. We’ve been hindered by an entrenched practice of manually driven networks, with scripts written on top of the command-line interface (CLI). In the best-case scenario, we have tools such as Ansible or Puppet, which can bring automation and some orchestration to the network. In the worst case, network managers resort to simply hiring more engineers. As a result, there remains a substantial gap between network and compute infrastructure.
To illustrate this gap between the industry’s current state and where we should be, let’s look at this graphic, which shows the levels of service automation maturity:
Most modern networks operate between Level 0 and Level 2, meaning operators are barely scraping by with a combination of zero-touch provisioning and screen-scraping scripts (and more engineers). They react to changes as they occur and cannot verify that automation worked, other than by logging in and manually reviewing the result. That approach does not scale: it leads to frequent errors and, eventually, to immense inconsistencies between the design and the actual implementation.
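To make the verification problem concrete, here is a minimal sketch of what checking the result of automation could look like: comparing the intended design against the state reported by the devices. It is only an illustration. The device names, the setting keys and the `find_drift` helper are all hypothetical, and a real network would gather the actual state from a structured API rather than from hand-written dictionaries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Drift:
    """One setting whose actual value differs from the design."""
    device: str
    key: str
    intended: object
    actual: object


def find_drift(intended, actual):
    """Compare intended and actual settings for every designed device.

    Both arguments map a device name to a dict of settings. Devices that
    appear only in `actual` are ignored in this sketch.
    """
    drifts = []
    for device, want in intended.items():
        have = actual.get(device, {})
        # Look at every key present on either side, in a stable order.
        for key in sorted(set(want) | set(have)):
            if want.get(key) != have.get(key):
                drifts.append(Drift(device, key, want.get(key), have.get(key)))
    return drifts


# Hypothetical design and hypothetical state reported by two leaf switches.
intended = {
    "leaf1": {"mtu": 9216, "bgp_asn": 65001},
    "leaf2": {"mtu": 9216, "bgp_asn": 65002},
}
actual = {
    "leaf1": {"mtu": 9216, "bgp_asn": 65001},
    "leaf2": {"mtu": 1500, "bgp_asn": 65002},
}

for d in find_drift(intended, actual):
    print(f"{d.device}: {d.key} is {d.actual}, design says {d.intended}")
```

Run after every change, a check like this replaces the manual login-and-review step with something that can scale to every node in the fabric.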
Where Should Operators Look for the Solution?
Cloud-native solutions provide a significantly better experience, both in maturity and agility. They have progressed far enough in using software to manage other software that they operate at Level 3 or higher on the above scale. We can leverage this for the network, but it has to start with discarding obsolete architecture and modernizing the starting point: the network OS.
Microservices, containerization and orchestration (with Kubernetes) are the foundation of cloud-native architectures, and they directly address the challenge of automation. Looking at service fabric mesh in the public cloud, for example, shows how network and application services can be automated with the same level of maturity. By utilizing Kubernetes and bringing modern containerized microservices architectures into the NOS, we unite NetOps and DevOps. The network is essentially just a complex distributed application and can be automated like other similar applications. We need to widely adopt autonomous networking and accept that a modern NOS architecture is required to build networks that match the level of automation maturity we see with cloud-native applications.
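The idea of treating the network as a distributed application can be sketched with the declarative pattern that Kubernetes-style orchestration is built around: declare the desired state, observe the actual state, and let software close the gap. The example below is a toy, in-memory illustration of that pattern and not the implementation of any particular NOS or controller. The resource names, the `vni` field and the `plan`, `apply_actions` and `reconcile` functions are all hypothetical.

```python
def plan(desired, observed):
    """Return the actions needed to move the observed state to the desired state."""
    actions = []
    for name, spec in desired.items():
        if name not in observed:
            actions.append(("create", name, spec))
        elif observed[name] != spec:
            actions.append(("update", name, spec))
    for name in observed:
        if name not in desired:
            actions.append(("delete", name, None))
    return actions


def apply_actions(observed, actions):
    """Apply planned actions to the in-memory stand-in for the network."""
    for verb, name, spec in actions:
        if verb == "delete":
            observed.pop(name, None)
        else:
            observed[name] = dict(spec)


def reconcile(desired, observed, max_rounds=5):
    """Plan and apply until nothing is left to change.

    Returns the number of rounds in which changes were applied.
    """
    for rounds_applied in range(max_rounds):
        actions = plan(desired, observed)
        if not actions:
            return rounds_applied
        apply_actions(observed, actions)
    if plan(desired, observed):
        raise RuntimeError("state did not converge")
    return max_rounds


# Hypothetical desired configuration and a drifted observed state.
desired = {"vlan100": {"vni": 10100}, "vlan200": {"vni": 10200}}
observed = {"vlan100": {"vni": 10999}, "vlan300": {"vni": 10300}}

print(plan(desired, observed))
reconcile(desired, observed)
print(observed == desired)
```

The operator only edits `desired`; the loop works out what has to change. That separation is what lets the same tooling and review process serve both application and network teams.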