Today, CIOs face the challenge of making enterprise technologies perform on par with offerings from consumer-facing technology leaders such as Facebook or Netflix, where technology and services are seamlessly available with no performance lag. These expectations are not always aligned with the realities of managing complex hybrid enterprise systems that were not necessarily designed to deliver this type of performance.
At the same time, IT operations teams supporting new-age “digital” business services are dealing with on-demand resources that are supported by software-defined data centers extending into the public cloud through serverless computing platforms. The result is a “new generation hybrid enterprise” that must deliver the functionality of, and operate as, real-time infrastructure.
Lexmark, a U.S.-headquartered global company that manufactures laser printers and imaging products, wanted to achieve zero-latency operational performance and enhance its business processes to achieve incident or error avoidance.
This article details Lexmark’s IT objective and the solution implemented by Wipro Ltd., a global information technology, consulting and business process services company.
Problem: Lexmark required a technology solution that would provide visibility into its core business process—managed print services. The solution would also need to factor in other supporting systems and functional processes. The objective was to achieve operational excellence by minimizing errors and risks.
Solution: To address this critical IT operations problem, Wipro applied unique design principles to create an operationally simple, scalable and focused solution that gives IT and business process owners a common understanding of how to achieve incident avoidance.
Wipro built a new way of visualizing operations management, based on the fundamentals laid out by IT operational analytics (or AI operations, as it is called now).
Working only with the tools and data that Lexmark already owned, Wipro and Lexmark created a system that measures the heartbeat of Lexmark’s managed print services business. That heartbeat is synchronized to the printer data collection systems, with the thresholds determined entirely by machine learning techniques. Every underlying system is arranged in a “server room.” Only out-of-variance conditions are displayed in the main visualization panel, with the root cause just a single click away.
This paper provides insights into the challenges IT enterprises face in delivering on-demand solutions to clients, similar to those offered by digitally evolved B2C companies.
Digital enterprises that use hybrid environments must respond in real time and need to achieve incident avoidance, rather than incident management.
If we look behind the scenes at operation command centers today, we see operations that struggle to keep pace. Very few operations have been built from the ground up. In the enterprise world, resolution times for Severity 1 incidents (those categorized as high impact by business users) still run to a couple of hours when IT support is not in a position to find the root cause of the problem. The culprit is often the lack of a proper configuration management database (CMDB), insufficient logging, or a lack of application discovery and dependency mapping. Every IT professional can recognize these failings, yet projects to correct them never seem to get the necessary capital appropriation. Clearly, a new approach is needed.
Identify the Business Process
The process should start by identifying the key driving force, or heartbeat, of the core business operation. Depending on the industry, this heartbeat might be customer orders, service requests or vendor purchase orders. All other systems are connected to this heartbeat. To select the right attribute to monitor as the heartbeat, start with business analysts who have a good feel for the day-to-day struggles of the operation. These analysts often have a series of scripted reports they run to track system health. More advanced analysts might even have automated these reports with alerts. By leveraging this information and using data discovery techniques, we found that the values these reports track can be configured as business KPIs.
In the case of Lexmark, Wipro streamed the KPI values in real time and used machine and wire data to broadcast the underlying monitoring components by using an existing streaming analytics platform that is supported by AI and machine learning techniques. We found that the data needed to assess process health is time sensitive, and the view normally displays less than 24 hours of data at any given time.
The platform visualization shown above tracks the creation of sales orders for a business unit. The green line represents the flow of orders through the system. Any anomaly in an underlying system is reflected directly on this flow. This is achieved by streaming analytics and AI techniques that use the business KPIs.
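As a rough illustration of how an anomaly in the order flow might be flagged as values stream in, the following minimal Python sketch compares each new KPI reading against a rolling baseline. The class name, window size, z-score limit and sample values are all hypothetical; the article does not describe the actual technique used by the streaming analytics platform, so a simple rolling z-score stands in for it here.

```python
from collections import deque
from statistics import mean, pstdev


class HeartbeatMonitor:
    """Flags KPI readings that deviate sharply from a rolling baseline (illustrative only)."""

    def __init__(self, window=12, z_limit=3.0):
        self.history = deque(maxlen=window)  # most recent healthy readings
        self.z_limit = z_limit               # assumed sensitivity, not from the article

    def observe(self, value):
        """Return True if the reading is anomalous relative to recent history."""
        anomalous = False
        if len(self.history) == self.history.maxlen:
            mu = mean(self.history)
            sigma = pstdev(self.history)
            if sigma > 0:
                anomalous = abs(value - mu) / sigma > self.z_limit
            else:
                anomalous = value != mu
        # Only healthy readings update the baseline, so an outage never becomes "normal".
        if not anomalous:
            self.history.append(value)
        return anomalous


# Hypothetical sales orders created per interval; the seventh interval collapses.
monitor = HeartbeatMonitor(window=5, z_limit=3.0)
orders_per_interval = [100, 104, 98, 101, 97, 102, 20, 99]
flags = [monitor.observe(v) for v in orders_per_interval]
print(flags)
```

Keeping anomalous readings out of the baseline is a design choice in this sketch: it prevents a prolonged problem from gradually widening the band until it no longer registers.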
Apply the Architecture Overlay
The next step is to subordinate the supporting infrastructure to the business process heartbeat. This feat is accomplished without complex business rules or sophisticated orchestration engines. The result, which we call the “server room,” is a view of the entire infrastructure on one web page, laid out in a lattice chart format that allows for easy comparison. Each of these underlying components has a direct impact on the business process.
By applying the same streaming analytics and AI/ML techniques that identified the anomaly in the business process, we can identify what is causing that anomaly and quickly see the problem areas.
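One way to picture the step from a heartbeat anomaly to its likely cause is to score every underlying component against its own baseline and list the worst offenders first. The sketch below is an assumption-laden illustration: the component names, baseline figures, the cutoff of 2.0 and the function `rank_suspects` are invented for this example and do not come from the Lexmark implementation.

```python
def rank_suspects(readings, baselines, cutoff=2.0):
    """Return components outside their normal band, worst deviation first.

    readings:  {component: current value}
    baselines: {component: (mean, standard deviation)} learned from healthy periods
    cutoff:    hypothetical number of standard deviations treated as out of variance
    """
    suspects = []
    for name, value in readings.items():
        mu, sigma = baselines[name]
        score = abs(value - mu) / sigma if sigma > 0 else 0.0
        if score > cutoff:
            suspects.append((name, round(score, 2)))
    return sorted(suspects, key=lambda item: item[1], reverse=True)


# Hypothetical components behind the order-creation heartbeat.
baselines = {
    "db_latency_ms": (40.0, 5.0),
    "queue_depth": (200.0, 50.0),
    "app_cpu_pct": (55.0, 10.0),
}
readings = {"db_latency_ms": 90.0, "queue_depth": 380.0, "app_cpu_pct": 58.0}

suspects = rank_suspects(readings, baselines)
print(suspects)
```

Components within their normal band are omitted entirely, which mirrors the idea of displaying only out-of-variance conditions.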
Color coding helps the technician quickly understand the severity of the event. Blue indicates the system performance has not yet violated the set threshold, but an underlying infrastructure issue exists and needs to be proactively addressed. Amber indicates the system has violated the warning threshold. Red indicates the system has violated the critical threshold, a condition that by design we will never allow to happen.
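The color scheme can be expressed as a small classification function. This is only a sketch: the article does not say how the blue condition is detected, so it is modeled here, as an assumption, as a reading that has reached a set fraction of the warning threshold. The function name, the `early_fraction` parameter and the sample thresholds are hypothetical.

```python
def severity_color(value, warning, critical, early_fraction=0.8):
    """Map a reading to a display color, assuming higher values are worse.

    early_fraction is an assumed stand-in for "an underlying issue exists
    but no threshold has been violated yet".
    """
    if value >= critical:
        return "red"      # critical threshold violated
    if value >= warning:
        return "amber"    # warning threshold violated
    if value >= early_fraction * warning:
        return "blue"     # below threshold, but worth proactive attention
    return "normal"       # within variance; not highlighted


# Hypothetical thresholds for a utilization metric, in percent.
for reading in (50, 60, 75, 95):
    print(reading, severity_color(reading, warning=70, critical=90))
```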
The user can hover over any given point and a callout box will display the actual level, a brief note on what has gone wrong, and the timestamp.
Analysts can now instantly focus their efforts on the specific systems that are in an error state, even if an event has not yet created an incident.
Machine Learning to Define the Thresholds for an Anomaly
What is normal behavior? This is often the most difficult question to answer because common monitoring and management processes rely on a single data value that was set during implementation. But thresholds must be extremely flexible to capture the ever-changing nature of the business process. Batch jobs that run at midnight, end-of-month closing activities, shift changes and a host of other potential variables must be considered when determining the right thresholds.
For any component supporting the business process, we need to understand what is normal for right now. To determine that at Lexmark, we applied machine learning to dynamically set the thresholds for each sub-system at the month, day and hour level based on the healthy business KPIs we established.
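To make the idea of "normal for right now" concrete, the sketch below learns a separate acceptable band for each combination of month, day of week and hour from readings taken during healthy periods. It uses a plain mean and standard deviation as a stand-in; the article does not name the machine learning technique actually applied at Lexmark. The function names, the multiplier `k` and the sample data are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, pstdev


def learn_thresholds(samples, k=3.0):
    """Build {(month, weekday, hour): (lower, upper)} from healthy (timestamp, value) pairs."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[(ts.month, ts.weekday(), ts.hour)].append(value)
    thresholds = {}
    for key, values in buckets.items():
        mu, sigma = mean(values), pstdev(values)
        thresholds[key] = (mu - k * sigma, mu + k * sigma)
    return thresholds


def is_normal(thresholds, ts, value):
    """Check a reading against the band learned for its month, weekday and hour."""
    band = thresholds.get((ts.month, ts.weekday(), ts.hour))
    if band is None:
        return True  # no baseline for this time slot yet; assume healthy
    lower, upper = band
    return lower <= value <= upper


# Hypothetical job counts on four Mondays: heavy at midnight (batch), light at 10:00.
healthy = []
for day, midnight_load, morning_load in [(1, 900, 100), (8, 950, 110), (15, 1000, 90), (22, 950, 100)]:
    healthy.append((datetime(2021, 3, day, 0), midnight_load))
    healthy.append((datetime(2021, 3, day, 10), morning_load))

thresholds = learn_thresholds(healthy)
print(is_normal(thresholds, datetime(2021, 3, 29, 0), 940))   # expected at midnight
print(is_normal(thresholds, datetime(2021, 3, 29, 10), 940))  # out of variance mid-morning
```

The same reading is accepted during the midnight batch window and rejected mid-morning, which is the behavior a single static threshold cannot provide.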
Incident Avoidance, Not Incident Management
To fully understand the potential of this approach, it is necessary to change the way system analysts interact with the system. IT professionals who have been doing system support for several years may be conditioned to react to the work in their queue. A well-honed incident management process is the enemy of proactive action that can prevent emergencies.
By aligning the underlying system activity and thresholds with the business process, the IT professional can truly understand the whole environment, not just their individual technical component. The most effective analysts will extend their knowledge into other business areas. At Lexmark, we discovered that we were able to solve “intractable” performance issues by simply adding hardware or capacity.
Integrated IT operations is not a goal; it is the means by which digital enterprises can be managed in real time. The path to real-time management runs through service capacity guarantees, performance and availability management, and anomaly detection in the context of the business process to identify issues well before an incident occurs. We advise starting small and building an anomaly detection system for the heartbeat of the process that enables you to see both the forest and the trees.
This article was co-authored by Murthy Malapaka, general manager, Transformation Services Cloud and Infrastructure Services at Wipro Ltd. He helps enterprises digitally rearchitect their operations to run on hybrid IT environments to achieve incident avoidance instead of incident management. He has assumed various technology leadership roles across application and infrastructure architecture domains, specializing in availability and reliability. Murthy has been providing consulting services to CIOs and CTOs in their journey from client-server to on-demand infrastructure services.