Bringing Machine Learning into your Technology Operations

March 5, 2019

by: Kiran Chitturi

We’ve all been there at some point: a production incident causes us and our team a major headache. What follows is some combination of looking at alerts, frantically digging through logs to understand what’s going on in our systems, and deciding how to respond, all while keeping customers updated on the status.

This kind of approach raises many challenges: how to make it accurate, efficient and scalable for larger operations. It’s no surprise we’re always looking for better ways to operate our technology infrastructure.

Artificial Intelligence (AI) / Machine Learning (ML) is among the trendiest areas of technology, and many startups and mature software vendors are using it to try to address Operations’ needs. Many vendors offer AI/ML for the Operations space, but few cover multiple operations categories and even fewer cover all of them. It is still early days in this industry, and there are plenty of opportunities to apply ML.

When we see AI, the reference is usually to the broader notion of ‘computers performing tasks that usually need human intelligence.’ ML, on the other hand, is an approach to AI that “learns” intelligent behaviors from data – for example, progressively improving performance on a specific task, like helping detect, categorize or predict.

In Operations, that could mean faster searches, getting to root causes quicker or taking intelligent actions. Machine learning uses models to do this.

Key Observations for AI/ML

Notably, the biggest free enterprise use of AI/ML has been industry-specific, i.e., vertical markets/solutions. For example, the NFL applies ML to analyze plays and predict what the next one will be, while Amazon uses ML for product recommendations, fulfillment centers and automated grocery stores.

Within Technology Operations, AI/ML is currently most heavily involved in incident/problem management. Companies in the Monitoring space are moving towards being full-stack – monitoring to search/analytics – expanding from the Infrastructure side into the applications side and vice versa. AI/ML algorithms and models for Operations are becoming better known.

Operational Categories and Opportunities

In many organizations, when we take a look at Tech Operations, we can generally categorize their activities as follows:

  • Service Desk
  • Logging & Monitoring
  • Automation - Infrastructure / Configuration Management
  • Business Intelligence / Analytics
  • Cost Optimization / Capacity management

Below, I will highlight opportunities in each of these categories and how Sungard Availability Services (Sungard AS) constantly improves our operations.

In the Service Desk space, some of the key advancements from applying ML have been in categorizing requests and then routing them to the right assignee. One way to jumpstart this process is to use a SaaS solution, so that no new infrastructure is needed.

It’s important to remember that training data needs to be of good quality. One simple example: the data set we select to use for training needs to have the right ‘category’ field for each item. You will get the most bang for your buck in this venture if you increase integration maturity with upstream (monitoring) and downstream (automation) systems.

You could also consider developing point ML solutions that can be integrated into your Service Desk Platform, like parsing and routing emails or self-service chatbots, smart search/recommendations, etc.
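As a rough illustration of categorizing and routing requests, here is a minimal sketch of a word-count (Naive Bayes) classifier written with only the Python standard library. The sample tickets, category names and routing table are all hypothetical; a real deployment would train on an export from your own Service Desk platform, where each item carries a reliable ‘category’ field, and would likely use an established ML library rather than hand-rolled code.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled tickets: (text, category). Real training data would
# come from your service desk and needs a trustworthy 'category' field.
TRAINING_TICKETS = [
    ("cannot log in password expired", "access"),
    ("reset my password please", "access"),
    ("account locked after failed login", "access"),
    ("disk full on database server", "infrastructure"),
    ("server cpu usage very high", "infrastructure"),
    ("database server not responding", "infrastructure"),
]

# Hypothetical routing table from category to assignee queue.
ROUTES = {"access": "identity-team", "infrastructure": "platform-team"}


def train(tickets):
    """Count words per category for a multinomial Naive Bayes model."""
    word_counts = defaultdict(Counter)
    category_counts = Counter()
    for text, category in tickets:
        category_counts[category] += 1
        word_counts[category].update(text.lower().split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, category_counts, vocabulary


def classify(text, model):
    """Return the most likely category using Laplace-smoothed log probabilities."""
    word_counts, category_counts, vocabulary = model
    total_tickets = sum(category_counts.values())
    best_category, best_score = None, -math.inf
    for category, n_tickets in category_counts.items():
        # Start from the log prior, then add one log likelihood per word.
        score = math.log(n_tickets / total_tickets)
        total_words = sum(word_counts[category].values())
        for word in text.lower().split():
            count = word_counts[category][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


def route(text, model):
    """Classify a ticket and look up the queue it should be assigned to."""
    category = classify(text, model)
    return category, ROUTES[category]


model = train(TRAINING_TICKETS)
print(route("user forgot password and cannot login", model))
```

With this toy data the sample ticket is routed to the hypothetical identity-team queue. The quality of the result depends entirely on the quality of the labels, which is why the training data matters so much.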

In the Logging and Monitoring space, you should look at your existing platforms and vendors and assess the gaps, if any, and the ML opportunities those platforms provide. You may be able to leverage out-of-the-box (Black Box ML) capabilities like correlation rules, anomaly detection and root cause analysis – which can lead to smarter alerts, such as categorized alerts or alerts on abnormal dips in user traffic, periodic fluctuations, etc.
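To make “anomaly detection” a little more concrete, here is a minimal sketch of one simple technique: flagging a data point when it sits too many standard deviations away from the mean of the points just before it (a rolling z-score). This is only an illustration, not what any particular vendor’s Black Box ML does; the function name, window size, threshold and traffic numbers are all assumptions.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return indexes of points that deviate more than `threshold`
    standard deviations from the mean of the preceding `window` points."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: treat any change as anomalous.
            if values[i] != mu:
                anomalies.append(i)
            continue
        z_score = (values[i] - mu) / sigma
        if abs(z_score) > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical requests-per-minute samples with one sharp dip.
traffic = [100, 102, 98, 101, 99, 100, 103, 40, 101, 99]
print(find_anomalies(traffic))  # -> [7]
```

Only the dip to 40 is flagged. Note that a simple window like this would also flag expected periodic fluctuations, which is one reason production tools use more sophisticated, seasonality-aware models.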

For a more hands-on approach, with all the data your systems store, you can look at ways to apply ML to perform analytics and gain actionable insights. You will have to either use a vendor with an Open Box ML approach or build your own, leveraging some of the ML pipelines that public cloud providers offer.

AI/ML is not directly leveraged in the Automation of Infrastructure / Configuration Management space, but benefits come from the previous (upstream) categories (Service Desk and Monitoring) enabling AI/ML and integrating with your automation processes. You can consider making your current service desk platform act as the hub for much of your workflow automation and implementing an Orchestration platform. If this is paired with an Analytics engine, you improve your odds of leveraging AI/ML in this space to the best possible extent.

To make this transition smoother, you can also look at point solutions in the beginning. Existing automations and workflows can use ML (with its feedback loop) to improve effectiveness and efficiency – for example, to tweak 'exceptions' levels, or adjust the level of confidence an AI bot/model has in a particular course of action.
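The feedback loop around a confidence level can be sketched very simply. In the example below, an automation only acts when a model’s confidence clears a threshold, and human feedback on each automated action nudges that threshold up or down. This is a plain heuristic rather than a learned model, and the class name, step sizes and limits are all hypothetical.

```python
class ConfidenceGate:
    """Decide whether an automation may act on a model's suggestion.

    All names and default values here are illustrative assumptions.
    """

    def __init__(self, threshold=0.80, raise_step=0.05, lower_step=0.01,
                 floor=0.50, ceiling=0.99):
        self.threshold = threshold
        self.raise_step = raise_step
        self.lower_step = lower_step
        self.floor = floor
        self.ceiling = ceiling

    def should_automate(self, confidence):
        """Act automatically only when the model is confident enough."""
        return confidence >= self.threshold

    def record_feedback(self, action_was_correct):
        """Relax the threshold slowly after good actions and tighten it
        quickly after bad ones, staying within [floor, ceiling]."""
        if action_was_correct:
            self.threshold = max(self.floor, self.threshold - self.lower_step)
        else:
            self.threshold = min(self.ceiling, self.threshold + self.raise_step)


gate = ConfidenceGate()
print(gate.should_automate(0.82))   # True: 0.82 clears the 0.80 threshold
gate.record_feedback(action_was_correct=False)
print(gate.should_automate(0.82))   # False: threshold has risen to about 0.85
```

Tightening faster than relaxing is a deliberately cautious design choice: one bad automated action costs more trust than one good action earns.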

In the Business Intelligence / Analytics space, we have the ‘data wrangling,’ storage and visualization portions, each of which can leverage AI/ML in its own unique way.

  • Data Wrangling - Understanding the format of various complex data sources and mapping the metadata can be challenging and is a perfect place to deploy intelligent discovery using ML capabilities. Consider a data unification platform to help you in this journey.
  • Storage - Having strong data lake integration capabilities and taking a platform approach (using, say, AWS ML offerings) is key and acts as a true enabler for ML’s success across the whole organization. If you are looking at a public cloud provider like AWS, consider Elasticsearch when looking at Redshift as your data warehouse platform – there may be benefits in terms of better performance and more easily transferable skills.
  • Visualization – Here you can look at leveraging ML to provide context-aware suggestions that could help users spot hidden insights. You can even look at auto-generating and prioritizing analytics and insights based on the overall data set and a user’s search criteria. Further enhancements can be in generating more relevant suggestions based on factors such as user behavior and feedback, community trends, industry knowledge, etc.

In the Cost Optimization / Capacity Management space, you can use ML to right-size your workloads and to decide how and where you deploy your infrastructure. This can help you leverage Spot Instances when deploying your apps. You can also work towards predicting Spot Instance interruptions to maintain high availability. This can pave the way for using AI to self-manage workloads and hybrid environments, optimizing for cost, compliance and performance. Another way to use ML is to build spend optimization into your Agile/DevOps processes.
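Before any ML is involved, right-sizing usually starts from a statistical baseline. The sketch below picks the cheapest instance size that covers the 95th percentile of observed CPU demand plus some headroom. It is a simple heuristic, not a learned model, and the catalog of sizes, prices and sample data are entirely hypothetical.

```python
import math

# Hypothetical catalog: (name, vCPUs, hourly cost). Not real sizes or prices.
CATALOG = [("small", 2, 0.10), ("medium", 4, 0.20), ("large", 8, 0.40)]


def percentile(values, pct):
    """Nearest-rank percentile of a list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def right_size(cpu_samples_vcpus, headroom=1.2, pct=95):
    """Return the cheapest catalog size whose vCPU count covers the
    chosen percentile of demand times a headroom factor, or None."""
    needed = percentile(cpu_samples_vcpus, pct) * headroom
    for name, vcpus, _cost in sorted(CATALOG, key=lambda item: item[2]):
        if vcpus >= needed:
            return name
    return None  # nothing in the catalog is big enough


# Hypothetical vCPUs actually used by a workload, sampled over time.
samples = [1.0, 1.2, 0.8, 1.5, 2.9, 1.1, 1.3, 0.9, 1.4, 1.0]
print(right_size(samples))  # -> medium
```

An ML approach would replace the fixed percentile with a forecast of future demand, but the decision step (match predicted need to the cheapest adequate size) stays much the same.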

Summary and Key Takeaways

For Operations teams, AI/ML can mean predicting failures, faster search, getting to root cause quicker, preventive maintenance and taking intelligent actions. By taking a deep and honest look at each of the categories covered in this post, you can see the benefits that AI can bring to your team.

The key challenges in most organizations have been getting efficient access to the vast amounts of data, having the skill sets to ask the right questions and harness the appropriate data sets with cost-effective Machine Learning pipelines.

The opportunities and areas called out in each of the categories can help you address scale and optimization problems. These ways of leveraging ML, when used in combination with Automation, can take you to the next generation in Technology Operations.


Photo background from fabio on Unsplash
