AIOps is a feature

It's amazing that it’s been two months since I last wrote, but it’s been a bit busy launching new products and getting customer implementations under our belts. Since the well-oiled machine is accelerating, and additional help is inbound, there is more time in the day to analyze the market and the rapid changes occurring. There has been a lot of AI washing, even more so than 10 months ago when this was penned. As expected, AI is the new cloud (these Microsoft AI commercials are driving me crazy). If you don't have AI in your message, you are not relevant. The buyers of technology are eating up the wording even though this is far from reality, as I have explained. Generically attaching a product to a buzzword is a good marketing move, but it creates a mess for the users of technology and for the analysts who cover and define markets. We are already seeing many end users of technology asking why they need so many different "AIOps" tools.

For those who see this term but don't quite grasp its meaning: AIOps is a new-ish Gartner market which makes much sense and resonates with the executives of most IT organizations. We are all suffering from information overload and increased complexity. These are problems which computers can solve better than people. Whether you call this AI (which it's not) or machine learning, the result and impacts are game-changing for those operating digital businesses:

In this figure, the flow is to collect data, use it to engage with others internally and externally, enacting changes deemed suitable by the algorithms. This automated platform could do this unassisted and ultimately result in increased business value.

The concept of a closed loop for Enterprise IT Operations has been around for decades. Billions of dollars were spent by CA, BMC, IBM, and HP to create an entire closed-loop automation system which looks a lot like what is shown above. If anything, we should learn from our historical failures. The main difference in AIOps is that "Big Data" and "Machine Learning" now exist inside a "Platform." The problem is that the typical enterprise uses dozens of automation tools and over a dozen monitoring tools, and trying to collect this data and make sense of it has been nearly impossible. The closest we have gotten is the use of generic tools to manage events from monitoring or logs from other components in the infrastructure. So, what did Gartner do? They created 11 categories to describe this group of AIOps tools, but they are so broad that any monitoring tool can fit into this schema.

  • Historical data management: Every monitoring tool must handle historical data; this is required to build reports and dashboards or understand trends.
  • Streaming data management: Most monitoring systems use a combination of batching and streaming to collect, analyze and send monitoring data. Streaming data collection and analysis is a requirement for most APM and NPMD solutions.
  • Log data ingestion: Due to the uptake of solutions built on top of the ElasticSearch project, log analytics has become a feature of most monitoring tools. By my count, there have been at least 10 new products introduced this year which collect and analyze logs.
  • Wire data ingestion: Although packet data is a useful data source, it becomes hard to scale in public or private clouds. Nonetheless, many products out there use packet data, but in today's environments, often this data is captured on each OS instance via distributed analysis versus taps/spans, which come with additional challenges or limitations. Interestingly enough, Gartner only mentions taps/spans and not distributed collection and analytics.  
  • Metric data ingestion:  Every monitoring tool must collect time-series metrics.
  • Document text ingestion: This is less well understood in the market, since many of the log analytics tools do an excellent job of ingesting human-readable documents and applying natural language processing to that data. Many of these advances are becoming table stakes as the ElasticSearch platform now has more advanced capabilities built into it, which are continually improved.
  • Automated pattern discovery and prediction: Just two years ago very few monitoring tools had automated pattern discovery, but today many solutions have capabilities to reduce the human effort involved in problem detection. The challenge in this category is that many data sources, once ingested, lose the relationships or structure of the data. Many tools can infer them from the data again, but do so with a high number of false-positive correlations.
  • Anomaly detection: Similar to the last category this has been around for a long time but has become more common since manually managing thresholds is not scalable or sustainable. Some products on the market have had this capability for a decade or longer.
  • Root cause determination: This is probably the area in this model where the most innovation is occurring. There are very few products on the market which do this effectively. Most of the offerings are using rule-based expert systems, but these have many flaws when they see new problems they were not programmed to handle. If this problem were solved well, then we wouldn't see massive multi-hour outages occurring regularly. In the next two years, this area will evolve quickly.  
  • On-premises delivery: This one is fascinating as almost all software is becoming SaaS-delivered. Gartner is even now saying that by 2025, 80% of enterprises will migrate and then shut down their traditional data centers, versus 10% today. As those data centers close, SaaS delivery becomes the de facto method of purchasing the software used to manage applications and the infrastructure running in public or managed clouds.
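
To make the anomaly detection point above concrete, here is a minimal sketch of why hand-managed thresholds do not scale, compared with a baseline learned from recent data. It is illustrative only and not how any particular product works: the function names, the window size, the z-score limit, and the sample latency values are all assumptions made up for this example.

```python
from collections import deque
from statistics import mean, stdev


def static_threshold_alerts(values, limit):
    """Return indices where a value exceeds a fixed, hand-maintained limit."""
    return [i for i, v in enumerate(values) if v > limit]


def rolling_zscore_alerts(values, window=5, z_limit=3.0):
    """Return indices where a value deviates strongly from its recent baseline.

    The baseline is the mean and sample standard deviation of the previous
    `window` values, so nobody has to pick a limit per metric by hand.
    """
    history = deque(maxlen=window)
    alerts = []
    for i, v in enumerate(values):
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            # Skip a perfectly flat baseline to avoid dividing by zero.
            if sigma > 0 and abs(v - mu) / sigma > z_limit:
                alerts.append(i)
        # The new value joins the baseline only after it has been checked.
        history.append(v)
    return alerts


# Hypothetical response times in milliseconds with one obvious spike.
latency_ms = [100, 102, 98, 101, 99, 100, 180, 101, 99, 100]

print(static_threshold_alerts(latency_ms, 200))  # limit set too high
print(static_threshold_alerts(latency_ms, 100))  # limit set too low
print(rolling_zscore_alerts(latency_ms))         # baseline learned from data
```

With a limit of 200 the static check misses the spike entirely, and with a limit of 100 it also flags ordinary jitter, while the rolling baseline flags only the spike. Note that even this simple approach has a weakness: the spike itself enters the baseline afterwards and temporarily widens it, which is one reason real products need more than a naive z-score.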

The primary drivers of any technology are the use cases it solves or the business benefit it creates. Within AIOps the use cases vary widely, since the roles and personas are broad, with countless differing and sometimes conflicting requirements. The AIOps platform must be infinitely configurable and customizable, which makes it a jack of all trades and a master of none. Specific personae need prescriptive workflows and out-of-the-box functionality to show value quickly. The net result is that AIOps technologies are in fact features of existing products, added to meet the ever-changing data requirements. Most of the monitoring tools out there have added more data stores and capabilities to try to become this platform of choice. Adding analytics capabilities on top of the data stores, and ingesting additional data, is now a requirement. In my work running technical partnerships at AppDynamics there has been a significant uptick in integrations between products to try to fill this gap and become "the platform of choice," but in reality, they are point solutions solving one or two key use cases.

More recently, in July of 2018, there was quite a shift within Gartner around AIOps. They have re-focused on cross-domain analysis, even further broadening the use cases for the data. When a solution is too broad it can no longer meet the user requirements, and this is likely the case for AIOps. Assuming one can extract and store all of the relevant data (which is a challenge unto itself), having a generic platform interpret data without a specific model coming from the originating tool is nearly impossible. Once you lose the meaning of the data, it is generic and disconnected from the business impact. More advanced tools do not collect everything, but self-tune what and how they collect, which means there are intentional gaps in the data based on the design of the tool. AIOps systems cannot process data from tools which do not emit all of their data, and the reason not all data can be captured or exported is so that those tools can scale. The intelligence of more advanced monitoring tools is distributed within a system (agents, aggregators) and not merely centralized in the backend. Similarly, IT projects which attempt to build large generic data lakes containing unstructured information do not yield the expected results, not to mention exceeding the estimated costs.

Vendors who overrotate towards AIOps show a limited understanding of the dynamics of the buyer and persona within an organization. Many vendors aspire to sell to new users, but their heritage prevents this from becoming a reality. There are various requirements in a large enterprise, which drive the buyers towards different solution sets. Assuming that:

“In five years' time, Gartner envisions that, for a number of leading enterprises, today's disjointed array of domain-specific monitoring tools will have given way to what is fundamentally a two-component system. The first component will be an AIOps platform, and the second component will be some kind of DEM capability.”
“Deliver Cross-Domain Analysis and Visibility With AIOps and Digital Experience Monitoring” Published 5 July 2018 - ID G00352799

A grand vision, but not realistic. DEM has limited deployments today. Also, many marketing organizations use DEM technologies, but they are disconnected from other monitoring tools or teams, often due to siloed teams. None of these tools are integrated. The concept of a generic platform to analyze this data is not realistic or obtainable. Disjointed tools will remain the reality in 5 years unless the enterprise magically retires legacy applications and architectures and moves to common, modern, cloud-based application architectures. We will still have an array of domain-specific tools (likely even more tools), but with different organizational structures and different employee skill sets. The change must happen in people and culture before the change in technology can begin to occur, beyond incremental improvements.

Please leave comments on the blog here or @jkowall on Twitter, whether you liked or disliked this. That way I can judge what topics are most relevant to you, my readers. Thank you!

