Wednesday, August 1, 2018

AIOps is a feature

Amazing that it’s been two months since last writing, but it’s been a busy stretch launching new products and getting customer implementations under our belts. Since the well-oiled machine is accelerating, and additional help is inbound, there is more time in the day to analyze the market and the rapid changes occurring. There has been a lot of AI washing, even more than 10 months ago when I first wrote about this. As expected, AI is the new cloud (these Microsoft AI commercials are driving me crazy). If you don't have AI in your message, you are not relevant. Buyers of technology are eating up the wording even though it is far from reality, as I have explained. Generically attaching to a buzzword may be a good marketing move, but it creates a mess for the users of technology and for the analysts who cover and define markets. We are already seeing many end users of technology asking why they need so many different "AIOps" tools.

For those who see this term but may not quite grasp the meaning: AIOps is a new-ish Gartner market category which makes sense and resonates with the executives of most IT organizations. We are all suffering from information overload and increased complexity. These are problems which computers can solve better than people; whether you call this AI (which it's not) or machine learning, the results and impacts are game-changing for those operating digital businesses:

In this figure, the flow is to collect data, use it to engage with others internally and externally, and enact changes deemed suitable by the algorithms. This automated platform could do this unassisted and ultimately increase business value.

The concept of a closed loop for Enterprise IT Operations has been around for decades. Billions of dollars were spent by CA, BMC, IBM, and HP to create closed-loop automation systems which look a lot like what is shown above. If anything, we should learn from our historical failures. The main difference in AIOps is that "Big Data" and "Machine Learning" now exist inside a "Platform." The problem is that the typical enterprise uses dozens of automation tools and over a dozen monitoring tools, and trying to collect this data and make sense of it has been nearly impossible. The closest we have gotten is the use of generic tools to manage events from monitoring or logs from other components in the infrastructure. So, what did Gartner do? They created 11 categories to describe this group of AIOps tools, but they are so broad that any monitoring tool can fit into the schema.

  • Historical data management: Every monitoring tool must handle historical data; it is required to build reports and dashboards or to understand trends.
  • Streaming data management: Most monitoring systems use a combination of batching and streaming to collect, analyze and send monitoring data. Streaming data collection and analysis is a requirement for most APM and NPMD solutions.
  • Log data ingestion: Due to the uptake of solutions built on top of the ElasticSearch project, log analytics has become a feature of most monitoring tools. By my count, at least 10 new products which collect and analyze logs have been introduced this year.
  • Wire data ingestion: Although packet data is a useful data source, it becomes hard to scale in public or private clouds. Nonetheless, many products out there use packet data, but in today's environments, often this data is captured on each OS instance via distributed analysis versus taps/spans, which come with additional challenges or limitations. Interestingly enough, Gartner only mentions taps/spans and not distributed collection and analytics.  
  • Metric data ingestion:  Every monitoring tool must collect time-series metrics.
  • Document text ingestion: This is less well understood in the market, since many of the log analytics tools do an excellent job of ingesting human-readable documents and applying natural language processing to that data. Many of these advances are becoming table stakes, as the ElasticSearch platform now has more advanced capabilities built in, and they continually improve.
  • Automated pattern discovery and prediction: Just two years ago very few monitoring tools had automated pattern discovery, but today many solutions have capabilities to reduce the human effort involved in problem detection. The challenge in this category is that many data sources lose the relationships or structure of the data once ingested; tools can infer them from the data, but do so with a high number of false-positive correlations.
  • Anomaly detection: Similar to the last category, this has been around for a long time but has become more common since manually managing thresholds is not scalable or sustainable. Some products on the market have had this capability for a decade or longer.
  • Root cause determination: This is probably the area in this model where the most innovation is occurring. There are very few products on the market which do this effectively. Most of the offerings are using rule-based expert systems, but these have many flaws when they see new problems they were not programmed to handle. If this problem were solved well, then we wouldn't see massive multi-hour outages occurring regularly. In the next two years, this area will evolve quickly.  
  • On-premises delivery: This one is fascinating, as almost all software is becoming SaaS-delivered. Gartner is even now saying that by 2025, 80% of enterprises will migrate and then shut down their traditional data centers, versus 10% today. As those data centers close, SaaS delivery becomes the de facto method of purchasing the software used to manage applications and the infrastructure running in public or managed clouds.
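To ground the anomaly-detection category above: the alternative to manually managed thresholds is a baseline that adjusts itself as the data changes. A minimal sketch using a rolling z-score (a generic illustration, not any particular vendor's algorithm):

```python
from collections import deque
from statistics import mean, stdev

def make_detector(window=20, z_threshold=3.0):
    """Flag a metric sample as anomalous when it deviates more than
    z_threshold standard deviations from the rolling baseline."""
    history = deque(maxlen=window)

    def check(value):
        anomalous = False
        if len(history) >= 5:  # need a minimal baseline first
            mu, sigma = mean(history), stdev(history)
            if sigma > 0 and abs(value - mu) / sigma > z_threshold:
                anomalous = True
        history.append(value)
        return anomalous

    return check

check = make_detector()
# steady response times around 100 ms, then a spike
samples = [100, 102, 99, 101, 100, 98, 103, 100, 500]
flags = [check(v) for v in samples]
```

Real products layer seasonality, multi-metric correlation, and supervised feedback on top of something like this, but the core idea is that the threshold moves with the data instead of being set by hand.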

The primary drivers of any technology are the use cases it solves or the business benefit it creates. Within AIOps the use cases vary widely, since the roles and personae are broad, bringing countless differing and sometimes conflicting requirements. The AIOps platform must be infinitely configurable and customizable, which makes it a jack of all trades and a master of none. Specific personae need prescriptive workflows and out-of-the-box functionality to show value quickly. The net result is that AIOps technologies are in fact features of existing products, meeting the ever-changing data requirements. Most of the monitoring tools out there have added more data stores and capabilities to try to become this platform of choice. Adding analytics capabilities on top of the data stores, and ingesting additional data, is now a requirement. In my work running technical partnerships at AppDynamics there has been a significant uptick in integrations between products trying to fill this gap and become "the platform of choice," but in reality they are point solutions solving one or two key use cases.

More recently, in July of 2018, there was quite a shift within Gartner around AIOps. They have re-focused on cross-domain analysis, even further broadening the use cases for the data. When a solution is too broad it can no longer meet the user requirements, and this is likely the case for AIOps. Assuming one can extract and store all of the relevant data (a challenge unto itself), having a generic platform interpret data without the specific model coming from the originating tool is nearly impossible. Once you lose the meaning of the data, it is generic and disconnected from the business impact. More advanced tools do not collect everything; they self-tune what and how they collect, which means there are intentional gaps in the data based on the design of the tool. AIOps systems cannot make sense of data from tools which do not emit all of the data, and the reason all data cannot be captured or exported is to let those systems scale. The intelligence of more advanced monitoring tools is distributed within the system (agents, aggregators) and not merely centralized in the backend. Similarly, IT projects which attempt to build large generic data lakes containing unstructured information do not yield the expected results, not to mention exceeding the estimated costs.

Vendors who over-rotate towards AIOps show a limited understanding of the dynamics of the buyer and persona within an organization. Many vendors aspire to sell to new users, but their heritage prevents this from becoming a reality. There are varied requirements in a large enterprise, which drive buyers towards different solution sets. Consider Gartner's assumption:

“In five years' time, Gartner envisions that, for a number of leading enterprises, today's disjointed array of domain-specific monitoring tools will have given way to what is fundamentally a two-component system. The first component will be an AIOps platform, and the second component will be some kind of DEM capability.”
“Deliver Cross-Domain Analysis and Visibility With AIOps and Digital Experience Monitoring” Published 5 July 2018 - ID G00352799

A grand vision, but not a realistic one. DEM has limited deployments today. Many marketing organizations use DEM technologies, but these are disconnected from other monitoring tools, often because the teams are siloed. None of these tools are integrated. The concept of a generic platform to analyze this data is neither realistic nor obtainable. Disjointed tools will remain the reality in 5 years unless the enterprise magically retires legacy applications and architectures and moves to common, modern, cloud-based application architectures. We will have an array of domain-specific tools (likely even more tools), but different organizational structures along with different employee skill sets. The change must happen in people and culture before the change in technology can begin to occur, beyond incremental improvements.

Please leave comments on the blog here or @jkowall on Twitter. Let me know if you liked or disliked this, so that I can judge what topics are most relevant to you, my readers. Thank you!

Monday, March 12, 2018

Misunderstanding "Open Tracing" for the Enterprise

When I first heard of the OpenTracing project in 2016 there was excitement: finally, an open standard for tracing. First, what is a trace? A trace follows a transaction across different services to build an end to end picture. The latency of each transaction segment is captured to determine which is slow, or causing performance issues. The trace may also include metadata such as metrics and logs; more on that later.
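For illustration, a trace can be modeled as a tree of timed spans, one per service hop, and finding the slowest segment is the core diagnostic use case. A simplified sketch (the field names are my own, not any standard):

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    service: str            # which service handled this segment
    operation: str          # what it was doing
    duration_ms: float      # latency of just this segment (self time)
    children: list = field(default_factory=list)

def slowest_span(span):
    """Walk the trace tree and return the span with the highest latency."""
    worst = span
    for child in span.children:
        candidate = slowest_span(child)
        if candidate.duration_ms > worst.duration_ms:
            worst = candidate
    return worst

# one end-to-end transaction: frontend -> orders -> database
trace = Span("frontend", "GET /checkout", 40, [
    Span("orders", "createOrder", 35, [
        Span("mysql", "INSERT orders", 220),   # the slow segment
    ]),
])

bottleneck = slowest_span(trace)
```

Every APM vendor and tracing project stores a richer version of this structure; the differences lie in what metadata is attached and how the spans are stitched together across process boundaries.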

Great, so if this is open it will solve all the interoperability issues we have, and allow me to use multiple APM and tracing tools at once? It will help avoid vendor or project lock-in, and unlock cloud services which are opaque or invisible? Nope! Why not?

Today there are many different implementations of tracing providing end to end transaction monitoring, and the reason is that each project or vendor has different capabilities and use cases for the traces. Most tool users don't need to know the implementation details, but when manually instrumenting with an API, the implementation must be well understood, along with the use cases for the data. When you look at language-specific support there is even more variability; what you can do in Java is worlds apart from what is available for Golang and Python. It gets even more complicated as every vendor and open source tool uses different language to describe what they do and what they call things.

The Enterprise uses a wide range of technologies which must be cobbled together to make their applications work. Some of the custom apps are written in Java, .NET, and other languages, much of it a decade old. Other parts of the stack are packaged applications such as those provided by Oracle, Microsoft, SAP, and many more. These often work with messaging systems spanning both open source and commercial tools, using proprietary protocols such as those offered by Tibco, Mulesoft, Oracle, and Microsoft, and open source projects such as ActiveMQ and RabbitMQ. Finally, there are modern services built in newer languages, for example Golang or Python, which often use native cloud services and PaaS platforms. All of these must work together to deliver a high-quality user experience and the expected performance, resulting in desirable business outcomes. Don't forget that each of these technologies often has data store and database requirements which must also work for them to function. The result is that identification, isolation, and remediation of problems is a big challenge.

In many Enterprise organizations, each of these "hops" is managed by a different team or subject matter expert, who often uses another siloed tool for monitoring and diagnostics. The enterprise APM providers build end to end views across these technologies, both old and new. Unified views are incredibly hard to achieve technically and culturally, and even more difficult in production, under heavy load, while minimally affecting the performance of the instrumented transactions.

The counterpoint to the Enterprise is smaller, more modern companies who build things differently. They create customized stacks of open source and develop tools and technologies which can go very deep into the infrastructure layers. This subset of companies is not the Enterprise; they have typically been running in cloud or containers since the inception of the company. They make different decisions and are happy to leverage open source projects, which they often customize extensively. They write their own instrumentation for various purposes, including monitoring and root-cause analysis. OpenTracing attempts to standardize this custom-coded instrumentation with a standard API and vocabulary. Once the developer writes the instrumentation, how a tool or platform consumes the data is not part of the standard; it's not a problem OpenTracing is trying to solve.

The result of this decision is that if a user has implemented OpenTracing with a specific vendor, let's call it "Tool A," that vendor must release libraries or implementations of OpenTracing for each language. Yes, each of the dozens of languages needs an implementation from each tool, with specific language features to match what the tool does. If I were to switch to a different tool, such as "Tool B," that would mean changing the libraries, which the vendor hopefully supplied for each language. Implementation of a new tool requires code changes to connect the library, plus any language-specific implementation changes. For a mature language such as Java, this would require changing the library and the implementation at the same time, since the propagation formats of the tools are incompatible. OpenTracing doesn't solve the interoperability problem, so what is the "open standard" attempting to solve? Well, for one thing, it allows those making gateways, proxies, and frameworks to write instrumentation. That should, in theory, make it easier to get traces connected, but once again the requirement to change implementation details for each tool is a problem.
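To make the propagation incompatibility concrete, here is a stdlib-only sketch of two hypothetical tracers behind the same inject/extract pattern. The header names are invented (Tool B is loosely modeled on Zipkin's B3 headers); the point is that a downstream service using "Tool B" cannot continue a trace started by "Tool A":

```python
import uuid

class ToolATracer:
    """Hypothetical vendor A: one combined header."""
    def inject(self, trace_id, headers):
        headers["x-toola-trace"] = f"{trace_id}/{uuid.uuid4().hex[:8]}"
    def extract(self, headers):
        value = headers.get("x-toola-trace")
        return value.split("/")[0] if value else None

class ToolBTracer:
    """Hypothetical vendor B: separate headers, different encoding."""
    def inject(self, trace_id, headers):
        headers["X-B3-TraceId"] = trace_id.upper()
        headers["X-B3-SpanId"] = uuid.uuid4().hex[:16].upper()
    def extract(self, headers):
        value = headers.get("X-B3-TraceId")
        return value.lower() if value else None

# Service 1 instrumented with tool A, service 2 with tool B: the trace breaks.
trace_id = uuid.uuid4().hex
headers = {}
ToolATracer().inject(trace_id, headers)        # upstream service
continued = ToolBTracer().extract(headers)     # downstream service
```

Both classes could sit behind the same OpenTracing-style API, yet the wire formats never meet; that is exactly the gap the standard leaves open.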

Enterprise APM tools do a lot more than tracing. For well-understood languages like Java, APM tools collect metrics from the runtime (JMX) and from the code (BCI); they capture call stacks, SQL calls, and other details which are correlated back to the trace. Advanced tools can even do runtime instrumentation without code changes using BCI. If you were doing this with OpenTracing, it would require code changes to make the API calls. These code changes then become part of your code base and must be maintained and managed by your team. Don't forget that Enterprise APM tools go well beyond backend tracing: they capture metrics and traces on the front end (mobile/web), infrastructure, log capture, and even correlation to other APIs. These additional capabilities are out of scope for the OpenTracing project.

So if the goal is solving any of the problems an open standard should solve, OpenTracing doesn't do a lot. If you read the marketing put out by various vendors and foundations, one would think it's the panacea, but the reality is far from it. OpenTracing does not standardize metrics, logs, or other structured data which tools consume.

For some reason there is a presumption that swapping agents out of an APM tool is difficult; this is entirely not the case. Vendors replace each other all the time; if you are interested in these stories, many are published, and countless others remain behind closed doors. It's much easier to change instrumentation done at runtime than instrumentation hardcoded into your products. This is especially the case when the API and standards are always changing, and many of the changes in OpenTracing have been breaking changes to the APIs.

Enterprise APM vendors must build a lot of instrumentation to support these various technologies, languages, and frameworks. Each tool vendor has dozens of engineers dedicated to just maintaining framework instrumentation, which is a waste of resources and could be more efficient if there were an actual open standard.

In the past six months, a working group formed thanks to the primary open source developers at Google, Pivotal, and Microsoft who created the OpenCensus project, focused on solving these issues including interoperability. OpenCensus is a single distribution of libraries that automatically collects traces and metrics from your app, displays them locally, and sends them to any analysis tool. OpenCensus is a pluggable implementation/framework that unifies instrumentation concepts that were traditionally separate, consisting of three major components: tracing, tagging, and metrics.

As part of this project, we are defining TraceContext, a standard header which tool vendors and open source projects can use to propagate information, otherwise known as a wire protocol. We are building this for HTTP in the first version. The organizations behind this standard include the major APM vendors (for the most part), all major public cloud vendors, and some private cloud vendors.
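For illustration, the header format (as it appears in the W3C Trace Context draft, where the work landed) packs a version, a 16-byte trace-id, an 8-byte parent span-id, and flags into a single `traceparent` value; the example IDs below come from the spec's samples. A minimal build/parse sketch:

```python
import re
import secrets

def make_traceparent(trace_id=None, span_id=None, sampled=True):
    """Build a traceparent header: version-traceid-spanid-flags."""
    trace_id = trace_id or secrets.token_hex(16)  # 16 bytes -> 32 hex chars
    span_id = span_id or secrets.token_hex(8)     # 8 bytes  -> 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<span_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)

def parse_traceparent(header):
    """Return the parsed fields, or None if the header is malformed."""
    match = TRACEPARENT_RE.match(header)
    return match.groupdict() if match else None

header = make_traceparent(trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
                          span_id="00f067aa0ba902b7")
parsed = parse_traceparent(header)
```

Because every hop only needs to read and re-emit this one header, any proxy, cloud service, or tracer can participate in the same trace regardless of which backend ultimately analyzes it.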

Finally, although OpenTracing gets tremendous buzz, it's not as popular as one might think. When looking at distributed tracing on GitHub, OpenTracing is not heavily mentioned; in fact, the most used OSS projects are tools rather than APIs. Leading OSS projects include OpenZipkin and Jaeger (Uber/Red Hat) along with SkyWalking (Huawei). All of these projects have a wide range of committers, especially OpenZipkin, which started at Twitter and is community driven, while having ties with Pivotal. These organizations are all intimately involved in OpenCensus, along with the main committers to these tools. Most of the marketing fails to mention these successful projects even though they have higher adoption and are already compatible with many frameworks, API gateways, PaaS, and public clouds, unlike OpenTracing, which has limited use aside from a handful of small APM vendors and Red Hat, who have a vested interest in its adoption.

Since this is a wire protocol, the public cloud providers, frameworks, and proxies such as linkerd and Envoy can implement it, and the data can be natively consumed by any tool, creating true interoperability. The value is in the analytics and context, not the instrumentation (for the most part). Getting behind this standard is vital for the APM industry, as it will solve user concerns about open data exchange and interoperability. Please join the OpenCensus project and help us address the real problem, rather than create new future pain.

This blog post is the opinion of Jonah Kowall and is not affiliated with his employer or other parties.

Wednesday, September 20, 2017

Artificial Intelligence in Digital Operations

Not a week goes by where I don't see a vendor claiming to have applied Artificial Intelligence (AI) to running digital businesses (there was a new one this week, but I began writing this beforehand). The list of vendors continues to grow, and when you dig into the technology, the marketing often oversteps the use of the AI term. Let's take a step back and understand what AI is from a computer science perspective.

The traditional problems (or goals) of AI research include reasoning, knowledge, planning, learning, natural language processing, and perception, as per Wikipedia. The other interesting trend is the application of the AI effect. This term describes how, as we apply models and data to problems and create algorithms to solve them, we move the problem outside the scope of AI. Thus we abuse the term AI to describe much of what we do in technology. Most of the technology itself is not complex when it comes to building self-learning models, yet it is described in mathematical terms.

Users expect AI to do the work for them, providing insights and knowledge they may not possess, or which would otherwise require additional learning and studying to reach on their own. The net effect is that we expect magic from these intelligent machines, hence the term AI. We see this very often in our digital assistants, whether you use the common ones (Alexa, Google Assistant, Siri, or Cortana) or more advanced offerings. Unfortunately, there are few companies outside of a couple dozen building actual AI. Outside of the powerhouses, AI today represents programming to produce outcomes we expect a human to come to, not providing us with additional insight. These poor AI solutions are often represented by a decision tree and focused knowledge along that tree. The downside is that gaps in the decision tree or knowledge become problems which the AI can no longer solve. These expert systems show their seams quickly. The intention would be for the AI to adapt, but that has yet to occur in the market. Humans, on the other hand, can react to and derive new creative decisions based on reasoning, broader knowledge (such as technical fundamentals), and perception. Computers do not have these capabilities unless they are programmed, which is a challenge today. This gap between humans and machines is one reason that we do not have automated heavy machinery when human lives are at risk (flights, military, cruise ships, and others). While people do make mistakes, they can react to the unknown, and computers typically cannot do this well.
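A toy example makes the decision-tree brittleness concrete (the rules here are invented for illustration): anything outside the encoded knowledge falls through to "unknown," no matter how obvious the answer would be to a human.

```python
# A toy rule-based "AI" for triaging alerts: it only knows what it
# was explicitly programmed to know.
RULES = [
    (lambda a: a["metric"] == "cpu" and a["value"] > 90, "scale out the service"),
    (lambda a: a["metric"] == "disk" and a["value"] > 95, "clean up old logs"),
]

def triage(alert):
    for condition, advice in RULES:
        if condition(alert):
            return advice
    return "unknown: escalate to a human"

known = triage({"metric": "cpu", "value": 97})
# Memory exhaustion is an everyday problem, but no rule covers it.
novel = triage({"metric": "memory", "value": 99})
```

A human operator would recognize the memory alert instantly from general knowledge; the expert system cannot, because it was never programmed to, which is exactly the gap described above.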

When it comes to applying "AI" to IT Operations (monitoring and automation are my key focus areas), the industry is immature. For the most part, we've seen textual analysis, anomaly detection, topological analysis, regression analysis, and historical analysis. Some of these can be quite advanced by incorporating supervised learning and neural networks. As more of these capabilities are combined, they can almost seem intelligent and can help humans manage today's scale requirements.

The downsides are that today's AI solutions fail to incorporate general knowledge of how systems operate, cannot do research or provide advice which is not programmed, and lack "intelligence" beyond helping humans cope with challenges. The unsupervised learning capabilities are limited today and cannot create newly learned outcomes beyond those which are programmed. These defined rule books consist of expected outcomes which the computer then follows, rather than the intelligence required to create new outcomes. Incorporating broader sets of knowledge such as StackOverflow, Quora, vendor knowledge bases, or other great sources of knowledge accessible on the internet would allow computers to derive new outcomes and advice. The use of these rich data sources is in its infancy; once this knowledge becomes integrated and automated, it will provide a massive benefit to those operating complex systems.

As things evolve and real intelligence starts to take hold this will change, but we are a long way from that point, especially in digital operations today. We have some customers at AppDynamics thinking about the orchestrated enterprise of the future, and how these intelligent agents can help them make better and faster business and operational decisions. We are eagerly working with these customers as a trusted partner, and we look forward to hearing your aspirations, ensuring our alignment on the future state of AppDynamics.

Thursday, August 24, 2017

Digital Business Operations : Fire Your CMDB

As outlined in a prior post, Digital Business operations require new thinking and new technologies. As Operations evolves to meet the needs of Digitalization, so too must the core systems of record for operations. When running a data center with physical assets, or even user assets such as laptops, printers, and desktops, the CMDB was a useful construct to understand what you had; things were static, and thus the problem was more easily solved. In reality, almost no one had an accurate CMDB; they most often hovered around 80% coverage, based on the beliefs of staff, and were often driven by a combination of automated and manual processes. The use cases for the CMDB are often tied to ITSM processes such as request, incident, problem, and change management. Good data capture recording asset and component ownership and configuration made these processes more robust, reliable, and accurate.

Discovery tools which crawled technologies or leveraged the network for data discovery were accurate, leading to a well-maintained CMDB. In my personal experience, I found the network approach to be great (I was a very early nLayers customer at Thomson Reuters), but it had challenges around packet capture and aggregation. These remain challenges today with any packet-based collection and are made significantly worse in public cloud environments, which were not an issue last decade.

When virtualization entered the fold, the number of workloads increased and became more dynamic, but it wasn’t a major problem for the existing systems to handle this change, aside from an increase in scale. As applications evolved, configuration moved out of the application server definition, for example database connections, connection pools, message queues, Memcache systems, and other components. Those have transitioned to being defined in the runtimes, and discovery tools increasingly have issues collecting configuration information. Most enterprises have dozens of configuration management tools and automation stacks, ranging from legacy solutions provided by BMC, HP, IBM, CA, and others to new-ish open source such as Chef, Puppet, Ansible, Salt, and more. Today these teams are looking at orchestrating infrastructure and creating new layers; examples include the open source project Terraform. The reason for the fragmentation in configuration management is evolving applications and associated infrastructure which also depend on older applications running on classical infrastructure (e.g. VMware, Tibco, and mainframe). Over time the stack becomes complicated and costly. None of the new players in this space seem to support legacy technologies, and the legacy vendor solutions make it cumbersome to support modern architectures. The result is that we have a mess, and no path forward to reduce debt.

In today’s world with a high degree of self-service (public and private cloud including SaaS), containers, and orchestration, these discovery tools do not work. The processes which make up an ITSM strategy, often underpinned by ITIL (more on this later), no longer function in an efficient and highly automated system. Finally, building dependency maps and graphs is no longer possible or feasible via a centralized repository. Adding technology support to a non-functional process will not fix the problem. For example, CMDB discovery tools which add Docker support or attempt to handle Kubernetes or Swarm are missing the point and lack the capability to collect data from ephemeral systems. The technology is not the only thing which has changed, but also the desired business outcomes. The net result is that agility is paramount, implemented by cross-functional product engineering teams. That shift requires a culture change within IT, and yesterday's solutions do not support these initiatives.

The business demands rapid innovation via incremental but continuous improvement, which results in a high frequency of change across infrastructure (physical and logical), applications, and environments. Discovering and controlling these systems is a challenge from a security and audit perspective, but also from a service assurance perspective. Decentralized IT organizations driven by the business need to move quickly and experiment, and the mandate of innovation often contradicts the methodologies of centralized IT organizations. Technology which relies upon access to systems that are often outside IT's area of control is an ever-growing challenge. These issues require us to shift data collection from an approach of crawling and cataloging data towards instrumentation (scraping web services is fine if your data is lightweight, but typically depth is not captured in these exposed API endpoints). The instrumentation approach provides a more accurate understanding of dependencies and user experience, a dynamic way to understand relationships between physical and logical components, and new use cases for this data to solve some of the problems the CMDB was designed to address. The next generation of CMDB will be dynamically modeled. I have seen sprinklings of this in a couple of startups, but they are missing the right type of instrumentation to provide real-time capabilities. Events, logs, and metrics are not enough; ingesting topological models is not sufficient; this must be instrumentation across application and network.
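As a sketch of what "dynamically modeled" could mean in practice, dependencies can be derived from instrumented call events rather than from a crawled inventory, so ephemeral components appear the moment they handle traffic (the event shape here is invented for illustration):

```python
from collections import defaultdict

def build_dependency_map(call_events):
    """Derive a service dependency graph from instrumented call events,
    rather than from a periodically crawled inventory."""
    graph = defaultdict(set)
    for event in call_events:
        graph[event["caller"]].add(event["callee"])
    return {svc: sorted(deps) for svc, deps in graph.items()}

# Events emitted by instrumentation as traffic flows, including a
# container that existed only briefly.
events = [
    {"caller": "web", "callee": "orders"},
    {"caller": "orders", "callee": "mysql"},
    {"caller": "orders", "callee": "rabbitmq"},
    {"caller": "web", "callee": "orders"},          # duplicates collapse
    {"caller": "batch-7f2a", "callee": "mysql"},    # ephemeral worker
]

dependencies = build_dependency_map(events)
```

A crawler polling on a schedule would likely never see the ephemeral worker; instrumentation records it because it actually made a call, which is exactly the property a CMDB replacement needs.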

My next post on this subject will focus on ITSM, which today is disconnected from both customer support and development. I am a strong believer in fixing these issues the right way with teamwork, and I’m looking forward to sharing my thoughts on this topic.

Monday, July 17, 2017

CIO Insights: Gartner CIO Survey Shows Canadian growth increasing

Being a Canadian, and to celebrate Canada turning 150 years old, I wanted to call out some interesting trend data I recently came across. Having spent a good amount of time visiting with customers in Canada in 2017, as I do each year, I have noticed changes in the country, quite possibly due to the changes in government and the political climate, or to change and progress in the private sector.

In Canada, there are economic challenges today, mostly due to a high reliance on natural resources and low market prices. Even with these difficulties, IT budgets in Canada are increasing, surpassing the global average growth rates based on the Gartner IT survey data.

When breaking this down further, only 38% of government employees projected budgetary growth versus 64% of non-government respondents in Canada. Last year the numbers were 16% and 57% respectively. Canada sees increased investment from a technology perspective based on the survey group. Much of that investment is coming from the private sector rather than the public sector, which is encouraging for the Canadian economy.

I will still be making a couple of trips to Canada in 2017, including speaking at the full stack conference in Toronto. I'll be presenting new content, including new thoughts around monitoring and instrumentation, and I'm looking forward to finalizing my presentation over the next couple of months. I wish you a happy birthday, Canada, and look forward to seeing the results of these changes. See everyone in Toronto.

Friday, June 30, 2017

Digital Business Operations the new mode for IT Operations

Recent conversations I've had with others in the IT Operations space helped formulate an idea I have been working on for a while concerning the future of Ops. The old mode of operating (call it ITIL-based, mode 1, or whatever you prefer) is not going to be the primary mode of operations in the long term. Senior leadership has acknowledged this and made concerted efforts to change hiring habits, spending, and roadmaps for technology and the delivery of new functionality.

More advanced teams operating in an agile manner, or mode 2, consist of smaller, more integrated teams made up of individuals with different skills. These teams are meeting today's digital business challenges. In most enterprises these teams are part of a bimodal strategy, but bridging the gap between mode 1 and mode 2 is something few have solved. I'm personally not sure this is possible, due to cultural and fundamental differences in beliefs and trust. In many organizations there is a high degree of variability in the level of investment between mode 1 and mode 2, but most leaders agree the future is moving more staff toward mode 2 due to business demand. Listening to customers is key.

There is such a fundamental shift occurring between these two modes of operation that we need two terms to explain how the teams operate and what they require from a people, process, and technology perspective. Bimodal is not quite bifurcated enough compared to what is happening in these enterprises today, so I'm coining the terms Digital Business Operations and IT Operations for these two teams. An acceleration of new technologies and capabilities will further separate the way these teams operate, making unification even more complicated than it is today. Thankfully, better infrastructure abstraction technologies will allow each of these teams to operate independently (naturally, for root cause we'll need bridges, which is our goal at AppDynamics). Many believe the answer will be the adoption of private and hybrid PaaS technologies, but I find these too complex and rooted in yesterday's problems. A better, lighter-weight approach will emerge, built upon containers and orchestration, making infrastructure abstraction simple versus the complexity we see today in private PaaS.

The changes in infrastructure, application architecture, and management are still confined to pockets of enterprises, and are often experimental in nature. Similarly, digital business and agility must allow for experimentation, but at some point the experiment solidifies into a core business tenet. This solidification is what will occur within Digital Business Operations, resulting in new, more repeatable (or industrialized) ways to handle processes, toolchains, and workflows which today are implemented inconsistently between organizations.

Digital Business Operations requires a fundamental change in specific areas we've taken for granted in mode 1 IT Operations. These include process frameworks such as ITIL, service management (ITSM, Ticketing, Bug Tracking), automation, and configuration management (especially the concept of a CMDB). Each one of these is a topic I'll hopefully cover in future posts. I'll share some of the challenges, ideas as to how the vendors may or may not solve these issues, and some insight into what practitioners or first mover organizations are doing to address these problems.

Friday, June 16, 2017

Instrumentation is Changing... Finally!

I've always been a fan of trying to solve complicated problems. As an end user, I applied various technologies and tools to diagnose some strange ones over the years. Applications have become increasingly decoupled and distributed, requiring monitoring and diagnostics to change significantly. Let's look at a short timeline of the changes which have occurred, why they were necessary, how they helped solve a technological shift, and which challenges remain.

| Phase | Driver | How Overcome |
| --- | --- | --- |
| Component monitoring | Distributed systems became pervasive | High-end solutions (CA, BMC, IBM, HP) became commoditized (SolarWinds) |
| Event correlation | Too many monitoring tools (still applies) created information overload | Enterprises rely on antiquated tools; many have given up |
| Log analytics | Diagnostics too challenging in distributed systems | Splunk unlocked it, but ELK has commoditized it |
| Front-end monitoring | Customer centricity is the key for digital business | Not yet well adopted; penetration is still under 20% |
| Time-series metrics | Microservices created too many instances, creating too much data | Not solved, but open source is growing up to handle scale |
| Tracing | Complex and distributed systems are difficult to diagnose | Beginning to evolve |

In each of these phases we've had commercial innovators, and over time they have been replaced by either open source or commoditized commercial solutions which are basic and low cost, typically because there were frankly bigger problems to solve.

I would say that 2010 to 2015 was the era of log analytics. At this point the technology is pervasive; 95% of companies I speak to have a strategy. Most use a combination of tools, but inexpensive or free software (not a trait shared by the hardware or cloud storage underneath) is becoming the norm. Most enterprises use at least two vendors for log analytics today, typically Splunk and, very often, ELK. When this market was less mature, ELK was more fragmented, but the solidified Elastic Stack platform which Elastic codified has made it a viable alternative.

The era of end-user monitoring peaked in 2015. Everyone was interested, but implementations remain rather small for the use cases of performance monitoring and operational needs. End-user experience is a critical area for today's digital businesses, but it often requires a level of maturity which is a challenge for most buyers. I do expect this to continue growing, but there will always be a gap between end-user monitoring and backend monitoring. Most vendors who have tried to build this as a standalone product have failed to gain traction. A recent example is SOASTA, which had good technology and decent growth but failed to build a self-sustaining business; ultimately SOASTA was sold to Akamai.

2016 was the era of time-series metrics, when that market peaked and we saw a massive number of new technology companies flourish. Examples include increased traction by vendors like Datadog, and newer entrants including Wavefront, SignalFx, and others. We've just seen the start of M&A in this area, with SolarWinds buying Librato in 2015 and, more recently, VMware buying Wavefront in 2017. We also have the commercial monitoring entrant Circonus, who has begun OEMing its time-series database under the IronDB moniker. The next phase of this market will be the maturity of time-series databases in the open source world; examples include Prometheus, and I'm keeping an eye on InfluxDB. These options are maturing for monitoring, IoT, and other time-series use cases, especially in making data storage at scale easier. The other missing component is front-ends like Grafana improving significantly with better workflows and easier usage. Ultimately we may see an ELK-like stack emerge in the time-series world, but we'll have to wait and see.
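To illustrate the data-volume problem that microservices create, a common mitigation in time-series stores is rollup (downsampling): averaging raw samples into coarser fixed-width buckets. This is a toy sketch with arbitrary numbers and interval, not any particular database's implementation:

```python
from collections import defaultdict

def rollup(samples, bucket_seconds=60):
    """Average raw (timestamp, value) samples into fixed-width buckets."""
    buckets = defaultdict(list)
    for ts, value in samples:
        # Align each sample's timestamp to the start of its bucket.
        buckets[ts - ts % bucket_seconds].append(value)
    return {start: sum(vals) / len(vals) for start, vals in buckets.items()}

# Four 15-second samples collapse into one 60-second average,
# trading resolution for a 4x reduction in stored points.
raw = [(0, 10.0), (15, 20.0), (30, 30.0), (45, 40.0)]
print(rollup(raw))  # {0: 25.0}
```

Multiplied across thousands of container instances, this resolution-for-volume trade is what lets open source time-series databases begin to handle scale.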

Due to the increase in microservices and the resulting Docker containers, orchestration layers are beginning to take hold in many organizations. As a result of these shifts in application architectures, 2017 seems to have more of a focus on tracing. It is great to see tracing take hold, as root-cause isolation in distributed systems requires good tracing (at least until machine learning matures in a meaningful way). Tracing also allows you to understand impact when you have service outages or degradations, and it can be used to tie together IT and business metrics and data. The trace is essentially the "glue" of monitoring. By tagging and tracing you are essentially creating a forensic trail; although this has yet to be applied within security, it will be! Gartner even began talking about the application to security not long ago in the research note titled "Application Performance Monitoring and Application Security Monitoring Are Converging" (G00324828, Cameron Haight, Neil MacDonald). Detailed tracing is what companies like AppDynamics and Dynatrace do, and it has been the core of their technologies since they were founded. These tools solve complex problems faster and do much more than just tracing, but the trace is the glue in the technology. Unfortunately for buyers, these monitoring and diagnostic technologies typically come with a high price tag, but they are not optional for digital businesses.

Today's open source tracing projects require developers to do the work; this differs from commercial APM tools, which auto-instrument and support common application frameworks. The open source tooling is getting better, with standards like OpenTracing and front-ends such as Zipkin continually evolving, but these technologies still lack the automated capture you see in commercial tools. How often do I trace? What do I trace? When do I trace? If you expect developers to make these decisions all the time, I think there will be issues. Developers don't understand the macro performance of how their component will fit into the bigger picture, and once a service is written and reused by other applications, it's hard to understand the performance implications. I am interested in, and currently experimenting with, proxy-level tracing to help extend tagging and tracing in areas where you may not want or need heavier agents. I hope to share more on this in a future blog.
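To make the "developers must do the work" point concrete, here is a toy sketch of manual span creation and trace-context propagation in pure Python. The tracer and service functions are invented for illustration; a real project would use an OpenTracing-compatible library, but the burden on the developer looks the same.

```python
import time
import uuid

class Tracer:
    """Toy tracer: collects spans that share a trace id."""
    def __init__(self):
        self.spans = []

    def span(self, name, trace_id=None):
        return Span(self, name, trace_id or uuid.uuid4().hex)

class Span:
    def __init__(self, tracer, name, trace_id):
        self.tracer, self.name, self.trace_id = tracer, name, trace_id

    def __enter__(self):
        self.start = time.time()
        return self

    def __exit__(self, *exc):
        # Record the span with its duration when the block exits.
        self.tracer.spans.append(
            {"trace": self.trace_id, "name": self.name,
             "ms": (time.time() - self.start) * 1000})

tracer = Tracer()

def backend(trace_id):
    # Every hop must explicitly accept and propagate the trace id --
    # exactly the manual work that commercial APM agents automate.
    with tracer.span("backend.query", trace_id):
        pass

def frontend():
    with tracer.span("frontend.request") as root:
        backend(root.trace_id)

frontend()
# Both spans carry the same trace id, letting a UI stitch them together.
assert tracer.spans[0]["trace"] == tracer.spans[1]["trace"]
```

Each decision here (where to start a span, what to name it, how to pass context) is left to the developer, which is the gap between today's open source tooling and auto-instrumenting agents.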

The long-term goal, however, is to combine all of these siloed data sets and technologies using more advanced correlation capabilities, in addition to applying new algorithms and machine learning to the problem, which is currently in its infancy. Over at AppDynamics we do this based on the trace itself, and we are evolving new capabilities in a unified manner, anchored to the trace. Monitoring of digital businesses is going to remain an exciting space for quite a long time, requiring constant evolution to keep pace with evolving software and infrastructure architectures.

I’ll be giving an updated talk on monitoring and instrumentation which will cover much of what is in this article, going deeper on instrumentation and tracing. I will premiere this new talk at the Full Stack Conference, October 23rd and 24th, 2017 in Toronto. I am looking forward to contributing to this great event.

Monday, October 10, 2016

That's too Expensive, the Pricing Battle

Pricing is a tricky beast. I've seen a lot of models out there, and each has pros and cons. First, it depends on who your target customers are. My experience is in enterprise software, which typically has a larger transaction price and volume, yet a smaller number of deals. When focusing on SMBs as a target, the models change along with the selling motion. For good reasons most software companies want to have both models, but I haven't seen many companies able to execute this strategy. You end up with vendors adopting one model and causing fractures in the way tooling is licensed; many times the economics don't equate to good business decisions for the end user or the vendor. Here are the various licensing models I've seen in the IT Operations Management space:
  • Application footprint or infrastructure footprint based pricing
    • Per node, per CPU, per application server, per JVM, per CLR, per runtime
  • User-based pricing
    • Per concurrent user, per named user, per monthly active user, per page view
This model can track the users being monitored (in the case of end-user experience), or the users using a tool. For example, in Service Desk it could be the number of help desk agents.
  • Storage based pricing
    • Per gigabyte consumed, per event consumed
The measurements become more challenging in highly dynamic or microservices environments, which causes additional issues regarding usage of specific technologies. Most apps consist of both legacy and modern technologies; hence the value of a solution to manage them differs.
Then there are the terms of the license, which can range from monthly billing to five-year commitments; these can include a minimum and/or a burst model. I've had some great discussions with analysts about value-based pricing. Although it's a loosely defined term, building a pricing model based on the value someone gets from the software sounds perfect in theory: how many problems are you detecting? Solving? Although this makes sense, calculating the "value" is a guess in most cases. With APM you can estimate the amount of time, money, or revenue saved, but it's still a challenge to build and measure clear ROI, and this becomes even harder with less customer-centric technologies.
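As a toy illustration of why the economics of the models above diverge, compare footprint-based and consumption-based pricing for the same environment. All prices and environment sizes here are invented for the sketch:

```python
def annual_cost_per_node(nodes, price_per_node=1200):
    """Footprint-based: cost scales with infrastructure size."""
    return nodes * price_per_node

def annual_cost_per_gb(gb_per_day, price_per_gb=2.5):
    """Consumption-based: cost scales with data ingested."""
    return gb_per_day * 365 * price_per_gb

# A small but chatty environment: 20 nodes emitting 100 GB/day.
print(annual_cost_per_node(20))  # 24000
print(annual_cost_per_gb(100))   # 91250.0
# The same tooling looks cheap or expensive depending on which
# dimension of the environment grows fastest -- one reason neither
# buyers nor vendors are ever satisfied with a single model.
```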

In my career I’ve now seen pricing from three distinct angles, I’m going to summarize what I’ve found in each of my roles. These are personal experiences, and your mileage may vary.

End User

As an end user, I bought tens of millions of dollars’ worth of software over my 17 years as a practitioner. I always tried various tricks to ensure I was getting the best deal possible for my employer. Licensing and pricing are a challenge. How do I pay as little as possible for the best solution for my needs? When can I afford to buy an inferior product to meet my budgetary needs? When should I request more budget to select a technology that will differentiate us as a business? Regardless, the net is that everything is overpriced, and I could never get the pricing low enough to remain satisfied. It didn't matter whether I was paying based on application, infrastructure, consumption, or on-demand pricing. Although my technology providers were my partners, I also felt I needed to extract the most value for the least amount of capital from them to serve our shareholders.


Analyst

As I transitioned over to an analyst role, I learned yet more tricks around pricing and deal negotiation. Most technologies go through a cycle from immaturity to mainstream, and finally into obsolescence. Gartner uses two models to describe this: the Hype Cycle, which follows a technology from its trigger through becoming productive, and the Market Clock, which addresses the lifecycle of a market and how technology becomes standardized, commoditized, and eventually replaced. These constructs help both end users and vendors understand how technologies mature and what pricing and competition to expect within a particular technology market. I often gave advice to vendors due to the number of licensing and pricing models I had seen, and what end users were asking. Clearly, there were always complaints by end users about any pricing model aside from open source. Everyone loves the idea of free software, yet there are many hidden costs to take into account. Which technology providers can deliver results is often more important than the licensing model.


Vendor

I was fortunate enough to run the AppDynamics pricing committee for about six months, and I learned a lot about how to license and price. This experience was the first time I had studied margins. AppDynamics software is available both on-premises and via SaaS delivery, which makes the model especially interesting: each delivery model has different margins and costs, and these have to be taken into account when determining a pricing model and discounting. I also learned, first hand, the struggles customers had with our pricing model. The net result is that regardless of how you price a solution, users complain about the pricing. There is no way to solve this problem that I have found, and it was rather depressing. The opinions and strategies across sales, product management, and marketing all differ, and each group has a perspective which is very challenging to rationalize. I am not sad to have moved along from that responsibility :)

My Take

Every model is flawed; pricing models are inflexible and software costs too much. If you don't bundle products, then quoting and licensing become complex; yet if you bundle, you end up with shelfware. End users want to pay for what they use, and yet they don't want commitments; they want "on-demand" pricing. Without commitments, most vendors have issues predicting revenue or demand. This is often an issue if you are using traditional hosting for the product, which most SaaS companies do to some extent for cost reasons. Software, unlike hardware, doesn't have the same type of fixed cost to deliver; the margins are different. End users and salespeople also want the model to be simple to understand, calculate, and rationalize.

I may share more secrets later, maybe around how to negotiate licenses and different ways to get leverage. Leave your comments here or via twitter @jkowall on what interests you on this topic.