Friday, June 16, 2017

Instrumentation is Changing... Finally!

I've always been a fan of trying to solve complicated problems. As an end user, I applied various technologies and tools to diagnose some strange ones over the years. Applications have become increasingly decoupled and distributed, requiring monitoring and diagnostics to change significantly. Let's look at a short timeline of the changes which have occurred, why they were necessary, how they helped solve a technological shift, and which challenges remain.

Component monitoring
Why: Distributed systems became pervasive.
How overcome: High-end solutions (CA, BMC, HP) became commoditized (SolarWinds).

Event correlation
Why: Too many monitoring tools (still applies) created information overload.
How overcome: It wasn't; enterprises rely on antiquated tools, and many have given up.

Log analytics
Why: Diagnostics became too challenging in distributed systems.
How overcome: Splunk unlocked it, but ELK has commoditized it.

Front-end monitoring
Why: Customer centricity is the key to digital business.
How overcome: Hasn't yet been well adopted; penetration is still under 20%.

Time-series metrics
Why: Microservices created too many instances, creating too much data.
How overcome: Has not been solved, but open source is now growing up to handle scale.

Tracing
Why: Complex and distributed systems are difficult to diagnose.
How overcome: Beginning to evolve.

In each of these phases we've had commercial innovators, and over time they have been replaced by either open source or commoditized commercial solutions which are basic and low cost, typically because there were frankly bigger problems to solve.

I would say that 2010 to 2015 were the era of log analytics. At this point the technology is pervasive; 95% of companies I speak to have a strategy. Most of them use a combination of tools, but inexpensive or free software (not necessarily a trait shared by the hardware or cloud storage underneath it) is becoming the norm. Most enterprises are using at least two vendors today for log analytics. Typically one of them is Splunk, and very often the other is ELK. When this market was less mature ELK was more fragmented, but the solidified Elastic Stack platform which Elastic codified has made it a viable alternative.

The era of end-user monitoring peaked in 2015. Everyone was interested, but implementations remain rather small for performance monitoring and operational use cases. End-user experience is a critical area for today's digital businesses, but it often requires a level of maturity which is a challenge for most buyers. I do expect this to continue growing, but there will always be a gap between end-user monitoring and backend monitoring. Most vendors who have tried to build this as a standalone product have failed to gain traction. A recent example is SOASTA, who had good technology and decent growth, but failed to build a self-sustaining business; ultimately SOASTA was sold to Akamai.

2016 was the era of time-series metrics: that market peaked, and we saw a massive number of new technology companies flourish. Examples included increased traction by vendors like DataDog, and newer entrants including Wavefront, SignalFx, and others. We've just seen the start of M&A in this area: SolarWinds bought Librato in 2015, and more recently VMware bought Wavefront in 2017. We also have the commercial monitoring entrant Circonus, who have begun OEMing their time-series database under the IronDB moniker. The next phase of this market will be the maturation of time-series databases in the open source world. Examples may include Prometheus, but I'm keeping an eye on InfluxDB. As these options mature for monitoring, IoT, and other time-series use cases, they especially need to solve easier data storage at scale. The other missing component is front-ends like Grafana improving significantly with workflows and easier usage. Ultimately we may see an ELK-like stack emerge in the time-series world, but we'll have to wait and see.
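To see why microservices create "too much data" for time-series storage, consider a toy sketch. This is a hypothetical in-memory store (not any real TSDB), showing how tagged metrics multiply series cardinality as service instances are added:

```python
import time
from collections import defaultdict

class TinyTSDB:
    """Toy in-memory time-series store: one list of (timestamp, value)
    points per unique (metric name, sorted tag set) combination."""
    def __init__(self):
        self.series = defaultdict(list)

    def record(self, name, value, **tags):
        key = (name, tuple(sorted(tags.items())))
        self.series[key].append((time.time(), value))

    def cardinality(self):
        return len(self.series)

db = TinyTSDB()
# A monolith: one host reporting one latency metric -> 1 series.
db.record("http.latency_ms", 42, host="web-1")

# Microservices: every instance of every service adds its own series,
# so storage grows with (services x instances), not with code size.
for svc in ("auth", "cart", "search"):
    for instance in range(10):
        db.record("http.latency_ms", 42, service=svc, host=f"{svc}-{instance}")

print(db.cardinality())  # 1 + 3*10 = 31 unique series
```

Scale the loop to hundreds of services and autoscaled containers and the series count explodes, which is exactly the storage-at-scale problem the open source time-series databases are racing to solve.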

Due to the increase in microservices and the resulting Docker containers, orchestration layers are beginning to take hold in many organizations. As a result of these shifts in application architectures, 2017 seems to have more of a focus on tracing. It is great to see tracing take hold, as root-cause isolation in distributed systems requires good tracing (at least until ML matures in a meaningful way). Tracing also allows you to understand impact when you have service outages or degradations. Additionally, tracing can be used to tie together IT and business metrics and data. The trace is essentially the "glue" of monitoring, so to speak. By tagging and tracing you are creating a forensic trail; this has yet to apply within security, but it will! Gartner even began talking about the application to security not long ago in the research note titled "Application Performance Monitoring and Application Security Monitoring Are Converging" (G00324828, Cameron Haight, Neil MacDonald). Detailed tracing is what companies like AppDynamics and Dynatrace do, and it has been the core of their technologies since they were founded. These tools solve complex problems faster and do much more than just tracing, but the trace is the glue in the technology. Unfortunately for buyers, these monitoring and diagnostic technologies typically come with a high price tag, but they are not optional for digital businesses.
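The "glue" role of the trace comes down to propagating one identifier across every hop of a request. Here is a minimal sketch of that idea; the header names follow Zipkin's B3 convention, but the helper functions are hypothetical illustrations, not any vendor's API:

```python
import uuid

def start_trace():
    """Mint a new trace ID at the edge (e.g. a load balancer or web tier)."""
    return uuid.uuid4().hex

def outbound_headers(trace_id, parent_span_id=None):
    """Headers a service attaches to its downstream calls so every hop
    can later be stitched back into one end-to-end trace."""
    headers = {"X-B3-TraceId": trace_id,
               "X-B3-SpanId": uuid.uuid4().hex[:16]}
    if parent_span_id:
        headers["X-B3-ParentSpanId"] = parent_span_id
    return headers

# A front-end request enters the system:
trace_id = start_trace()
hop1 = outbound_headers(trace_id)                        # web tier -> cart service
hop2 = outbound_headers(trace_id, hop1["X-B3-SpanId"])   # cart -> inventory

# Every hop carries the same trace ID -- the glue that lets a backend
# error be tied to the business transaction that triggered it.
assert hop1["X-B3-TraceId"] == hop2["X-B3-TraceId"]
```

The same propagated IDs are what make the forensic-trail and security use cases possible: once every request leaves a correlated trail, anomalous paths can be reconstructed after the fact.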

Today's open source tracing projects require developers to do the instrumentation work; this differs from commercial APM tools, which auto-instrument and support common application frameworks. The open-source tooling is getting better, with standards like OpenTracing and front-ends such as Zipkin continually evolving. The problem is that these technologies still lack the automated capture you see in commercial tools. How often do I trace? What do I trace? When do I trace? If you expect developers to make these decisions all the time, I think there will be issues. Developers don't understand the macro picture of how their component's performance will fit into the bigger system. Also, once a service is written and reused by other applications, it's hard to understand the performance implications. I am currently experimenting with proxy-level tracing to help extend tagging and tracing in areas where you may not want or need heavier agents. I hope to share more on this in a future blog post.

The long-term goal, however, is to combine all of these siloed data sets and technologies using more advanced correlation capabilities, in addition to applying new algorithms and machine learning to the problem, an approach which is currently in its infancy. At AppDynamics we do this based on the trace itself, and we are evolving new capabilities in a unified manner, always tied back to a trace. Monitoring of digital businesses is going to remain an exciting space for quite a long time, requiring constant evolution to keep pace with evolving software and infrastructure architectures.

I’ll be giving an updated talk on monitoring and instrumentation which will cover a lot of what is in this article, going into more depth on instrumentation and tracing. I will premiere the new talk at the Full Stack Conference, October 23rd and 24th, 2017, in Toronto. I'm looking forward to contributing to this great event.
