
Instrumentation is Changing... Finally!

I've always enjoyed trying to solve complicated problems. As an end user, I applied various technologies and tools to diagnose some strange ones over the years. Applications have become increasingly decoupled and distributed, requiring monitoring and diagnostics to change significantly. Let's look at a short timeline of the changes which have occurred, why they were necessary, how each one addressed a technological shift, and which challenges remain.

| Phase | Why | How Overcome |
| --- | --- | --- |
| Component monitoring | Distributed systems became pervasive | High-end solutions (CA, BMC, HP) became commoditized (Solarwinds) |
| Event correlation | Too many monitoring tools (still applies) created information overload | Enterprises rely on antiquated tools; many have given up. |
| Log analytics | Diagnostics too challenging in distributed systems | Splunk unlocked it, but ELK has commoditized it. |
| Front-end monitoring | Customer centricity is the key for digital business | Hasn't yet been well adopted; penetration is still under 20%. |
| Time-series metrics | Microservices created too many instances, creating too much data | Has not been solved, but open source is now growing up to handle scale. |
| Tracing | Complex and distributed systems are difficult to diagnose | Beginning to evolve. |

In each of these phases we've had commercial innovators, and over time they have been replaced by either open source or commoditized commercial solutions which are basic and low cost, typically because there were, frankly, bigger problems to solve.

I would say that 2010 to 2015 was the era of log analytics. At this point the technology is pervasive; 95% of the companies I speak to have a strategy. Most of them use a combination of tools, and inexpensive or free software (not necessarily a trait shared by the underlying hardware or cloud storage) is becoming the norm. Most enterprises are using at least two vendors today for log analytics. Typically one of them is Splunk, and very often the other is ELK. When this market was less mature ELK was more fragmented, but now the solidified Elastic Stack platform which Elastic codified has made it a viable alternative.
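To make the log analytics phase a little more concrete: distributed services are far easier to diagnose when they emit structured, machine-parseable logs that a backend like Splunk or the Elastic Stack can index. The snippet below is a minimal sketch in plain Python, with no vendor agent assumed; the service and field names (request_id, duration_ms) are illustrative, not a standard.

```python
import json
import logging
import sys
import time

# Minimal JSON log formatter: structured fields make it easy for a log
# analytics backend (Splunk, Elastic Stack, etc.) to index and query events.
class JsonFormatter(logging.Formatter):
    def format(self, record):
        event = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "service": "checkout",            # illustrative service name
            "message": record.getMessage(),
        }
        # Attach any extra fields passed via logging's `extra` argument.
        for key in ("request_id", "duration_ms"):
            if hasattr(record, key):
                event[key] = getattr(record, key)
        return json.dumps(event)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# One structured event per line; a shipper (e.g. Filebeat) would forward these.
logger.info("payment authorized", extra={"request_id": "req-42", "duration_ms": 87})
```

The point isn't the code itself, but that one JSON event per line gives the analytics layer something it can slice by service, request, and latency without fragile text parsing.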

The era of end-user monitoring peaked in 2015. Everyone was interested, but implementations remain rather small for the use cases of performance monitoring and operational needs. End-user experience is a critical area for today's digital businesses, but it often requires a level of maturity which is a challenge for most buyers. I do expect this to continue growing, but there will always be a gap between end-user monitoring and backend monitoring. Most vendors who have tried to build this as a standalone product have failed to gain traction. A recent example is SOASTA, which had good technology and decent growth but failed to build a self-sustaining business. Ultimately SOASTA was sold to Akamai.

2016 was the era of time-series metrics, when that market peaked and we saw a massive number of new technology companies flourish. Examples of this were increased traction by vendors like DataDog and newer entrants including Wavefront, SignalFx, and others. We've just seen the start of M&A in this area, with Solarwinds buying Librato in 2015 and, more recently, VMware buying Wavefront in 2017. We also have the commercial monitoring entrant Circonus, who have begun OEMing their time-series database under the IronDB moniker. The next phase of this market will be the maturity of time-series databases in the open source world. Examples may include Prometheus, but I'm keeping an eye on InfluxDB as these options mature for monitoring, IoT, and other time-series use cases, especially in making data storage at scale easier. The other missing component is front-ends like Grafana improving significantly with better workflows and easier usage. Ultimately we may see an ELK-like stack emerge in the time-series world, but we'll have to wait and see.
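To show what the instrumentation side of this looks like, here is a minimal sketch using the open source Prometheus Python client to expose a couple of metrics over HTTP. The metric names, labels, and endpoint are illustrative assumptions, and a Prometheus server would still need to be configured to scrape the exposed port.

```python
import random
import time

# Prometheus Python client: exposes metrics for a Prometheus server to scrape.
from prometheus_client import Counter, Histogram, start_http_server

# A counter for completed requests and a histogram for their latency.
REQUESTS = Counter("app_requests_total", "Total requests handled", ["endpoint"])
LATENCY = Histogram("app_request_latency_seconds", "Request latency in seconds", ["endpoint"])

def handle_request(endpoint):
    # Time the (simulated) work and record it against the endpoint label.
    with LATENCY.labels(endpoint=endpoint).time():
        time.sleep(random.uniform(0.01, 0.1))   # stand-in for real work
    REQUESTS.labels(endpoint=endpoint).inc()

if __name__ == "__main__":
    # Expose metrics on :8000/metrics for the scraper.
    start_http_server(8000)
    while True:
        handle_request("/checkout")
```

Each microservice instance exposing its own counters and histograms is exactly what produces the "too many instances, too much data" problem the table above describes; the storage layer has to absorb it.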

Due to the increase in microservices and the resulting Docker containers, orchestration layers are beginning to take hold in many organizations. As a result of these shifts in application architectures, 2017 seems to have more of a focus on tracing. It's great to see the importance of tracing take hold, as root-cause isolation in distributed systems requires good tracing (at least until ML matures in a meaningful way). Tracing also allows you to understand impact when you have service outages or degradations. Additionally, tracing can be used to tie together IT and business metrics and data. The trace is essentially the "glue" of monitoring, so to speak. By tagging and tracing you are essentially creating a forensic trail; this has yet to be applied within security, but it will be! Gartner even began talking about the application to security not long ago in the research note titled "Application Performance Monitoring and Application Security Monitoring Are Converging" (G00324828, Cameron Haight, Neil MacDonald). This kind of detailed tracing is what companies like AppDynamics and Dynatrace do, and it has been at the core of their technologies since they were founded on these concepts. These tools solve complex problems faster and do much more than just tracing, but the trace is the glue in the technology. Unfortunately for buyers, these monitoring and diagnostic technologies typically come with a high price tag, but they are not optional for digital businesses.

Today's open source tracing projects require developers to do work; this is different from how commercial APM tools work, which auto-instrument and support common application frameworks. The open-source tooling is getting better, with standards like OpenTracing and front-ends such as Zipkin continually evolving. The problem is that these technologies still lack the automated capture you see in commercial tools. How often do I trace? What do I trace? When do I trace? If you expect developers to make these decisions all the time, I think there will be issues. Developers don't understand the macro picture of how their component's performance will fit into the larger system. Also, once a service is written and reused for other applications, it's hard to understand the performance implications. I am interested in, and currently experimenting with, proxy-level tracing to help extend tagging and tracing in areas where you may not want or need heavier agents. I hope to share more on this in a future blog.
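To show what "developers have to do work" means in practice, here is a minimal sketch of manual instrumentation against the OpenTracing Python API. It relies on the default no-op tracer for simplicity; in a real deployment you would register a concrete tracer (for example one that reports to Zipkin), and the span names and tags here are illustrative assumptions, not anything prescribed by the standard.

```python
import opentracing

# opentracing.tracer is a no-op tracer until a concrete implementation
# (e.g. a Zipkin- or Jaeger-compatible one) is registered at startup.
tracer = opentracing.tracer

def fetch_price(sku, parent_span):
    # Every traced operation is an explicit decision by the developer:
    # what to name the span, which tags to attach, where it starts and ends.
    with tracer.start_span("fetch_price", child_of=parent_span) as span:
        span.set_tag("sku", sku)                 # illustrative business tag
        span.set_tag("component", "pricing-svc")
        price = 42.00                            # stand-in for the real lookup
        span.log_kv({"event": "price_resolved", "price": price})
        return price

def checkout(sku):
    # The parent span ties the child operations into one end-to-end trace
    # that a backend such as Zipkin can later stitch together and display.
    with tracer.start_span("checkout") as parent:
        parent.set_tag("sku", sku)
        return fetch_price(sku, parent)

if __name__ == "__main__":
    checkout("ABC-123")
```

Every span, tag, and parent/child relationship above is a manual choice, which is exactly the burden commercial auto-instrumenting agents remove.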

The long-term goal, however, is to combine all of these siloed data sets and technologies by using more advanced correlation capabilities, in addition to applying new algorithms and machine learning to the problem, which is currently in its infancy. Over at AppDynamics we do this based on the trace itself, and we are evolving new capabilities in a unified manner, always coming back to the trace. Monitoring of digital businesses is going to continue to be an exciting space for quite a long time, requiring constant evolution to keep pace with evolving software and infrastructure architectures.

I’ll be giving an updated talk on monitoring and instrumentation which will cover a lot of what is in this article, and will also go into more depth on instrumentation and tracing. I will premiere this new talk at the Full Stack Conference, October 23rd and 24th, 2017, in Toronto; see http://2017.fsto.co for tickets. I am looking forward to contributing to this great event.
