Skip to main content

Instrumentation is Changing... Finally!

 I've always been a fan of trying to solve complicated problems. As an end user, I applied various technologies and tools to diagnose some strange ones over the years. Applications have become increasingly decoupled and distributed requiring the monitoring and diagnostics to change significantly. Let's look at a short timeline of the changes which have occurred, why these changes were necessary, how they helped solve a technological shift, and which challenges remain.

How Overcome
Component monitoring
Distributed systems became pervasive
High-end solutions (CA, BMC, HP, BMC) became commoditized (Solarwinds)
Event correlation
Too many monitoring tools (still applies) created information overload
Enterprises rely on antiquated tools, many have given up.
Log analytics
Diagnostics too challenging in distributed systems
Splunk unlocked it, but ELK has commoditized it.
Front-end monitoring
Customer centricity is the key for digital business
Hasn’t yet been well adopted, penetration is still under 20%.
Time-series metrics
Microservices created too many instances creating too much data
Has not been solved, but open source is now growing up to handle scale.
Complex and distributed systems are difficult to diagnose
Beginning to evolve.

In each of these phases we've had commercial innovators, and over time they have been replaced by either open source or commoditized commercial solutions which are basic and low cost. Typically as there were frankly bigger problems to solve.

I would say that 2010 to 2015 were the era of log analytics. At this point the technology is pervasive, 95% of companies I speak to have a strategy. Most of them use a combination of tools, but inexpensive or free software, not necessarily a trait which is shared by hardware or cloud storage, are becoming the norm. Most enterprises are using at least two vendors today for log analytics. Typically one of them is Splunk, and very often the other is ELK. When this market was less mature ELK was more fragmented, but not the solidified ElasticStack platform which Elastic codified has made it a viable alternative.

The era of end-user monitoring peaked in 2015. Everyone was interested, but implementations remain rather small for the use cases of performance monitoring and operational needs. End user experience is a critical area for today's digital businesses, but often requires a level of maturity which is a challenge for most buyers. I do expect this to continue growing, but there will always be a gap between end user monitoring and backend monitoring. Most vendors who have tried to build this as a standalone product have failed to gain traction. A recent example of this is SOASTA, who had good technology, and decent growth, but failed to build a self-sustaining business. Ultimately SOASTA was sold to Akamai.

2016 was the era of time series metrics, where that market peaked, and we saw a massive amount of new technology companies flourish. Examples of this were increased traction by vendors like DataDog, or newer cases including Wavefront, SingalFx, and others. We've just seen the start of M&A in this area Solarwinds buying Librato in 2015, and more recently VMWare buying Wavefront in 2017. We also have the commercial monitoring entrant Circonus who have begun OEMing their time-series database under the IronDB moniker. The next phase of this market will be the maturity of time-series databases in the open source world. Examples may include  Prometheus, but I'm keeping an eye on InfluxDB. As these options mature for monitoring, IoT, and other time-series use cases especially solving easier data storage at scale. The other missing component is front-ends like Grafana improving significantly with workflows and easier usage. Ultimately we may see an ELK like stack emerge in the time-series world, but we'll have to wait and see.

Due to the increase in microservices and the resulting Docker containers, orchestration layers are beginning to take hold in many organizations. As a result of these shifts in application architectures, 2017 seems to have more of a focus on tracing. The importance of tracing is great to see take hold, as root-cause isolation in distributed systems requires good tracing (at least until ML matures in a meaningful way). Tracing also allows you to understand impact when you have service outages or degradations. Additionally, tracing can be used to tie together IT and business metrics and data. The trace is essentially the "glue" of monitoring so to speak. By tagging and tracing you are essentially creating a forensic trail, although this has yet to apply within security, it will! Gartner even began talking about the application to security not long ago in the research note titled Application Performance Monitoring and Application Security Monitoring Are Converging G00324828, Cameron Haight, Neil MacDonald. The detailed tracing is what companies like AppDynamics and Dynatrace do, and has been the core of their technologies since they were founded based on these concepts. These tools solve complex problems faster and perform much more than just tracing, but the trace is the glue in the technology. Unfortunately for buyers, these monitoring and diagnostic technologies come with a typically high price tag, but they are not optional for digital businesses.

Today's open source tracing projects require developers to do work, this is different than how commercial APM tools work which auto instrument and support common application frameworks. The open-source tooling is getting better with standards like OpenTracing, and front-ends such as Zipkin continually evolving. The problem is these technologies still lack the automation for capture that you see in commercial tools. How often do I trace? What do I trace? When do I trace? If you expect developers to make these decisions all the time I think there will be issues. Developers don't understand the macro performance of how their component will fit into the bigger picture. Also once a service is written and reused for other applications, it's hard to understand the performance implications. I am interested and currently experimenting with proxy level tracing to help extend tagging and tracing in areas where you may not want or need heavier agents. I hope to share more on this on a future blog.

The long-term goal, however, is to combine all of these siloed data sets and technologies by using more advanced correlation capabilities, in addition to applying new algorithms and machine learning to the problem, which is currently in its infancy. Over at AppDynamics, we do this based on a trace itself, but we are evolving new capabilities, and doing so in a unified manner, going back to a trace. Monitoring of digital businesses is going to continue to be an exciting space for quite a long time, requiring constant evolution to keep pace with evolving software and infrastructure architectures.

I’ll be giving an updated talk on monitoring and instrumentation which will cover a lot of what is in this article. We will also go into more depth around instrumentation and tracing. I will premiere this new talk this October at the Full Stack Conference October 23rd and 24th 2017 in Toronto for tickets. I am Looking forward to contributing to this great event.  


Popular posts from this blog

Misunderstanding "Open Tracing" for the Enterprise

When first hearing of the OpenTracing project in 2016 there was excitement, finally an open standard for tracing. First, what is a trace? A trace is following a transaction from different services to build an end to end picture. The latency of each transaction segment is captured to determine which is slow, or causing performance issues. The trace may also include metadata such as metrics and logs, more on that later. Great, so if this is open this will solve all interoperability issues we have, and allow me to use multiple APM and tracing tools at once? It will help avoid vendor or project lock-in, unlock cloud services which are opaque or invisible? Nope! Why not? Today there are so many different implementations of tracing providing end to end transaction monitoring, and the reason why is that each project or vendor has different capabilities and use cases for the traces. Most tool users don't need to know the implementation details, but when manually instrumenting wi

NPM is Broken

As someone who bought and implemented NPM solutions, covered them as an analyst, and now watches the industry, one cannot help but notice that NPM(D) is broken. According to Gartner themselves, the data center is rapidly changing, the data center is going away, m aybe not as quickly as Capp states, but it’s happening. This is apparent by the massive public cloud growth posted by Amazon, Microsoft, and Google in their infrastructure businesses. This means that traditional appliance-based NPMD offerings will not work, nor will traditional ways of collecting packet data. Many of the flow offerings do not handle the new types of flows which these services generate, but most importantly they do not understand the internet, which is the most important part of assuring services in cloud hosted environments. The network itself is not just moving to overlay a-la NSX and ACI, it's moving inside of orchestrated containers, and new proxy/load balancing systems typically built off component

F5 Persistence and my 6 week battle with support

We've been having issues with persistence on our F5's since we launched our new product. We have tried many different ways of trying to get our clients to stick on a server. Of course the first step was using a standard cookie persistence which the F5 was injecting. All of our products which use SSL is being terminated on the F5, which makes cookie work fine even for SSL traffic. After we started seeing clients going to many servers, we figured it would be safe to use a JSESSIONID cookie which is a standard Java application server cookie that is always unique per session. We implemented the following Irule (slightly modified in order to get more logging): (registration is free) when HTTP_REQUEST { # Check if there is a JSESSIONID cookie if {[HTTP::cookie "JSESSIONID"] ne ""}{ # Persist off of the cookie value with a timeout of 2 hours (7200 seconds) p