
The End of my Affair with Apdex

A decade ago, I first learned of Apdex thanks to a wonderful technology partner, Coradiant. At the time, I was running IT operations and web operations, and brought Coradiant into the fold. Coradiant was ahead of its time, providing end-user experience monitoring via packet analysis. The network-based approach was effective in the days when the web was less rich. Coradiant was one of the first companies to embed Apdex in its products.

As a user of APM tools, I was looking for the ultimate KPI, and the concept of Apdex resonated with me and my senior management. A single magical number gave us an idea of how well development, QA, and operations were doing in terms of user experience and performance. Had I found the metric to rule all metrics? I thought I had, and I was a fan of Apdex for many years leading up to 2012, when I started to dig into the true calculations behind this magical number.

As my colleague Jim Hirschauer pointed out in a 2013 blog post, the Apdex index is calculated by counting how many requests fall into "satisfied," "tolerating," and "frustrated" buckets against a target response time, then combining those counts in a simple formula. Whether a user is actually "satisfied" or "tolerating" depends on a lot more than performance, yet the applied use cases for Apdex unfortunately focus on performance only. Performance is still a critical criterion, but the definition of satisfied or tolerating is situational.
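For reference, the standard calculation behind that number is simple: against a target time T, a request is satisfied at or under T, tolerating up to 4T, and frustrated beyond that. A few lines of Python make it concrete (the sample times here are made up):

    def apdex(response_times, t):
        # Standard Apdex: satisfied <= T, tolerating <= 4T, frustrated beyond.
        satisfied = sum(1 for rt in response_times if rt <= t)
        tolerating = sum(1 for rt in response_times if t < rt <= 4 * t)
        # Frustrated requests score zero; tolerating requests count half.
        return (satisfied + tolerating / 2) / len(response_times)

    # Hypothetical samples (seconds) against a 0.5s target:
    print(apdex([0.2, 0.4, 0.9, 1.7, 6.0], t=0.5))  # 0.6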

I'm currently writing this from 28,000 feet above northern Florida, over barely usable in-flight internet that makes me wish I had a 56k modem. I am tolerating the latency and bandwidth, but not the $32 I paid for this horrible experience; hey, at least Twitter and email work. I self-classify as an "un-tolerating" user, but I am happy with some connectivity. People who know me will tell you I have a bandwidth and network problem, so my threshold for a tolerable network connection is abnormal. My Apdex score would differ greatly from the average user's due to my personal perspective, just as a business user's would differ from a consumer's based on their specific situation as they use an application. Other criteria that affect satisfaction include the type of device in use and that device's connection type.
The thing that is missing from Apdex is the notion of a service level. There are two ways to manage service level agreements. First, a service level may be calculated, as we do at AppDynamics with our automated baselines. Second, it may be a static threshold the customer expects; we support this use case in our analytics product. Between them, these two approaches cover the right ways to measure and score performance against an SLA.
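To make the distinction concrete, here is a minimal sketch of the two styles, with hypothetical numbers and function names; this is an illustration, not how any particular product implements it:

    from statistics import mean, stdev

    def violates_static_sla(rt, threshold=2.0):
        # Static SLA: the customer expects every request under a fixed time.
        return rt > threshold

    def violates_baseline(rt, history, sigmas=3):
        # Calculated SLA: flag deviations from recent observed behavior.
        return rt > mean(history) + sigmas * stdev(history)

    history = [0.4, 0.5, 0.45, 0.6, 0.5]
    print(violates_static_sla(1.2))        # False: within the static SLA
    print(violates_baseline(1.2, history)) # True: far above the baseline

The same 1.2-second request passes one test and fails the other, which is exactly why the service level, not a fixed satisfaction bucket, should anchor the score.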

This is AppDynamics’ Transaction Analytics Breakdown for users who had errors or poor user experience over the last week, and their SLA class:


Simpler SLAs are in the core APM product. Here is a view showing requests that fell below the calculated baseline and which of them were in SLA violation.

Combining an SLA with Apdex results in a meaningful number. Unfortunately, I cannot take credit for this idea. Alain Cohen, one of the brightest minds in performance analysis, was the co-founder and CTO (almost co-CEO) of OPNET. Alain discussed with me his ideas for a new performance index concept called OpDex, which fixes many of the Apdex flaws by applying an SLA. Unfortunately, Alain is no longer solving performance problems for customers; he has decided to take his skills and talents elsewhere after a nice payout.

Alain shared his OpDex plan with me in 2011; thankfully, all of the details are outlined in this patent, which was granted in 2013. But OPNET's great run of innovation has ended, and Riverbed has failed to pick up where OPNET left off; at least they have patents to show for these good ideas and concepts.

The other issue with Apdex is that individual users are ignored by the formula. CoScale outlined these issues in a detailed blog post, explaining that histograms are a far better way to analyze a high-variance population. This is no different than looking at performance metrics coming from the infrastructure layer: histograms and heat charts tend to provide much better visual analysis.
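To see why the formula hides users, consider two made-up latency populations that earn an identical Apdex score despite describing completely different experiences:

    def apdex(samples, t):
        # Same standard formula as above.
        satisfied = sum(1 for s in samples if s <= t)
        tolerating = sum(1 for s in samples if t < s <= 4 * t)
        return (satisfied + tolerating / 2) / len(samples)

    uniform_slow = [1.0] * 6        # everyone is mildly, uniformly slow
    split = [0.2] * 3 + [5.0] * 3   # half delighted, half abandoning the app

    print(apdex(uniform_slow, t=0.5))  # 0.5
    print(apdex(split, t=0.5))         # 0.5 -- identical score, very different users

A histogram of each population makes the difference obvious at a glance; the single blended number cannot.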

AppDynamics employs automated baselines for every metric collected and measures deviations out of the box. We also support static SLA thresholds as needed. Visually, AppDynamics has a lot of options, including viewing data in histograms, looking at percentiles, and providing an advanced analytics platform for whatever use cases our users come up with. We believe these are valid alternatives to building a product extensively around Apdex, which has its own set of downsides.
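Percentiles in particular are easy to reason about compared to a blended index. A dependency-free nearest-rank sketch (illustrative only, not product code):

    import math

    def percentile(samples, p):
        # Nearest-rank percentile: small, dependency-free, easy to audit.
        ordered = sorted(samples)
        k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
        return ordered[k]

    latencies = [0.2, 0.3, 0.3, 0.4, 0.6, 0.9, 1.4, 2.2, 3.8, 7.5]
    for p in (50, 95, 99):
        print(f"p{p}: {percentile(latencies, p)}s")  # p50: 0.6s; p95/p99 land on the tail

Unlike a single index, the p50/p95/p99 spread tells you directly whether the slow tail is a few outliers or a real population of suffering users.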

