Monday, May 4, 2015

The End of my Affair with Apdex

A decade ago, when I first learned of Apdex, it was thanks to a wonderful technology partner, Coradiant. At the time, I was running IT operations and web operations, and brought Coradiant into the fold. Coradiant was ahead of its time, providing end-user experience monitoring capabilities via packet analysis. The network-based approach was effective in a day when the web was less rich. Coradiant was one of the first companies to embed Apdex in its products.

As a user of APM tools, I was looking for the ultimate KPI, and the concept of Apdex resonated with me and my senior management. A single magical number gave us an idea of how well development, QA, and operations were doing in terms of user experience and performance. Had I found the metric to rule all metrics? I thought I had, and I was a fan of Apdex for many years leading up to 2012, when I started to dig into the true calculations behind this magical number.

As my colleague Jim Hirschauer pointed out in a 2013 blog post, the Apdex index is calculated by putting the number of satisfied versus tolerating requests into a formula. The definition of a user being "satisfied" or "tolerating" has to do with a lot more than just performance, but the applied use cases for Apdex are unfortunately focused on performance only. Performance is still a critical criterion, but the definition of satisfied or tolerating is situational.

I'm currently writing this from 28,000 feet above northern Florida, over barely usable in-flight internet, which makes me wish I had a 56k modem. I am tolerating the latency and bandwidth, but not the $32 I paid for this horrible experience , but hey, at least Twitter and email work. I self-classify as an "un-tolerating" user, but I am happy with some connectivity. People who know me will tell you I have a bandwidth and network problem. Hence, my level of a tolerable network connection is abnormal. My Apdex score would be far different than the average user due to my personal perspective, as would the business user versus the consumer, based on their specific situation as they use an application. Other criteria that affect satisfaction include the type of device in use and connection type of that device.
The thing that is missing from Apdex is the notion of a service level. There are two ways to manage service level agreements. First, a service level may be calculated, as we do at AppDynamics with our baselines. Secondarily, it may be a static threshold, which the customer expects; we support this use case in our analytics product. These two ways of calculating an SLA cover the right ways to measure and score performance.

This is AppDynamics’ Transaction Analytics Breakdown for users who had errors or poor user experience over the last week, and their SLA class:

Simplistic SLAs are in the core APM product. Here is a view showing requests that were below the calculated baseline, showing which were in SLA violation.

The notion of combining an SLA with Apdex will result in a meaningful number being generated. Unfortunately, I cannot take credit for this idea. Alain Cohen, one of the brightest minds in performance analysis, was the co-founder and CTO (almost co-CEO) of OPNET. Alain discussed his ideas with me around this new performance index concept called OpDex, which fixes many of the ApDex flaws by applying an SLA. Unfortunately, Alain is no longer solving performance problems for customers; he's decided to take his skills and talents elsewhere after a nice payout.

Alain shared his OpDex plan with me in 2011; thankfully all of the details are outlined in this patent, which was granted in 2013. But OPNET's great run of innovation has ended, and Riverbed has failed to pick up where they left off, but at least they have patents to show for these good ideas and concepts.

The other issue with Apdex is that users are being ignored by the formula. CoScale outlined this issues in a detailed blog post.They explain that histograms are far better ways to analyze a variant population. This is no different than looking at performance metrics coming from the infrastructure layer, but the use of histograms and heat charts tend to provide much better visual analysis.

AppDynamics employs automated baselines for every metric collected, and measures based on deviations out of the box. We also support static SLA thresholds as needed. Visually, AppDynamics has a lot of options including viewing data in histograms, looking at percentiles, and providing an advanced analytics platform for whatever use cases our users come up with. We believe these are valid approaches to the downsides of using Apdex extensively in a product, which has it’s set of downsides.

No comments: