Wednesday, May 20, 2015

Performance Monitoring, another view

Originally posted on devops.com: http://devops.com/2015/05/20/performance-monitoring-another-view/

This is my first post to DevOps.com. For those who haven't read my other writing, I currently work as the VP of Market Development and Insights at AppDynamics. I joined in February 2015 after four years at Gartner as a Research VP covering all things monitoring. During my time as a Gartner analyst I helped buyers make decisions about purchasing monitoring software. When I read articles on the internet that blatantly disregard best practices, I either smell that something is fishy (sponsorship) or conclude the author simply didn't follow a process. Conversely, some people do follow a proper process. I wrote this up specifically because I read a really fishy article.

Step 1 – Admit it

The first step to solving a problem is admitting you have one. Most monitoring tool buyers realize they have too many tools, and that none of them help isolate root cause. This is a result of most buying happening at the infrastructure layer, while the application layer is either lightly exercised synthetically or disregarded altogether. People soon realize they must find an application-layer monitoring tool or write their own instrumentation. The vendors who sell logging software would love for you to generate a lot of rich logs so they can help you solve problems, but realistically this doesn't work in most enterprises: they run a combination of custom and packaged software, and forcing developers to write consistent logging is a frustrating and futile task.

Step 2 – Document it

The next phase is to document what you are using today. Most organizations have over 10 monitoring tools, and they typically want to buy another one. As they consider APM tools, they should document what they have today and determine a plan to consolidate and eliminate redundant tools. Tools will remain stuck within silos until you can fix the organizational structure.

Step 3 – Requirements

I can't tell you how many times vendors set requirements for buyers. You can tell when vendors do this rather than actually understanding the requirements the user has, whether the solution will scale, and whether the team can operate it. Purchasing a tool that requires full-time consultants on staff to keep it running is no longer feasible today; sorry to the old big ITOM vendors (formerly the big four), but people don't want that anymore.

Step 4 – Test it!

If you are evaluating software you MUST test it, and make sure it works in your environment. Better yet, load test the software to determine whether the vendor's claims of overhead and visibility are in fact true. I see far too many people buy software they never test.
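To make the load-testing point concrete, here is a minimal sketch of the kind of before/after comparison that exposes agent overhead. It is plain Python with hypothetical URLs and request counts; a real evaluation would use a proper load generator, realistic traffic, and a controlled environment, but the principle is the same: measure the same transaction with and without the agent and compare the deltas against the vendor's claims.

    # Minimal overhead check: drive the same endpoint with and without the
    # monitoring agent attached, then compare latency percentiles.
    # URLs and request counts are placeholders for your own test setup.
    import statistics
    import time
    import urllib.request

    def measure(url, requests=500):
        samples_ms = []
        for _ in range(requests):
            start = time.perf_counter()
            urllib.request.urlopen(url).read()
            samples_ms.append((time.perf_counter() - start) * 1000.0)
        samples_ms.sort()
        return {
            "p50": statistics.median(samples_ms),
            "p95": samples_ms[int(len(samples_ms) * 0.95) - 1],
        }

    # Run once against the uninstrumented deployment, once with the agent
    # enabled, and compare the difference against the vendor's overhead claims.
    baseline = measure("http://test-app.internal/checkout")          # hypothetical
    with_agent = measure("http://test-app-agent.internal/checkout")  # hypothetical
    print("p95 overhead (ms):", with_agent["p95"] - baseline["p95"])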
Getting back to the article which prompted this post:
The author is located in Austria, and I believe there is a connection to Dynatrace; Dynatrace's R&D and technical leadership are in Austria, but this is just conjecture. The article is written with a bias that is obvious to any reader who has used these tools. The evaluator did not test these products, or if he did, he had no testing methodology.
Finally, the author's employer is new, and all of its employees have come from Automic Software in the past year. Those who work with Automic won't need me to explain the state of things there. http://www.glassdoor.com/Reviews/Automic-Software-Reviews-E269431.htm#
I think I've spent enough time on the background; here are my specific issues with the content.
The screenshots are taken from various websites and are neither current nor accurate. If someone writes an evaluation of a technology, at least test it and use screenshots from that testing!
This post was specifically meant to discuss DevOps, yet the author doesn't discuss the need to monitor microservices or asynchronous application patterns, which are clearly the current architecture of choice. This is especially the case in DevOps or continuous release environments. In fact, Microsoft announced new products at Ignite last week, including Nano Server and Azure Service Fabric, for building these patterns. The difficulty with microservices is that each external request spawns a large number of internal requests, which often start and stop out of order, making them hard to correlate. This causes major issues for APM products, and very few handle it effectively today.
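To illustrate why this is hard (a toy sketch, not any vendor's actual implementation), consider how the spans from a single external request arrive: out of order, from different services, tied together only by correlation IDs that the monitoring layer has to stitch back into a call tree before latency can be attributed correctly.

    # Toy illustration of asynchronous fan-out: internal spans complete out of
    # order and must be reassembled per trace using correlation (trace/parent)
    # IDs. Field names and data are illustrative only.
    from collections import defaultdict

    spans = [  # completion order != causal order
        {"trace": "t1", "id": "c", "parent": "a", "name": "inventory-svc", "ms": 40},
        {"trace": "t1", "id": "a", "parent": None, "name": "checkout-api", "ms": 180},
        {"trace": "t1", "id": "b", "parent": "a", "name": "payment-svc",  "ms": 95},
        {"trace": "t1", "id": "d", "parent": "b", "name": "fraud-check",  "ms": 60},
    ]

    def assemble(spans):
        """Rebuild the parent/child call tree from out-of-order spans."""
        children = defaultdict(list)
        for s in spans:
            children[s["parent"]].append(s)

        def render(span, depth=0):
            print("  " * depth + f'{span["name"]} ({span["ms"]} ms)')
            for child in children[span["id"]]:
                render(child, depth + 1)

        for root in children[None]:  # entry-point span(s)
            render(root)

    assemble(spans)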
The article calls out user interfaces and deployment models inconsistently, and fails to mention that Dynatrace is an on-premises software product. Dynatrace also ships a thick Java client built on Eclipse. That is the definition of a heavy client, something the author calls out for other products in this article.
Continuing the factual errors in this article, the author calls out Dell (formerly Quest) Foglight. This is not Dell's current product for modern architectures. Dell has moved to a SaaS product called Foglight APM SaaS, which is not what was evaluated. The evaluator should look at the current product (noticing a trend here).
Finally the AppDynamics review is inaccurate:
Editor's note: the following reflects this author's opinions, not those of DevOps.com. As the author states, he is employed by AppDynamics.
“code level visibility may be limited here to reduce overhead.”
To scale AppDynamics to monitor thousands of systems with a single on-premises controller (something no other APM vendor delivers), the default is advanced smart instrumentation, but every transaction is still monitored and measured. All products limit the amount of data they capture in some way. AppDynamics has several modes, including one that captures full call traces for every execution.
“lack of IIS request visibility”
AppDynamics provides excellent visibility into IIS requests; I'm not sure what the author means here. AppDynamics also supports installation on Azure via NuGet.
“features walk clock time with no CPU breakdown”
Once again, I’m not sure what the author is referring to here. The product provides several ways of viewing CPU usage.
“Its flash-based web UI and rather cumbersome agent configuration”
The agent installation is no different from other products', which shows the author has not installed the product. Finally, the UI is HTML5 and has been for a while; there are still some Flash views in configuration screens, but those are being removed with each release. I'd much rather have a web-based UI with a little remaining Flash than a fat-client UI requiring a large download.
AppDynamics has a single UI that goes far deeper into database performance than the other products in this review. AppDynamics also provides far more capabilities than were highlighted in this article, including synthetic monitoring. AppDynamics delivers the same software via SaaS or on-premises installation.
Finally, this review did not look at analytics, which is clearly an area of increasing demand within APM. In any case, this review is far from factual or useful.
Hopefully this sets some of this straight. Please leave comments here, or reach me @jkowall on Twitter!
Thanks.

Thursday, May 7, 2015

An Open Response to the Open Letter To Monitoring/Metrics/Alerting Companies


John, we couldn't agree more with this letter; the specific issue is context. Too many of the tools and systems in use, both commercial and open source, are metric or event (log) collectors, providing dashboards with little context about what is happening. In order to provide proper context and operational visibility, one must understand the relationships and data flows between metrics and events.

This well-written letter makes many points we completely agree with at AppDynamics. Words like "predictive" or "fixing issues automatically" are not something we prescribe. Gartner has also long condoned the use of "predictive" in ITOA scenarios ("IT Operations Analytics Technology Requires Planning and Training," Will Cappelli, December 2012). Where we disagree is on early warning indicators of escalating problems. If technology is employed that collects end-user experience from the browser, and that performance is baselined by geography, then degradation across the user community is often an early warning that something is behaving abnormally. We have customers who have seen a vast reduction in complete outages (P1 issues) and an increase in degraded-service issues (P2 issues). This is evidence that the use of AppDynamics can in fact reduce the number of outages by providing early warning indicators.

We have other evidence showing that legacy enterprise monitoring tools are far too slow; this is a coupling of older technology with organizational or process issues, which prevents alerts from getting into the right hands in a timely manner. For example, in an enterprise with siloed teams and tools, a storage contention bottleneck on a particular array would often be seen by the storage team, but the lack of application operations visibility and timely escalation would result in service issues. This can of course be solved by fixing organizational issues, but that is a challenge at scale.
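As a rough sketch of the early-warning idea (simplified, and not AppDynamics' actual baselining algorithm), the mechanics are straightforward: keep a per-geography baseline of browser response times and flag any interval that deviates well beyond the norm, before the degradation becomes a full outage.

    # Illustrative only: flag per-geography deviations from a simple baseline.
    import statistics

    history = {  # response-time samples (ms) per geography, from prior intervals
        "us-east": [310, 295, 330, 305, 320, 300, 315],
        "eu-west": [420, 410, 435, 400, 415, 425, 430],
    }
    current = {"us-east": 340, "eu-west": 690}  # latest interval

    def early_warnings(history, current, sigmas=3.0):
        warnings = []
        for geo, samples in history.items():
            mean = statistics.mean(samples)
            stdev = statistics.stdev(samples)
            if current[geo] > mean + sigmas * stdev:
                warnings.append(f"{geo}: {current[geo]} ms vs baseline {mean:.0f} ms")
        return warnings

    for w in early_warnings(history, current):
        print("DEGRADATION:", w)  # only eu-west trips the threshold here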

Monday, May 4, 2015

CIO: Changing Sourcing Strategies

Sourcing strategies are evolving rapidly. CIOs have to handle this shift, in which innovation isn't coming from the usual vendors. There is a clear movement toward vendors who can facilitate the most critical agendas of digital business. New organizational structures are required that allow for a culture of innovation and experimentation. To innovate, the larger, slower-moving units that typically make up IT must transform into small, startup-style or project-based teams.

AppDynamics CEO Jyoti Bansal often speaks about his methodology for creating a startup within a startup. Many new product initiatives begin with a small team of two to four engineers and product managers tasked with building a minimum viable product. In this model, startup teams are augmented with small acquisitions that bring talent. This new talent provides new perspectives, skills, and core intellectual property, allowing the acquirer to meet customer needs quickly. This is similar to the model employed at Google, Facebook, Yahoo, or Twitter, allowing for incubation and experimentation with new, unorthodox ideas. At Google, some of these new ideas become winners like Gmail and Google Apps for Business, while many fail. Some of these new products take several iterations before they can be productized. Google Talk is one example that was allowed to “sunset” but was reborn as Google Hangouts, which has seen rampant adoption in its second iteration with the integration of other services. Google Hangouts on Android has over 500 million installations.

This is the new pace of innovation, something CIOs must now replicate in their own operations. A Gartner headline says, "Digital Business Economy is Resulting in Every Business Unit Becoming a Technology Startup." What this means is that if you do not evolve your organization, thinking, and sourcing, you will be unable to compete. A furious strategy of large acquisitions results in non-integrated technologies that fail to compete at a time when technology is a major advantage in digital transformations.
Within IT operations management, an area I've studied and tracked for over two decades, we saw many large acquisitions over the years by HP, IBM, BMC, and CA. These acquisitions provided growth and breadth by adding new capabilities that were slotted into a portfolio. That growth strategy, which had once been effective, ended up creating too much complexity. The complexity required services, which in turn led to challenges in staying current.

This technology debt and lag do not meet the needs of today's technology buyers. The large acquisition strategy has slowly fallen off; meanwhile, these big players have had major challenges making the transition to organic innovation. Gartner predicts that "By 2017, at least two of the "big four" IT operations management vendors will cease to be viable strategic partners for 60 percent of I&O organizations." We've already seen this play out with the privatization of yesterday's innovators and ongoing execution issues with others.
New, innovative upstarts in ITOM run at a different speed, allowing them to create, build, and provide the value customers want without the heavy services, consulting, and integration burdens. These new vendors have highly differentiated strategies and approaches that keep them innovative and give them a competitive advantage.
The question remains for CIOs: of these innovative vendors, which will be able to continue down the innovation path as they grow from medium-sized companies into large ones? As the revenues of these innovative vendors continue to increase substantially, innovation at scale remains an open question. But at AppDynamics, we're focused on the culture and strategy needed to enable and retain our innovation as we build the next generation of ITOM technology for software-defined businesses.

The End of my Affair with Apdex

A decade ago, when I first learned of Apdex, it was thanks to a wonderful technology partner, Coradiant. At the time, I was running IT operations and web operations, and brought Coradiant into the fold. Coradiant was ahead of its time, providing end-user experience monitoring capabilities via packet analysis. The network-based approach was effective in a day when the web was less rich. Coradiant was one of the first companies to embed Apdex in its products.

As a user of APM tools, I was looking for the ultimate KPI, and the concept of Apdex resonated with me and my senior management. A single magical number gave us an idea of how well development, QA, and operations were doing in terms of user experience and performance. Had I found the metric to rule all metrics? I thought I had, and I was a fan of Apdex for many years leading up to 2012, when I started to dig into the true calculations behind this magical number.

As my colleague Jim Hirschauer pointed out in a 2013 blog post, the Apdex index is calculated from the counts of satisfied and tolerating requests relative to the total sample. The definition of a user being "satisfied" or "tolerating" should involve a lot more than just performance, but the applied use cases for Apdex unfortunately focus on performance only. Performance is still a critical criterion, but the definition of satisfied or tolerating is situational.
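For readers who have not dug into it, the standard Apdex arithmetic is simple, which is part of the problem. A small Python sketch with illustrative sample data: a request is "satisfied" if it completes within the target time T, "tolerating" if it completes within 4T, and "frustrated" beyond that.

    # Standard Apdex formula: (satisfied + tolerating / 2) / total,
    # where satisfied means response <= T and tolerating means T < response <= 4T.
    # T is the target threshold you pick, which is exactly the situational
    # judgment discussed above.
    def apdex(response_times_ms, t_ms):
        satisfied = sum(1 for r in response_times_ms if r <= t_ms)
        tolerating = sum(1 for r in response_times_ms if t_ms < r <= 4 * t_ms)
        return (satisfied + tolerating / 2.0) / len(response_times_ms)

    samples = [120, 250, 480, 900, 1600, 5200]  # ms, illustrative
    print(apdex(samples, t_ms=500))  # 0.67: 3 satisfied, 2 tolerating, 1 frustrated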

I'm currently writing this from 28,000 feet above northern Florida, over barely usable in-flight internet that makes me wish I had a 56k modem. I am tolerating the latency and bandwidth, but not the $32 I paid for this horrible experience; but hey, at least Twitter and email work. I self-classify as an "un-tolerating" user, but I am happy with some connectivity. People who know me will tell you I have a bandwidth and network problem; hence, my threshold for a tolerable network connection is abnormal. My Apdex score would be far different from the average user's due to my personal perspective, just as a business user's would differ from a consumer's, based on their specific situation as they use an application. Other criteria that affect satisfaction include the type of device in use and its connection type.
The thing that is missing from Apdex is the notion of a service level. There are two ways to manage service level agreements. First, a service level may be calculated, as we do at AppDynamics with our baselines. Second, it may be a static threshold that the customer expects; we support this use case in our analytics product. These two ways of deriving an SLA cover the right ways to measure and score performance.
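A rough sketch of the difference between the two (not AppDynamics' actual baseline math): the request is judged the same way either way, and only the source of the threshold changes.

    # Two ways to derive the SLA threshold a request is judged against:
    # a static value the customer expects, or one calculated from history.
    # A mean-plus-two-standard-deviations baseline stands in here for a real
    # baselining algorithm; the data is illustrative.
    import statistics

    history_ms = [180, 200, 220, 210, 190, 205, 215, 195, 230, 185]

    static_sla_ms = 250  # what the customer expects, contractually
    calculated_sla_ms = statistics.mean(history_ms) + 2 * statistics.stdev(history_ms)

    def violates(response_ms, threshold_ms):
        return response_ms > threshold_ms

    for r in (240, 320):
        print(f"{r} ms -> static: {violates(r, static_sla_ms)}, "
              f"baseline: {violates(r, calculated_sla_ms)}")
    # 240 ms passes the static SLA but trips the calculated baseline (~235 ms).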

This is AppDynamics’ Transaction Analytics Breakdown for users who had errors or poor user experience over the last week, and their SLA class:


Simplistic SLAs are in the core APM product. Here is a view of requests that fell below the calculated baseline, indicating which were in SLA violation.

Combining an SLA with Apdex would produce a meaningful number. Unfortunately, I cannot take credit for this idea. Alain Cohen, one of the brightest minds in performance analysis, was the co-founder and CTO (almost co-CEO) of OPNET. Alain discussed with me his ideas around a new performance index concept called OpDex, which fixes many of the Apdex flaws by applying an SLA. Unfortunately, Alain is no longer solving performance problems for customers; he's decided to take his skills and talents elsewhere after a nice payout.

Alain shared his OpDex plan with me in 2011; thankfully, all of the details are outlined in this patent, which was granted in 2013. OPNET's great run of innovation has ended, and Riverbed has failed to pick up where they left off, but at least they have patents to show for these good ideas and concepts.

The other issue with Apdex is that users are ignored by the formula. CoScale outlined these issues in a detailed blog post. They explain that histograms are a far better way to analyze a highly variable population. This is no different from looking at performance metrics coming from the infrastructure layer, where histograms and heat charts tend to provide much better visual analysis.
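A quick sketch of that point (plain Python with illustrative data, not CoScale's tooling): the two populations below have the same mean of roughly 400 ms, but a coarse histogram immediately shows that the second has a slow tail the first does not, which a single index would hide.

    # Bucket response times into a coarse histogram; the shape reveals what a
    # single summary number cannot.
    from collections import Counter

    steady  = [380, 410, 395, 420, 405, 390, 415, 400, 385, 410]  # ms
    bimodal = [150, 160, 155, 145, 165, 150, 760, 780, 770, 775]  # ms

    def histogram(samples, bucket_ms=250):
        counts = Counter((s // bucket_ms) * bucket_ms for s in samples)
        for lo in sorted(counts):
            print(f"{lo:>5}-{lo + bucket_ms:<5} {'#' * counts[lo]}")

    for name, samples in (("steady", steady), ("bimodal", bimodal)):
        print(f"{name}: mean = {sum(samples) / len(samples):.0f} ms")
        histogram(samples)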

AppDynamics employs automated baselines for every metric collected and measures deviations out of the box. We also support static SLA thresholds as needed. Visually, AppDynamics offers a lot of options, including viewing data in histograms, looking at percentiles, and providing an advanced analytics platform for whatever use cases our users come up with. We believe these are valid alternatives to relying heavily on Apdex, given its downsides.