
An Open Response to the Open Letter To Monitoring/Metrics/Alerting Companies


John, we couldn’t agree more with this letter; the specific issue is context. Too many of the tools and systems in use, both commercial and open source, are metric or event (log) collectors, providing dashboards with little context about what is happening. To provide proper context and operational visibility, one must understand the relationships and data flows between metrics and events.

This well-written letter makes many points we completely agree with at AppDynamics. Words such as “predictive” or “fixing issues automatically” are not something we prescribe. Gartner has also long condoned the use of “predictive” in ITOA scenarios (“IT Operations Analytics Technology Requires Planning and Training,” Will Cappelli, December 2012). The area where we disagree is having early warning indicators of problems which are escalating. If technology is employed that collects end-user experience from the browser, and that performance is baselined by geography, degradation across the user community is often an early warning indicator that something is behaving abnormally. We have customers who have seen a vast reduction in complete outages (P1 issues) and an increase in degraded-service issues (P2 issues). This means we have evidence that the use of AppDynamics can in fact reduce the number of outages by providing early warning indicators. We have other evidence showing that legacy enterprise monitoring tools are far too slow; this is a coupling of older technology with organizational or process issues, which prevents alerts from getting into the right hands in a timely manner. For example, in an enterprise with siloed teams and tools, a storage contention bottleneck on a particular array would often be seen by the storage team, but the lack of application operations visibility and timely escalation would result in service issues. This can of course be solved by fixing organizational issues, but that is a challenge at scale.
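As a rough illustration of that per-geography baselining idea (my own sketch, not AppDynamics’ actual implementation), assume browser-measured page-load times grouped by geography; the window size, sample minimum, and 3-sigma rule are invented for the example:

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical per-geography baseline for browser (RUM) page-load times.
# Window size, sample minimum, and the 3-sigma rule are illustrative assumptions.
class GeoBaseline:
    def __init__(self, min_samples=30, sigma=3.0):
        self.history = defaultdict(list)   # geography -> recent load times (ms)
        self.min_samples = min_samples
        self.sigma = sigma

    def observe(self, geography, load_time_ms):
        """Record one end-user measurement; return a warning if it deviates."""
        samples = self.history[geography]
        warning = None
        if len(samples) >= self.min_samples:
            mu, sd = mean(samples), stdev(samples)
            if sd > 0 and load_time_ms > mu + self.sigma * sd:
                # Degradation against this geography's own baseline: the kind of
                # early-warning signal that can surface before a full P1 outage.
                warning = f"{geography}: {load_time_ms:.0f}ms vs baseline {mu:.0f}ms"
        samples.append(load_time_ms)
        del samples[:-500]   # keep a rolling window
        return warning

baseline = GeoBaseline()
for t in [410, 395, 430, 405] * 10:
    baseline.observe("us-east", t)
print(baseline.observe("us-east", 1200))   # flags us-east only; other regions stay quiet
```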

Comments

Anonymous said…
First, thanks for the response! My post was intended to get into these topics in more depth, so thanks for indulging.

Some clarification, I think, is needed. Collecting metrics, analyzing them, displaying them, and alerting on them is a requirement. Whether they are early (“leading”) indicators is context-specific, of course. I suspect that this isn’t controversial.

When you say that you have customers who have seen a “vast reduction in complete outages (P1 issues), and an increase in degraded service issues (P2 issues)”: you don’t have evidence that AppDynamics is reducing them. You have evidence (as you actually state) that the *use* of AppDynamics *can* reduce the number of outages.

I’m betting that such reduction came as a result of smart people using AppDynamics in addition to all of the other tools at their disposal.

This point is the one I’m making: the teams that are using your tool (amongst many) *are* the source of that reduction. Your tool is a tool. To the extent that the tool helps the teams, great. But that’s about the team, not the tool itself. This is in sharp contrast to what many services claim.

AppDynamics is a good one, from what I remember seeing, but it’s a tool to be used by engineers who know more about their application and systems than it does. Your tool will reveal what it’s been told to reveal. It’s told where to look, what to look for, and what data to collect. It can try to guess, based on what you tell it, what is good and what is not good; it doesn’t know this itself. It knows what it’s supposed to do based on how its user imagined it would be needed at the time it was set up, which is helpful but not omniscience. :)

Because it’s not omniscient, engineers *adapt* to what they think the tool does well, and then they lean on that.

They also *adapt* to what it doesn’t do well, and then they lean on other tools to fill that gap.

Just as a team of network, database, application, or front-end engineers will defer to each other’s expertise during outage response, so it goes with tools; this is what I meant by viewing monitoring tools as members of your team.

This is why the “single pane of glass” or “all in one place” pitch is disingenuous. I don’t know whether AppDynamics uses this phrase, but it’s one that makes me nervous every time I hear it, and makes me not want to look further into the product.

When I said that anomaly detection is an unsolved problem, I was pointing out that data collected by a tool only becomes information when it's used in conjunction with other data that the tool doesn't have visibility into. For example, with alerts: false positives and false negatives will always be a reality. The way we deal with them is through *inductive* reasoning, not *deductive* reasoning. Which is why no tool will be the single and all-knowing eye.
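A toy sketch of that point, with invented service names and thresholds: a bare error-rate threshold alert, and the same alert interpreted alongside a deploy log the monitoring tool cannot see.

```python
from datetime import datetime, timedelta

# Hypothetical: a bare threshold alert versus the same alert interpreted with
# context the tool can't see (a deploy log). Names and thresholds are invented.
ERROR_RATE_THRESHOLD = 0.05

def raw_alert(error_rate):
    """All the tool alone can conclude: the threshold was crossed, or it wasn't."""
    return error_rate > ERROR_RATE_THRESHOLD

def interpret(error_rate, ts, deploys, window_min=10):
    """What an engineer does: combine the alert with outside context."""
    if not raw_alert(error_rate):
        return "no alert"
    recent_deploy = any(abs((ts - d).total_seconds()) < window_min * 60 for d in deploys)
    # This is an inductive call, not a deduction: a spike right after a deploy is
    # *probably* the deploy, which the tool cannot know by itself.
    if recent_deploy:
        return "likely deploy-related: page the deploying team"
    return "unexplained error spike: page on-call"

now = datetime(2013, 5, 1, 12, 0)
deploys = [now - timedelta(minutes=4)]
print(raw_alert(0.08))                 # True -- that is everything the tool knows
print(interpret(0.08, now, deploys))   # the engineer turns the data into information
```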

I agree that visibility is key, and that organizational structures can indeed be a blocker for gaining that visibility across teams. But a false alert landing in any team’s hands is still an issue that teams have to solve, with tools; the tools can’t solve it on their own.

See what I mean?
Unknown said…
Thanks for the detailed reply, John! First, I think we need a better forum for discussing these things than blogs and Twitter. Maybe we need a MonitoringSucks LinkedIn group?

I’m happy to show you the current state of AppDynamics; I think you’ll be impressed by what’s happened if you haven’t seen or heard from us in a couple of years. Things have evolved considerably, and that is the reason I am here. I’m also happy to put you in touch with one of the many customers who have seen the movement from P1 issues toward P2 issues. The reason, you’ll find, is that most of the tooling people have in place today is infrastructure-focused. If the group is advanced enough (as you have done at Etsy), recording business and application metrics and monitoring those can help. The problem is that all of those systems lack context: looking at metrics alone gives no visualization of data flow, e.g. transactional, application, or infrastructure topology. This makes reconstructing or visualizing issues a major challenge, and it is why monitoring today consists of dashboard overload.
To draw an analogy: although mission control at NASA consists of a lot of metrics, graphs, and data, the main view is still a visual representation of the mission status. The equivalent in monitoring should be a topology!

http://astronomy.snjr.net/blog/wp-content/uploads/2012/05/Houston-.jpg
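A minimal sketch of what that topology context could look like in data (service names and latencies invented for illustration): services as nodes, calls as edges carrying observed latency, so a slow business transaction can be followed along the data flow rather than read off a wall of flat dashboards.

```python
# Hypothetical transaction topology: services as nodes, calls as edges carrying
# observed latency. Service names and numbers are invented for illustration.
topology = {
    ("web", "checkout-svc"): 42,          # average call latency, ms
    ("checkout-svc", "payment-svc"): 310,
    ("checkout-svc", "inventory-db"): 8,
}

def slowest_hop(topology):
    """Walk the edges and report the dominant contributor to end-to-end latency."""
    (src, dst), latency = max(topology.items(), key=lambda kv: kv[1])
    return f"{src} -> {dst} dominates at {latency}ms"

print(slowest_hop(topology))
# Per-host dashboards can show the same raw numbers, but without the edges
# there is no data flow to follow back to the slow hop.
```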

The point is that by up-leveling the conversation beyond the infrastructure, the team is actually focused on fixing application issues, not infrastructure issues. Prior to using APM they didn’t have the right visibility or visualizations (if they were using older APM technologies).

We are actually trying to build a unified monitoring tool that isn’t a dumping ground for data, which is how prior approaches worked. We are going to replace other monitoring tools and provide an application-context view of the infrastructure. When you are diagnosing a user or application issue, the infrastructure components actually touched by that business transaction will be what is shown, avoiding the issues of the past. This lofty goal is why I came here. If we can do this, we will change the reasons monitoring sucks today. I do think this can fix many of the issues, not to mention the collaboration capabilities we launched toward the end of last year. Let me know if you are interested and I can show you what we’ve done and how we’re approaching these problems in future product.
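A rough sketch of that application-context idea under stated assumptions (every name below is hypothetical): keep a mapping from each business transaction to the infrastructure it actually touches, and surface only that slice while diagnosing.

```python
# Hypothetical mapping from business transactions to the infrastructure
# components they touch; every name here is invented for illustration.
transaction_infra = {
    "checkout": ["web-pool-a", "checkout-svc", "payment-svc", "oracle-db-3", "san-array-7"],
    "search":   ["web-pool-a", "search-svc", "solr-cluster-1"],
}

def infra_for(transaction):
    """While diagnosing a slow transaction, show only the components it touches."""
    return transaction_infra.get(transaction, [])

# A storage bottleneck on san-array-7 matters to whoever is debugging checkout,
# not search -- the application context a siloed storage view lacks.
print(infra_for("checkout"))
```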
