Thursday, May 7, 2015

An Open Response to the Open Letter To Monitoring/Metrics/Alerting Companies

John, we couldn’t agree more with this letter; the specific issue is context. Too many of the tools and systems in use, both commercial and open source, are metric or event (log) collectors, providing dashboards with little context about what is happening. To provide proper context and operational visibility, one must understand the relationships and data flows between metrics and events.

This well-written letter makes many points we completely agree with at AppDynamics. Words like “predictive” or “fixing issues automatically” are not something we prescribe, and Gartner has long cautioned against the use of “predictive” in ITOA scenarios (“IT Operations Analytics Technology Requires Planning and Training,” Will Cappelli, December 2012). The area where we disagree is early warning indicators of escalating problems. If technology is employed that collects end-user experience from the browser, and that performance is baselined by geography, then degradation across the user community is often an early warning indicator that something is behaving abnormally.

We have customers who have seen a vast reduction in complete outages (P1 issues) and an increase in degraded-service issues (P2 issues). This means we have evidence that the use of AppDynamics can in fact reduce the number of outages by providing early warning indicators. We also have evidence that legacy enterprise monitoring tools are far too slow; this is a coupling of older technology with organizational or process issues, which prevents alerts from reaching the right hands in a timely manner. For example, in an enterprise with siloed teams and tools, a storage contention bottleneck on a particular array would often be seen by the storage team, but the lack of application operations visibility and timely escalation would result in service issues. This can of course be solved by fixing organizational issues, but that is a challenge at scale.
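The per-geography baselining described above can be sketched in a few lines. This is a minimal illustration, not AppDynamics’ actual algorithm; the data and the three-sigma rule are assumptions for the example:

```python
from statistics import mean, stdev

def build_baselines(history):
    """Compute a per-geography latency baseline (mean and standard
    deviation) from historical samples: {geo: [latency_ms, ...]}."""
    return {geo: (mean(vals), stdev(vals)) for geo, vals in history.items()}

def degraded_geos(baselines, current, sigmas=3.0):
    """Return geographies whose current latency sits more than `sigmas`
    standard deviations above their baseline mean -- an early warning
    that user experience is degrading there."""
    flagged = []
    for geo, latency in current.items():
        mu, sd = baselines[geo]
        if latency > mu + sigmas * sd:
            flagged.append(geo)
    return flagged

# Hypothetical browser-measured latencies, in milliseconds:
history = {
    "us-east": [120, 130, 125, 128, 122],
    "eu-west": [200, 210, 205, 198, 207],
}
current = {"us-east": 450, "eu-west": 208}
print(degraded_geos(build_baselines(history), current))  # ['us-east']
```

A real system would use rolling windows and seasonality-aware baselines, but the principle is the same: the degradation is flagged while the service is still partially up, i.e. while it is still a P2 rather than a P1.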


John Allspaw said...

First, thanks for the response! My post was intended to open up discussion of these topics in more depth, so thanks for indulging.

Some clarification I think is needed. Collecting metrics and analyzing them, displaying them, and alerting on them is a requirement. To the extent that they are early (“leading”) indicators is context-specific, of course. I suspect that this isn’t controversial.

When you say that you have customers who have seen a “vast reduction in complete outages (P1 issues), and an increase in degraded service issues (P2 issues)” - you don’t have evidence that AppDynamics is reducing them. You have evidence (like you actually state) that the *use* of AppDynamics *can* reduce the number of outages.

I’m betting that such reduction came as a result of smart people using AppDynamics in addition to all of the other tools at their disposal.

This point is the one I’m making: the teams that are using your tool (amongst many) *are* the source of that reduction. Your tool is a tool. To the extent that the tool helps the teams, great. But that’s about the team, not the tool itself. This is in sharp contrast to what many services claim.

AppDynamics is a good one, from what I remember seeing, but it’s a tool to be used by engineers who know more about their application/systems than it does. Your tool will reveal what it’s been told to reveal. It’s told where to look, what to look for, and what data to collect. It can try to guess, based on what you tell it, what is good and what is not good; it doesn’t know this itself. It knows what it’s supposed to do based on how its user imagined it would be needed, at the time it was set up, which is helpful but not omniscience. :)

Because it’s not omniscient, engineers *adapt* to what they think the tool does well, and then they lean on that.

They also *adapt* to what it doesn’t do well, and then they lean on other tools to fill that gap.

Just like a team of network, database, application, or front-end engineers will defer to each other’s expertise, in a time of outage response, this is what I meant by viewing monitoring tools as members of your team.

This is why the “single pane of glass” or “all in one place” pitch is disingenuous. I don’t know that AppDynamics uses this phrase, but it’s one that makes me nervous every time I hear it, and want to not look further into the product.

When I said that anomaly detection is an unsolved problem, I was pointing out that data collected by a tool only becomes information when it's used in conjunction with other data that the tool doesn't have visibility into. For example, with alerts: false positives and false negatives will always be a reality. The way we deal with them is through *inductive* reasoning, not *deductive* reasoning. Which is why no tool will be the single and all-knowing eye.

I agree that visibility is key, and that organizational structures can indeed be a blocker for gaining that visibility across teams. But getting a false alert into any team’s hands is still an issue that teams have to solve, with tools. Tools alone can’t solve that.

See what I mean?

Jonah Kowall said...

Thanks for the detailed reply, John! First, I think we need a better forum for discussing these things than blogs and Twitter. Maybe we need a MonitoringSucks LinkedIn group?

I’m happy to show you the current state of AppDynamics; I think you’ll be impressed by what’s happened if you haven’t seen or heard from us in a couple of years. Things have evolved considerably, which is the reason I am here. I’m also happy to put you in touch with one of the many customers who have seen the movement from P1 issues toward P2 issues. The reason, you’ll find, is that most of the tooling people have in place today is infrastructure-focused. If the group is advanced enough (as you have done at Etsy), recording business and application metrics and monitoring those can help. The problem is that all of those systems lack context: looking at metrics gives no visualization of data flow, e.g. transactional, application, or infrastructure topology. This makes reconstructing or visualizing issues a major challenge, and it is why monitoring today consists of dashboard overload.
To draw an analogy: although NASA’s mission control consists of a lot of metrics, graphs, and data, the main view is still a visual representation of mission status. The monitoring equivalent should be a topology!

The point is that by up-leveling the conversation beyond the infrastructure, the team is actually focused on fixing application issues, not infrastructure issues. Prior to using APM they didn’t have the right visibility or visualizations (if they were using older APM technologies).

We are actually trying to build a unified monitoring tool that isn’t a dumping ground for data, which is how prior approaches worked. We are going to replace other monitoring tools and provide an application-context view of the infrastructure. When you are diagnosing a user or application issue, the actual infrastructure components touched by that business transaction will be what is shown, avoiding the issues of the past. This lofty goal is why I came here. If we can do this, we will change the reasons monitoring sucks today. I do think this can fix many of the issues, not to mention the collaboration capabilities we launched toward the end of last year. Let me know if you are interested; I can show you what we’ve done and how we’re approaching these problems in future product.
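The application-context idea — showing only the infrastructure a given business transaction touches — can be sketched as a graph walk. All of the names below are hypothetical, and this is a simplification of what any real APM product does:

```python
# Hypothetical topology: business transaction -> services -> infrastructure.
topology = {
    "checkout": ["web-frontend", "payment-svc"],
    "web-frontend": ["app-node-1"],
    "payment-svc": ["app-node-2", "db-primary"],
    "db-primary": ["storage-array-7"],
}

def infrastructure_touched(node, topo):
    """Walk the topology to collect every downstream component a
    transaction touches -- the 'application context' view of infra."""
    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        for child in topo.get(current, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return sorted(seen)

print(infrastructure_touched("checkout", topology))
```

Given such a mapping, a degraded “checkout” transaction immediately scopes the investigation to six components — including the storage array from the earlier siloed-teams example — instead of every dashboard in the enterprise.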