Friday, June 30, 2017

Digital Business Operations the new mode for IT Operations

Recent conversations I've had with others in the IT Operations space helped formulate an idea I have been working on for a while concerning the future of Ops. The old mode of operating, you can call it ITIL based, or mode 1, or whatever you prefer is not going to be the primary mode of operations in the long term. Senior leadership has acknowledged this and made concerted efforts to change their habits of hiring, spending, and roadmaps for technology and delivery of new functionality.

More advanced teams operating in an agile manner, or mode 2 consist of smaller more integrated teams made up of individuals with different skills. These teams are meeting today's digital business challenges. In most enterprises, these teams are part of a bimodal strategy, but bridging the gap between mode 1 and mode 2 is something few have solved. I'm personally not sure this is possible due to cultural and fundamental differences in beliefs and trust. In many organizations, there is a high degree of variability on the level of investment between mode 1 and mode 2, but most leaders agree the future is moving more of the staff towards mode 2 due to business demand. Listening to customers is key.

There is a fundamental shift occurring between both modes of operation that we need two terms to explain how these teams operate and what they require from people, process, and technology perspective. Bimodal is not quite bifurcated enough compared to what is happening in these enterprises today. I'm coining the term Digital Business Operations and IT Operations as these two teams. There will be an acceleration of new technologies and capabilities which will further separate the way these teams operate, making unification even more complicated than it is today. Thankfully there will be better infrastructure abstraction technologies which will allow each of these teams to operate independently (naturally for root-cause we'll need bridges, which is what our goal is at AppDynamics). Many believe the answer to this will be the adoption of private and hybrid PaaS technologies, but I find these are too complex, rooted in yesterday's problems. A better lighter weight approach will emerge built upon containers and orchestration making infrastructure abstraction simple versus the complexity we see today in private PaaS.

The changes in infrastructure, application architecture and management are still pockets of enterprises, and often experimental in nature. Similarly digital business and agility must allow for experimentation, but at some point, the experiment solidifies into a core business tenant. This solidification is what will occur within Digital Business Operations which will result in new more repeatable (or industrialized) ways to handle processes, toolchains, and workflows which today are implemented inconsistently between organizations.

Digital Business Operations requires a fundamental change in specific areas we've taken for granted in mode 1 IT Operations. These include process frameworks such as ITIL, service management (ITSM, Ticketing, Bug Tracking), automation, and configuration management (especially the concept of a CMDB). Each one of these is a topic I'll hopefully cover in future posts. I'll share some of the challenges, ideas as to how the vendors may or may not solve these issues, and some insight into what practitioners or first mover organizations are doing to address these problems.

Friday, June 16, 2017

Instrumentation is Changing... Finally!

 I've always been a fan of trying to solve complicated problems. As an end user, I applied various technologies and tools to diagnose some strange ones over the years. Applications have become increasingly decoupled and distributed requiring the monitoring and diagnostics to change significantly. Let's look at a short timeline of the changes which have occurred, why these changes were necessary, how they helped solve a technological shift, and which challenges remain.

How Overcome
Component monitoring
Distributed systems became pervasive
High-end solutions (CA, BMC, HP, BMC) became commoditized (Solarwinds)
Event correlation
Too many monitoring tools (still applies) created information overload
Enterprises rely on antiquated tools, many have given up.
Log analytics
Diagnostics too challenging in distributed systems
Splunk unlocked it, but ELK has commoditized it.
Front-end monitoring
Customer centricity is the key for digital business
Hasn’t yet been well adopted, penetration is still under 20%.
Time-series metrics
Microservices created too many instances creating too much data
Has not been solved, but open source is now growing up to handle scale.
Complex and distributed systems are difficult to diagnose
Beginning to evolve.

In each of these phases we've had commercial innovators, and over time they have been replaced by either open source or commoditized commercial solutions which are basic and low cost. Typically as there were frankly bigger problems to solve.

I would say that 2010 to 2015 were the era of log analytics. At this point the technology is pervasive, 95% of companies I speak to have a strategy. Most of them use a combination of tools, but inexpensive or free software, not necessarily a trait which is shared by hardware or cloud storage, are becoming the norm. Most enterprises are using at least two vendors today for log analytics. Typically one of them is Splunk, and very often the other is ELK. When this market was less mature ELK was more fragmented, but not the solidified ElasticStack platform which Elastic codified has made it a viable alternative.

The era of end-user monitoring peaked in 2015. Everyone was interested, but implementations remain rather small for the use cases of performance monitoring and operational needs. End user experience is a critical area for today's digital businesses, but often requires a level of maturity which is a challenge for most buyers. I do expect this to continue growing, but there will always be a gap between end user monitoring and backend monitoring. Most vendors who have tried to build this as a standalone product have failed to gain traction. A recent example of this is SOASTA, who had good technology, and decent growth, but failed to build a self-sustaining business. Ultimately SOASTA was sold to Akamai.

2016 was the era of time series metrics, where that market peaked, and we saw a massive amount of new technology companies flourish. Examples of this were increased traction by vendors like DataDog, or newer cases including Wavefront, SingalFx, and others. We've just seen the start of M&A in this area Solarwinds buying Librato in 2015, and more recently VMWare buying Wavefront in 2017. We also have the commercial monitoring entrant Circonus who have begun OEMing their time-series database under the IronDB moniker. The next phase of this market will be the maturity of time-series databases in the open source world. Examples may include  Prometheus, but I'm keeping an eye on InfluxDB. As these options mature for monitoring, IoT, and other time-series use cases especially solving easier data storage at scale. The other missing component is front-ends like Grafana improving significantly with workflows and easier usage. Ultimately we may see an ELK like stack emerge in the time-series world, but we'll have to wait and see.

Due to the increase in microservices and the resulting Docker containers, orchestration layers are beginning to take hold in many organizations. As a result of these shifts in application architectures, 2017 seems to have more of a focus on tracing. The importance of tracing is great to see take hold, as root-cause isolation in distributed systems requires good tracing (at least until ML matures in a meaningful way). Tracing also allows you to understand impact when you have service outages or degradations. Additionally, tracing can be used to tie together IT and business metrics and data. The trace is essentially the "glue" of monitoring so to speak. By tagging and tracing you are essentially creating a forensic trail, although this has yet to apply within security, it will! Gartner even began talking about the application to security not long ago in the research note titled Application Performance Monitoring and Application Security Monitoring Are Converging G00324828, Cameron Haight, Neil MacDonald. The detailed tracing is what companies like AppDynamics and Dynatrace do, and has been the core of their technologies since they were founded based on these concepts. These tools solve complex problems faster and perform much more than just tracing, but the trace is the glue in the technology. Unfortunately for buyers, these monitoring and diagnostic technologies come with a typically high price tag, but they are not optional for digital businesses.

Today's open source tracing projects require developers to do work, this is different than how commercial APM tools work which auto instrument and support common application frameworks. The open-source tooling is getting better with standards like OpenTracing, and front-ends such as Zipkin continually evolving. The problem is these technologies still lack the automation for capture that you see in commercial tools. How often do I trace? What do I trace? When do I trace? If you expect developers to make these decisions all the time I think there will be issues. Developers don't understand the macro performance of how their component will fit into the bigger picture. Also once a service is written and reused for other applications, it's hard to understand the performance implications. I am interested and currently experimenting with proxy level tracing to help extend tagging and tracing in areas where you may not want or need heavier agents. I hope to share more on this on a future blog.

The long-term goal, however, is to combine all of these siloed data sets and technologies by using more advanced correlation capabilities, in addition to applying new algorithms and machine learning to the problem, which is currently in its infancy. Over at AppDynamics, we do this based on a trace itself, but we are evolving new capabilities, and doing so in a unified manner, going back to a trace. Monitoring of digital businesses is going to continue to be an exciting space for quite a long time, requiring constant evolution to keep pace with evolving software and infrastructure architectures.

I’ll be giving an updated talk on monitoring and instrumentation which will cover a lot of what is in this article. We will also go into more depth around instrumentation and tracing. I will premiere this new talk this October at the Full Stack Conference October 23rd and 24th 2017 in Toronto for tickets. I am Looking forward to contributing to this great event.