Wednesday, September 20, 2017

Artificial Intelligence in Digital Operations

Not a week goes by without a vendor claiming to have applied Artificial Intelligence (AI) to running digital businesses (a new one appeared this week, though I began writing this beforehand). The list of vendors continues to grow, and when you dig into the technology, the marketing often overstates how much actual AI is involved. Let's take a step back and understand what AI is from a computer science perspective.

According to Wikipedia, the traditional problems (or goals) of AI research include reasoning, knowledge, planning, learning, natural language processing, and perception. The other interesting trend is the AI effect: as we apply models and data to a problem and create algorithms to solve it, the problem tends to move outside the scope of what we call AI. The result is that the term AI is abused to describe much of what we do in technology. Most of the underlying technology is not especially complex when it comes to building self-learning models, yet it is usually described in dense mathematical terms.

Users expect AI to do the work for them: to provide insights and knowledge they may not possess, or that would otherwise require additional learning and study to reach on their own. The net effect is that we expect magic from these intelligent machines, hence the term AI. We see this often in digital assistants, whether the common ones such as Alexa, Google Assistant, Siri, and Cortana or more advanced offerings such as x.ai or zoom.ai.

Unfortunately, outside of a couple dozen companies, very few are building actual AI. Beyond those powerhouses, AI today means programming a system to produce the outcomes we would expect a human to reach, not providing us with additional insight. These weaker "AI" solutions are often implemented as a decision tree with focused knowledge attached along that tree. The downside is that any gap in the decision tree or its knowledge becomes a problem the AI can no longer solve; these expert systems show their seams quickly. The intention is for the AI to adapt, but that has yet to happen in the market. Humans, on the other hand, can react and derive new, creative decisions based on reasoning, broader knowledge (such as technical fundamentals), and perception. Computers do not have these capabilities unless they are explicitly programmed for them, which remains a challenge today. This gap between humans and machines is one reason we do not fully automate heavy machinery when human lives are at risk (aircraft, military systems, cruise ships, and others). People do make mistakes, but they can react to the unknown; computers typically cannot.
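
To make this concrete, here is a minimal sketch (in Python, with made-up symptoms and remedies) of the kind of hand-built decision tree that underpins many of these "AI" offerings; anything not encoded in the tree simply falls through:

```python
# A minimal sketch of rule-based "AI": a hand-built decision tree over a
# narrow knowledge domain. Symptom names and remedies are hypothetical.
RULES = {
    "high_cpu": {
        "gc_time_high": "Tune JVM heap and GC settings",
        "gc_time_normal": "Profile hot code paths",
    },
    "high_latency": {
        "db_slow": "Check the slow query log and indexes",
        "db_normal": "Inspect downstream service calls",
    },
}

def advise(symptom, detail):
    # Any symptom outside the encoded tree falls through: the system has no
    # way to reason its way to a new answer.
    branch = RULES.get(symptom)
    if branch is None or detail not in branch:
        return "No recommendation available"
    return branch[detail]

print(advise("high_cpu", "gc_time_high"))       # Tune JVM heap and GC settings
print(advise("disk_full", "inode_exhaustion"))  # No recommendation available
```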

When it comes to applying "AI" to IT Operations (monitoring and automation are my key focus areas), the industry is immature. For the most part, we've seen textual analysis, anomaly detection, topological analysis, regression analysis, and historical analysis. Some of these can be quite advanced, incorporating supervised learning and neural networks. As more of these capabilities are combined, they can almost seem intelligent, and they do help humans manage today's scale requirements.
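
As a simple illustration of what the anomaly detection piece often amounts to in practice, here is a small sketch that flags metric values deviating sharply from a rolling baseline; the response times and threshold are invented for the example:

```python
# A minimal sketch of the statistical anomaly detection commonly marketed as
# "AI" for operations: flag points that deviate sharply from a rolling baseline.
from statistics import mean, stdev

def anomalies(series, window=20, threshold=3.0):
    flagged = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(series[i] - mu) / sigma > threshold:
            flagged.append((i, series[i]))
    return flagged

response_times_ms = [100, 102, 98, 101, 99] * 5 + [450]  # synthetic spike
print(anomalies(response_times_ms))  # [(25, 450)]
```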

The downsides are that today's AI solutions fail to incorporate general knowledge of how systems operate, they cannot do research or provide advice beyond what has been programmed, and they lack "intelligence" beyond helping humans cope with scale and complexity. Unsupervised learning capabilities are limited today and cannot create newly learned outcomes, only those which have been programmed. These predefined rule books consist of expected outcomes, which the computer then follows, rather than the intelligence required to create new outcomes. Incorporating broader sets of knowledge, such as Stack Overflow, Quora, vendor knowledge bases, and other rich sources accessible on the internet, would allow computers to derive new outcomes and advice. The use of these data sources is in its infancy; once this knowledge becomes integrated and automated, it will provide a massive benefit to those operating complex systems.

As things evolve and real intelligence starts to take hold, this will change, but we are a long way from that point, especially in digital operations today. We have some customers at AppDynamics thinking about the orchestrated enterprise of the future, and how these intelligent agents can help them make better and faster business and operational decisions. We are eagerly working with these customers as a trusted partner, and we look forward to hearing your aspirations and ensuring the future state of AppDynamics aligns with them.

Thursday, August 24, 2017

Digital Business Operations : Fire Your CMDB

As outlined in a prior post, digital business operations require new thinking and new technologies. As operations evolves to meet the needs of digitalization, so too must its core systems of record. When running a data center full of physical assets, or even user assets such as laptops, printers, and desktops, the CMDB was a useful construct for understanding what you had; things were static, so the problem was relatively easy to solve. In reality, almost no one had an accurate CMDB. Coverage most often hovered around 80%, based on the beliefs of staff, and was typically driven by a combination of automated and manual processes. The use cases for the CMDB are usually tied to ITSM processes such as request, incident, problem, and change management. Good data capture recording asset and component ownership and configuration made these processes more robust, reliable, and accurate.

Discovery tools which crawled technologies or leveraged the network made data collection reasonably accurate, leading to a well maintained CMDB. In my personal experience, the network approach worked well (I was a very early nLayers customer at Thomson Reuters), but it had challenges around packet capture and aggregation. Those challenges remain today for any packet-based collection and are significantly worse in public cloud environments, which were not a concern last decade.

When virtualization entered the fold, the number of workloads increased and became more dynamic, but aside from the increase in scale this was not a major problem for existing systems to handle. As applications evolved, configuration moved out of the application server, where things like database connections, connection pools, message queues, Memcached instances, and other components used to be defined. These are now defined in the runtimes themselves, and discovery tools increasingly struggle to collect this configuration information. Most enterprises run dozens of configuration management tools and automation stacks, ranging from legacy solutions provided by BMC, HP, IBM, CA, and others to newer open source tools such as Chef, Puppet, Ansible, Salt, and more. Today these teams are also looking at orchestrating infrastructure, creating yet another layer, with the open source project Terraform as one example. The reason for this fragmentation in configuration management is that evolving applications and their infrastructure still rely on older applications running on classical infrastructure (for example VMware, Tibco, and mainframe). Over time the stack becomes complicated and costly. None of the new players in this space seem to support legacy technologies, and the legacy vendor solutions make supporting modern architectures cumbersome. The result is a mess, with no clear path to reducing the debt.
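
To illustrate the shift, here is a hypothetical sketch (using SQLAlchemy; the connection string and pool settings are made up) of a connection pool now defined directly in application code, where no file-crawling discovery tool will ever find it:

```python
# A hypothetical example of configuration living in application code rather
# than in an application server descriptor. The DSN and pool sizes are made
# up; running this also assumes the psycopg2 driver is installed.
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql://orders_user:secret@db.internal:5432/orders",
    pool_size=10,      # connection pool defined here, not in server XML
    max_overflow=5,
    pool_timeout=30,
)
```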

In today's world of self-service (public and private cloud, including SaaS), containers, and orchestration, these discovery tools simply do not work. The processes that make up an ITSM strategy, often underpinned by ITIL (more on this later), no longer function in an efficient, highly automated system. Finally, building dependency maps and graphs via a centralized repository is no longer feasible. Adding technology support to a non-functional process does not fix the problem: CMDB discovery tools which add Docker support or attempt to handle Kubernetes or Swarm are missing the point and lack the ability to collect data from ephemeral systems. The technology is not the only thing that has changed; so have the desired business outcomes. The net result is that agility is paramount, delivered by cross-functional product engineering teams. That shift requires a culture change within IT, and yesterday's solutions do not support these initiatives.
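
As a rough illustration of how ephemeral these environments are, the following sketch (using the official Kubernetes Python client; a reachable cluster and kubeconfig are assumed) streams pod lifecycle events; pods appear and disappear faster than any periodic crawl could catalog them:

```python
# A minimal sketch: watch pod lifecycle events in a cluster. Any snapshot a
# discovery crawl takes is stale almost immediately.
from kubernetes import client, config, watch

config.load_kube_config()  # assumes a local kubeconfig; use load_incluster_config() in-cluster
v1 = client.CoreV1Api()

w = watch.Watch()
for event in w.stream(v1.list_pod_for_all_namespaces, timeout_seconds=60):
    pod = event["object"]
    print(event["type"], pod.metadata.namespace, pod.metadata.name, pod.status.phase)
```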

The business demands rapid innovation via incremental but continuous improvement, which results in a high frequency of change across infrastructure (physical and logical), applications, and environments. Discovering and controlling these systems is a challenge from both a security and audit perspective and a service assurance perspective. Decentralized IT organizations driven by the business need to move quickly and experiment, and that mandate for innovation often contradicts the methodologies of centralized IT organizations. Technology which relies on access to systems that sit outside IT's area of control is an ever growing challenge. These issues require us to shift data collection from crawling and cataloging towards instrumentation (scraping web services is fine if your data is lightweight, but exposed API endpoints typically lack depth). Instrumentation provides a more accurate understanding of dependencies and user experience, a dynamic way to understand relationships between physical and logical components, and new use cases for this data that solve some of the problems the CMDB was designed to address. The next generation of CMDB will be dynamically modeled. I have seen sprinklings of this in a couple of startups, but they are missing the right type of instrumentation to provide real-time capabilities. Events, logs, and metrics are not enough, and ingesting topological models is not sufficient; this must be instrumentation across both application and network.
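
As a toy illustration of the dynamically modeled approach, the sketch below builds a dependency map from instrumentation data rather than a crawled inventory; the trace records and service names are hypothetical, and a real agent would emit them automatically:

```python
# A minimal sketch of deriving a dependency model from instrumentation data
# (hypothetical trace records) instead of a crawled CMDB.
from collections import defaultdict

spans = [
    {"service": "web-frontend", "calls": "order-service"},
    {"service": "order-service", "calls": "postgres-orders"},
    {"service": "order-service", "calls": "payment-service"},
    {"service": "payment-service", "calls": "kafka-payments"},
]

dependencies = defaultdict(set)
for span in spans:
    dependencies[span["service"]].add(span["calls"])

for service, downstream in sorted(dependencies.items()):
    print(service, "->", ", ".join(sorted(downstream)))
```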

My next post on this subject will focus on ITSM, which today is disconnected from both customer support and development. I am a strong believer in fixing these issues the right way, with teamwork, and I'm looking forward to sharing my thoughts on this topic.

Monday, July 17, 2017

CIO Insights: Gartner CIO Survey Shows Canadian Growth Increasing

Being a Canadian, and to celebrate Canada turning 150 years old, I wanted to call out some interesting trend data I recently came across. Having spent a good amount of time visiting customers in Canada in 2017, as I do each year, I have noticed changes in the country, quite possibly due to the changes in government and the political climate, or to change and progress in the private sector.

Canada faces economic challenges today, mostly due to a heavy reliance on natural resources and currently low market prices for them. Even with these difficulties, IT budgets in Canada are increasing, surpassing global average growth rates according to the Gartner CIO survey data.


Breaking this down further, only 38% of government respondents in Canada projected budgetary growth versus 64% of non-government respondents; last year those numbers were 16% and 57% respectively. Based on this survey group, Canada is seeing increased technology investment, and much of it is coming from the private sector rather than the public sector, which is encouraging for the Canadian economy.

I will still be making a couple of trips to Canada in 2017, including speaking at the Full Stack conference in Toronto (http://2017.fsto.co/). I'll be presenting new content, including new thoughts on monitoring and instrumentation, and I'm looking forward to finalizing my presentation over the next couple of months. Happy birthday, Canada; I look forward to seeing the results of these changes. See everyone in Toronto.

Friday, June 30, 2017

Digital Business Operations: The New Mode for IT Operations


Recent conversations I've had with others in the IT Operations space helped me formulate an idea I have been working on for a while concerning the future of Ops. The old mode of operating, call it ITIL-based, mode 1, or whatever you prefer, is not going to be the primary mode of operations in the long term. Senior leadership has acknowledged this and made concerted efforts to change hiring habits, spending, and roadmaps for technology and the delivery of new functionality.

More advanced teams operating in an agile manner, or mode 2, consist of smaller, more integrated teams made up of individuals with different skills. These teams are meeting today's digital business challenges. In most enterprises they are part of a bimodal strategy, but bridging the gap between mode 1 and mode 2 is something few have solved; I'm personally not sure it is possible, due to cultural and fundamental differences in beliefs and trust. In many organizations there is a high degree of variability in the level of investment between mode 1 and mode 2, but most leaders agree the future means moving more staff towards mode 2 due to business demand. Listening to customers is key.

There is such a fundamental shift occurring between these two modes of operation that we need two terms to explain how the teams operate and what they require from a people, process, and technology perspective. Bimodal is not quite bifurcated enough compared to what is happening in these enterprises today, so I'm using Digital Business Operations and IT Operations as the names for these two teams. There will be an acceleration of new technologies and capabilities which will further separate the way these teams operate, making unification even more complicated than it is today. Thankfully, better infrastructure abstraction technologies will allow each of these teams to operate independently (naturally, for root cause we'll need bridges, which is our goal at AppDynamics). Many believe the answer will be the adoption of private and hybrid PaaS technologies, but I find these too complex and rooted in yesterday's problems. A better, lighter-weight approach will emerge, built on containers and orchestration, making infrastructure abstraction simple compared to the complexity we see today in private PaaS.

The changes in infrastructure, application architecture, and management are still confined to pockets of enterprises and are often experimental in nature. Similarly, digital business and agility must allow for experimentation, but at some point the experiment solidifies into a core business tenet. This solidification is what will occur within Digital Business Operations, resulting in new, more repeatable (or industrialized) ways to handle processes, toolchains, and workflows which today are implemented inconsistently between organizations.


Digital Business Operations requires a fundamental change in specific areas we've taken for granted in mode 1 IT Operations. These include process frameworks such as ITIL, service management (ITSM, Ticketing, Bug Tracking), automation, and configuration management (especially the concept of a CMDB). Each one of these is a topic I'll hopefully cover in future posts. I'll share some of the challenges, ideas as to how the vendors may or may not solve these issues, and some insight into what practitioners or first mover organizations are doing to address these problems.

Friday, June 16, 2017

Instrumentation is Changing... Finally!

I've always been a fan of trying to solve complicated problems. As an end user, I applied various technologies and tools over the years to diagnose some strange ones. Applications have become increasingly decoupled and distributed, requiring monitoring and diagnostics to change significantly. Let's look at a short timeline of the changes which have occurred, why they were necessary, how they addressed a technological shift, and which challenges remain.

Phase: Component monitoring
Why: Distributed systems became pervasive
How overcome: High-end solutions (CA, BMC, HP) became commoditized (SolarWinds)

Phase: Event correlation
Why: Too many monitoring tools (still applies) created information overload
How overcome: Enterprises rely on antiquated tools; many have given up

Phase: Log analytics
Why: Diagnostics became too challenging in distributed systems
How overcome: Splunk unlocked it, but ELK has commoditized it

Phase: Front-end monitoring
Why: Customer centricity is the key to digital business
How overcome: Not yet well adopted; penetration is still under 20%

Phase: Time-series metrics
Why: Microservices created too many instances, creating too much data
How overcome: Not yet solved, but open source is growing up to handle the scale

Phase: Tracing
Why: Complex and distributed systems are difficult to diagnose
How overcome: Beginning to evolve

In each of these phases we've had commercial innovators, and over time they have been replaced by either open source or commoditized commercial solutions which are basic and low cost, typically because there were frankly bigger problems to solve.

I would say that 2010 to 2015 was the era of log analytics. At this point the technology is pervasive; 95% of the companies I speak to have a strategy. Most of them use a combination of tools, and inexpensive or free software is becoming the norm (not a trait shared by the hardware or cloud storage underneath it). Most enterprises use at least two vendors for log analytics today; typically one of them is Splunk, and very often the other is ELK. When this market was less mature ELK was more fragmented, but the solidified Elastic Stack which Elastic codified has made it a viable alternative.

The era of end-user monitoring peaked in 2015. Everyone was interested, but implementations remain rather small for the use cases of performance monitoring and operational needs. End user experience is a critical area for today's digital businesses, but it often requires a level of maturity that challenges most buyers. I do expect this area to continue growing, but there will always be a gap between end user monitoring and backend monitoring. Most vendors who have tried to build this as a standalone product have failed to gain traction. A recent example is SOASTA, who had good technology and decent growth but failed to build a self-sustaining business; ultimately SOASTA was sold to Akamai.

2016 was the era of time-series metrics, when that market peaked and we saw a massive number of new technology companies flourish. Examples include increased traction for vendors like Datadog, and newer entrants including Wavefront, SignalFx, and others. We've also seen the start of M&A in this area, with SolarWinds buying Librato in 2015 and, more recently, VMware buying Wavefront in 2017. We also have the commercial monitoring entrant Circonus, which has begun OEMing its time-series database under the IronDB moniker. The next phase of this market will be the maturing of time-series databases in the open source world. Examples may include Prometheus, but I'm keeping an eye on InfluxDB as these options mature for monitoring, IoT, and other time-series use cases, especially in making data storage at scale easier. The other missing component is for front-ends like Grafana to improve significantly, with better workflows and easier usage. Ultimately we may see an ELK-like stack emerge in the time-series world, but we'll have to wait and see.
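
For a taste of what the open source side of this looks like today, here is a minimal sketch using the prometheus_client Python library to expose a single time-series metric for scraping; the metric name, port, and values are arbitrary:

```python
# A minimal sketch: expose one gauge metric for a Prometheus server to scrape.
import random
import time

from prometheus_client import Gauge, start_http_server

queue_depth = Gauge("worker_queue_depth", "Jobs waiting in the worker queue")

if __name__ == "__main__":
    start_http_server(8000)  # metrics served at http://localhost:8000/metrics
    while True:
        queue_depth.set(random.randint(0, 50))  # stand-in for a real measurement
        time.sleep(5)
```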

Due to the rise of microservices and the resulting Docker containers, orchestration layers are beginning to take hold in many organizations. As a result of these shifts in application architecture, 2017 seems to have more of a focus on tracing. It is great to see tracing take hold, as root-cause isolation in distributed systems requires good tracing (at least until machine learning matures in a meaningful way). Tracing also lets you understand impact when you have service outages or degradations, and it can tie together IT and business metrics and data. The trace is essentially the "glue" of monitoring, so to speak. By tagging and tracing you are creating a forensic trail; this has yet to be applied within security, but it will be. Gartner even began talking about the application to security not long ago in the research note Application Performance Monitoring and Application Security Monitoring Are Converging (G00324828, Cameron Haight, Neil MacDonald). Detailed tracing is what companies like AppDynamics and Dynatrace do, and it has been the core of their technologies since they were founded on these concepts. These tools solve complex problems faster and do much more than just tracing, but the trace is the glue in the technology. Unfortunately for buyers, these monitoring and diagnostic technologies typically come with a high price tag, but they are not optional for digital businesses.

Today's open source tracing projects require developers to do the work, which is different from commercial APM tools that auto-instrument and support common application frameworks. The open source tooling is getting better, with standards like OpenTracing and front-ends such as Zipkin continually evolving, but these technologies still lack the automated capture you see in commercial tools. How often do I trace? What do I trace? When do I trace? If you expect developers to make these decisions all the time, I think there will be issues. Developers don't understand the macro performance picture of how their component fits into the whole, and once a service is written and reused by other applications, it's hard to understand the performance implications. I am interested in, and currently experimenting with, proxy-level tracing to help extend tagging and tracing in areas where you may not want or need heavier agents. I hope to share more on this in a future blog.
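
To show what that developer-facing work looks like, here is a minimal sketch using the OpenTracing Python API with its default no-op tracer; the span names and tags are mine, and a real deployment would wire in a concrete tracer such as Jaeger or Zipkin:

```python
# A minimal sketch of the manual instrumentation open source tracing asks of
# developers. The global tracer is a no-op until a real tracer is configured.
import opentracing

tracer = opentracing.global_tracer()

def handle_checkout(order_id):
    # The developer decides what to trace, names the span, and tags it.
    with tracer.start_active_span("checkout") as scope:
        scope.span.set_tag("order.id", order_id)
        with tracer.start_active_span("charge-card"):
            pass  # call out to the payment service here

handle_checkout("A-1234")
```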

The long-term goal, however, is to combine all of these siloed data sets and technologies using more advanced correlation capabilities, along with applying new algorithms and machine learning to the problem, an area that is currently in its infancy. At AppDynamics we do this based on the trace itself, and we are evolving new capabilities in a unified manner that always comes back to the trace. Monitoring digital businesses is going to remain an exciting space for quite a long time, requiring constant evolution to keep pace with evolving software and infrastructure architectures.

I'll be giving an updated talk on monitoring and instrumentation which covers a lot of what is in this article, going into more depth on instrumentation and tracing. I will premiere this new talk at the Full Stack Conference, October 23rd and 24th, 2017 in Toronto (tickets at http://2017.fsto.co). I am looking forward to contributing to this great event.