Monday, March 12, 2018

Misunderstanding "Open Tracing" for the Enterprise

When I first heard of the OpenTracing project in 2016, I was excited: finally, an open standard for tracing. First, what is a trace? A trace follows a transaction across different services to build an end-to-end picture. The latency of each transaction segment is captured to determine which segment is slow or causing performance issues. The trace may also include metadata such as metrics and logs; more on that later.
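
To make the definition concrete, here is a minimal sketch in Java of a trace as a list of timed segments. The names (TraceSketch, Segment, slowestService) are illustrative assumptions only; they are not taken from OpenTracing or from any vendor's API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a trace: one transaction, many timed segments.
class TraceSketch {
    // One hop of the transaction, for example a call into a single service.
    static final class Segment {
        final String service;
        final long latencyMillis;

        Segment(String service, long latencyMillis) {
            this.service = service;
            this.latencyMillis = latencyMillis;
        }
    }

    private final List<Segment> segments = new ArrayList<>();

    // Record the latency observed for one segment of the transaction.
    void record(String service, long latencyMillis) {
        segments.add(new Segment(service, latencyMillis));
    }

    // Sum of all recorded segment latencies (assumes the calls ran sequentially).
    long totalMillis() {
        long total = 0;
        for (Segment s : segments) {
            total += s.latencyMillis;
        }
        return total;
    }

    // Name of the service with the highest latency, or null if nothing was recorded.
    String slowestService() {
        Segment slowest = null;
        for (Segment s : segments) {
            if (slowest == null || s.latencyMillis > slowest.latencyMillis) {
                slowest = s;
            }
        }
        return slowest == null ? null : slowest.service;
    }
}
```

Recording three segments and asking for slowestService() points at the hop to investigate first. Real tracers also track parent and child relationships and timestamps, which this sketch leaves out.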

Great, so if this is open, it will solve all the interoperability issues we have and allow me to use multiple APM and tracing tools at once? It will help avoid vendor or project lock-in and unlock cloud services which are opaque or invisible? Nope! Why not?

Today there are many different implementations of tracing that provide end-to-end transaction monitoring, because each project or vendor has different capabilities and use cases for the traces. Most tool users don't need to know the implementation details, but when manually instrumenting with an API, the implementation must be well understood along with the use cases for the data. When you look at language-specific deployments there is even more variability; what you can do in Java is worlds apart from what is available for Golang and Python. It gets even more complicated because every vendor and open source tool uses different terminology to describe what it does and what it calls things.

The Enterprise uses a wide range of technologies which must be cobbled together to make its applications work. Some of these are custom apps written in Java, .NET, and other languages, much of it a decade old. Other parts of the stack are packaged applications such as those provided by Oracle, Microsoft, SAP, and many more. These often work with messaging systems which span both commercial tools using proprietary protocols, such as those offered by Tibco, Mulesoft, Oracle, and Microsoft, and open source projects such as ActiveMQ and RabbitMQ. Finally, there are modern technologies built in a newer language, for example Golang or Python, which often use native cloud services and PaaS platforms. All of these must work together to deliver a high-quality user experience and the expected performance, resulting in desirable business outcomes. Don't forget that each of these technologies often has data store and database requirements which must also be met for it to function. The result is that identification, isolation, and remediation of problems is a big challenge.

In many Enterprise organizations, each of these "hops" is managed by a different team or subject matter expert who often uses another siloed tool for monitoring and diagnostics. The enterprise APM providers build end-to-end views across these technologies, both old and new. Unified views are incredibly hard to do technically and culturally, and even more difficult in production, under heavy load, while minimally affecting the performance of the instrumented transactions.

The opposite of the Enterprise is the set of smaller, more modern companies that build things differently. They create customized stacks of open source and develop tools and technologies which can go very deep into the infrastructure layers. These companies are not the Enterprise; they have typically been running in the cloud or in containers since their inception. They make different decisions and are happy to leverage open source projects, which are often customized extensively. They write their own instrumentation for various purposes, including monitoring and root-cause analysis. OpenTracing attempts to standardize this custom-coded instrumentation with a standard API and vocabulary. Once the developer writes the instrumentation, how a tool or platform consumes the data is not part of the standard; it's not a problem OpenTracing is trying to solve.

The result of this decision is that if a user has implemented OpenTracing with a specific vendor and language, let's call the vendor "Tool A", that vendor must release OpenTracing libraries or implementations for each language. Yes, each of the dozens of languages needs an implementation from each tool, with specific language features to match what the tool does. If I were to switch to a different tool, such as "Tool B", that would mean changing the libraries, which the vendor hopefully supplies for each language. Implementing a new tool requires code changes to connect the library, plus any language-specific implementation changes. For a mature language such as Java, this would require changing the library and the implementation at the same time, since the propagation formats of the tools are incompatible. OpenTracing doesn't solve the interoperability problem, so what is the "open standard" attempting to solve? Well, for one thing, it allows those making gateways, proxies, and frameworks to write instrumentation. That should, in theory, make it easier to get traces connected, but once again the requirement to change implementation details for each tool is a problem.
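
The incompatibility can be sketched in a few lines of Java. Everything here is hypothetical: ContextPropagator stands in for a shared instrumentation API, and the two propagators and their header names are invented for illustration; they do not describe any real vendor's format.

```java
import java.util.Map;

// Hypothetical stand-in for a vendor-neutral instrumentation API.
interface ContextPropagator {
    // Write the trace id into the outgoing request headers.
    void inject(String traceId, Map<String, String> headers);

    // Read the trace id from incoming headers; null when no context is found.
    String extract(Map<String, String> headers);
}

// Hypothetical "Tool A": carries the trace id in its own header.
class ToolAPropagator implements ContextPropagator {
    public void inject(String traceId, Map<String, String> headers) {
        headers.put("x-tool-a-trace", traceId);
    }

    public String extract(Map<String, String> headers) {
        return headers.get("x-tool-a-trace");
    }
}

// Hypothetical "Tool B": different header name and a different value layout.
class ToolBPropagator implements ContextPropagator {
    private static final String PREFIX = "id=";

    public void inject(String traceId, Map<String, String> headers) {
        headers.put("x-tool-b-context", PREFIX + traceId);
    }

    public String extract(Map<String, String> headers) {
        String value = headers.get("x-tool-b-context");
        if (value == null || !value.startsWith(PREFIX)) {
            return null;
        }
        return value.substring(PREFIX.length());
    }
}
```

Both classes satisfy the same interface, so application code compiles against either one. A service running the Tool B propagator still cannot read a context injected by Tool A, which is why a shared API alone does not connect traces across tools.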

Enterprise APM tools do a lot more than tracing. For well-understood languages like Java, APM tools collect metrics from the runtime (JMX) and from the code via bytecode instrumentation (BCI); they capture call stacks, SQL calls, and other details which are correlated back to the trace. Advanced tools can even do runtime instrumentation without code changes using BCI. If you were doing this with OpenTracing, it would require code changes to make the API calls. These code changes are then part of your code base and must be maintained and managed by your team. Don't forget that Enterprise APM tools do a lot more than just backend tracing: they capture metrics and traces on the front end (mobile/web) and in the infrastructure, capture logs, and even correlate to other APIs. These additional capabilities are out of scope for the OpenTracing project.

So if the goal is to solve the problems an open standard would be expected to solve, OpenTracing doesn't do a lot. If you read the marketing put out by various vendors and foundations, you would think it's a panacea, but the reality is far from it. OpenTracing does not standardize metrics, logs, or other structured data which tools consume.

For some reason there is a presumption that swapping out the agents of an APM tool is difficult; this is not the case at all. Vendors replace each other all the time. If you are interested in these stories, many have been published, and countless others remain behind closed doors. It's much easier to change instrumentation done at runtime than instrumentation hardcoded into your products. This is especially the case when the API and standards are always changing, and many of the changes in OpenTracing have been breaking changes to the APIs.

Enterprise APM vendors must build a lot of instrumentation to support these various technologies, languages, and frameworks. Each tool vendor has dozens of engineers dedicated just to writing and maintaining framework instrumentation, which is a waste of resources; the work could be done more efficiently if there were an actual open standard.

In the past six months, a working group formed thanks to the primary open source developers at Google, Pivotal, and Microsoft who created the OpenCensus project, focused on solving these issues including interoperability. OpenCensus is a single distribution of libraries that automatically collects traces and metrics from your app, displays them locally, and sends them to any analysis tool. OpenCensus is a pluggable implementation/framework that unifies instrumentation concepts that were traditionally separate, consisting of three major components: tracing, tagging, and metrics.

As part of this project, we are defining TraceContext, a standard header which tool vendors and open source projects can use to propagate information, otherwise known as a wire protocol. The first version, which we are building now, targets HTTP. The organizations behind this standard include most of the major APM vendors, all major public cloud vendors, and some private cloud vendors.
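
As a rough illustration of what a standard propagation header enables, here is a small Java parser. It assumes the four-field layout that the W3C Trace Context work later published for the traceparent header (version, trace id, parent span id, and flags, hex encoded and separated by hyphens); the draft being written at the time of this post may differ in detail. The TraceParent class is a hypothetical name, and the parser only checks the shape of the value.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical holder for a parsed trace context header value.
class TraceParent {
    // Assumed layout: version (2 hex) - trace id (32 hex) - parent id (16 hex) - flags (2 hex)
    private static final Pattern FORMAT = Pattern.compile(
            "^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");

    final String version;
    final String traceId;
    final String parentId;
    final boolean sampled;

    private TraceParent(String version, String traceId, String parentId, boolean sampled) {
        this.version = version;
        this.traceId = traceId;
        this.parentId = parentId;
        this.sampled = sampled;
    }

    // Returns the parsed header, or null when the value does not match the assumed layout.
    static TraceParent parse(String header) {
        if (header == null) {
            return null;
        }
        Matcher m = FORMAT.matcher(header.trim());
        if (!m.matches()) {
            return null;
        }
        // The lowest bit of the flags field marks the trace as sampled.
        boolean sampled = (Integer.parseInt(m.group(4), 16) & 0x01) != 0;
        return new TraceParent(m.group(1), m.group(2), m.group(3), sampled);
    }
}
```

Because every participant would read and write the same header, a proxy or cloud service could pass the context along without knowing which tool will consume the trace.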

Finally, although OpenTracing gets tremendous buzz, it's not as popular as one might think. Looking at distributed tracing projects on GitHub, OpenTracing is not heavily mentioned; in fact, the most used OSS projects are tools rather than APIs. Leading OSS projects include OpenZipkin and Jaeger (Uber/Red Hat) along with Skywalking (Huawei). All of these projects have a wide range of committers, especially OpenZipkin, which started at Twitter and is community driven while having ties with Pivotal. These organizations are all intimately involved in OpenCensus, along with the main committers to these tools. Most of the marketing fails to mention these successful projects even though they have higher adoption and are already compatible with many frameworks, API gateways, PaaS platforms, and public clouds, unlike OpenTracing, which has limited use cases aside from a handful of small APM vendors and Red Hat, who have a vested interest in its adoption.

Since TraceContext is a wire protocol, public cloud providers, frameworks, and proxies such as linkerd and envoy can implement it, and the data would be natively consumed by any tool, creating true interoperability. The value in this instrumentation is the analytics and context, not the instrumentation itself (for the most part). Getting behind this standard is vital for the APM industry, as it will solve user concerns about open data exchange and interoperability. Please join the OpenCensus project and help us address the real problem, not create new future pain.

This blog post is the opinion of Jonah Kowall and is not affiliated with his employer or other parties.

2 comments:

Unknown said...

Hi Jonas. Regarding needing to modify source code, doesn't OpenCensus require that, too?
For example, the README at https://github.com/census-instrumentation/opencensus-java shows this:

public final class MyClassWithTracing {
  private static final Tracer tracer = Tracing.getTracer();

  public static void doWork() {
    // Create a child Span of the current Span.
    try (Scope ss = tracer.spanBuilder("MyChildWorkSpan").startScopedSpan()) {
      doInitialWork();
      tracer.getCurrentSpan().addAnnotation("Finished initial work");
      doFinalWork();
    }
  }

  private static void doInitialWork() {
    // ...
    tracer.getCurrentSpan().addAnnotation("Important.");
    // ...
  }

  private static void doFinalWork() {
    // ...
    tracer.getCurrentSpan().addAnnotation("More important.");
    // ...
  }
}

I could be missing something, but this looks like modified code to me, too. For OpenTracing one might add Java annotations, and in this OpenCensus example we see one has to add some OC-specific Java code. In that sense, OC is no better than OT, it seems... thoughts?

Jonah Kowall said...

Yes, there is a manual mode, but there is also auto-instrumentation for getting latency and traces. You can find the agent here: https://github.com/census-instrumentation/opencensus-java/tree/master/contrib/agent. It sends data to Zipkin and Stackdriver Trace for now. There are lots of ways to use this data, and more agents may come over time, even from the commercial vendors at some point.