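How the OTel Agent Stays Compatible with the OTel SDK
Background
An agent typically instruments well-known open-source middleware and SDKs automatically, so that users can collect spans, metrics, and other telemetry with little effort. Some users have more advanced requirements, however. They are not satisfied with seeing only the spans that the OTel agent collects, and want to instrument their application with the OTel SDK and the OTel agent at the same time for full coverage. This article briefly explains how the OTel agent keeps the two compatible.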
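Key Problems
Using the OTel SDK together with the OTel agent raises two core problems.
Problem 1: What if the user's SDK version differs from the SDK version inside the agent?
To keep telemetry consistent, the OTel agent also uses the OTel SDK internally to generate spans and metrics. If the user depends on version X of the OTel SDK while the agent bundles version Y, their public APIs may not be fully compatible. The agent therefore has to keep its dependencies compatible: whatever SDK version the user runs, the SDK inside the agent must keep working.
Problem 2: How are spans from the user's SDK linked with spans produced by the agent?
This breaks down into two sub-problems:
- A user of the OTel SDK may already have configured an endpoint for span export, and that endpoint may change once the OTel agent is attached. For example, the user previously reported to a self-hosted backend and now needs to report to the ARMS backend. How do the spans that the SDK used to send to the self-hosted backend get linked up inside ARMS?
- How do spans generated through the user's OTel SDK get a parent-child relationship with spans from the agent? The span-creation logic of the user's SDK and of the agent's SDK may not be connected, so the agent's SDK may be unaware that spans from the user's SDK exist. Linking the spans is therefore another fairly tricky problem.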
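The second sub-problem can be made concrete with a toy model. The sketch below is not OTel code: `DemoSdk` and `DemoSpan` are hypothetical stand-ins, and it assumes only that each SDK copy keeps its own private record of the current span. Under that assumption, a span started through a second SDK copy cannot see a span that the first copy has made current.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class IsolatedSdkDemo {
    /** A span that only remembers its name and its parent's name (null for a root span). */
    static final class DemoSpan {
        final String name;
        final String parentName;

        DemoSpan(String name, String parentName) {
            this.name = name;
            this.parentName = parentName;
        }
    }

    /** Hypothetical stand-in for one SDK copy with its own private context storage. */
    static final class DemoSdk {
        private final ThreadLocal<Deque<DemoSpan>> active =
                ThreadLocal.withInitial(ArrayDeque::new);

        /** Starts a span whose parent is whatever this SDK copy considers current. */
        DemoSpan start(String name) {
            DemoSpan parent = active.get().peek();
            DemoSpan span = new DemoSpan(name, parent == null ? null : parent.name);
            active.get().push(span);
            return span;
        }

        /** Ends the most recently started span of this SDK copy. */
        void end() {
            active.get().pop();
        }
    }

    /** Starts an "agent" span, then a "user" span inside it, and returns the user span. */
    static DemoSpan nested(DemoSdk agentSdk, DemoSdk userSdk) {
        agentSdk.start("HTTP GET /orders");
        try {
            DemoSpan userSpan = userSdk.start("load-order");
            userSdk.end();
            return userSpan;
        } finally {
            agentSdk.end();
        }
    }

    public static void main(String[] args) {
        DemoSdk agent = new DemoSdk();
        DemoSdk user = new DemoSdk();
        System.out.println("separate SDKs -> parent = " + nested(agent, user).parentName);
        System.out.println("shared SDK    -> parent = " + nested(agent, agent).parentName);
    }
}
```

With two separate copies the user span ends up with no parent, while routing both calls through a single copy yields the expected parent. This is the gap that the bridging described later has to close.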
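Implementation in the OTel Agent
Solving Problem 1:
The OTel agent isolates the user's SDK from the SDK inside the agent through mechanisms such as class loaders; the details are out of scope here. In short, there are two copies of the SDK, one for the user and one for the agent, and they do not interfere with each other.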
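As a generic JVM illustration of why class loaders give this kind of isolation, the sketch below loads the same class through two loaders and shows that each copy has its own static state. It is not the agent's actual loader: `FakeSdk` and `IsolatingLoader` are hypothetical names, and the sketch assumes the compiled `.class` file is reachable as a resource on the classpath.

```java
import java.io.IOException;
import java.io.InputStream;

final class LoaderIsolationDemo {
    /** Hypothetical stand-in for an SDK class that holds global (static) state. */
    public static final class FakeSdk {
        public static String version = "unset";
    }

    /** Defines its own copy of FakeSdk instead of delegating that one class to the parent. */
    static final class IsolatingLoader extends ClassLoader {
        private static final String ISOLATED = FakeSdk.class.getName();

        IsolatingLoader() {
            super(LoaderIsolationDemo.class.getClassLoader());
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
            if (!ISOLATED.equals(name)) {
                return super.loadClass(name, resolve); // normal parent-first delegation
            }
            synchronized (getClassLoadingLock(name)) {
                Class<?> loaded = findLoadedClass(name);
                return loaded != null ? loaded : defineFromParentBytes(name);
            }
        }

        /** Reads the class file through the parent loader and defines it in this loader. */
        private Class<?> defineFromParentBytes(String name) throws ClassNotFoundException {
            String resource = name.replace('.', '/') + ".class";
            try (InputStream in = getParent().getResourceAsStream(resource)) {
                if (in == null) {
                    throw new ClassNotFoundException(name);
                }
                byte[] bytes = in.readAllBytes();
                return defineClass(name, bytes, 0, bytes.length);
            } catch (IOException e) {
                throw new ClassNotFoundException(name, e);
            }
        }
    }

    /** Loads a private copy of FakeSdk and stamps its static field with the given version. */
    static Class<?> loadCopy(String version) {
        try {
            Class<?> copy = new IsolatingLoader().loadClass(FakeSdk.class.getName());
            copy.getField("version").set(null, version);
            return copy;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Reads the static version field of one particular copy. */
    static String versionOf(Class<?> copy) {
        try {
            return (String) copy.getField("version").get(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Class<?> userCopy = loadCopy("X");
        Class<?> agentCopy = loadCopy("Y");
        System.out.println("same class name:   " + userCopy.getName().equals(agentCopy.getName()));
        System.out.println("same Class object: " + (userCopy == agentCopy));
        System.out.println("user copy version:  " + versionOf(userCopy));
        System.out.println("agent copy version: " + versionOf(agentCopy));
    }
}
```

The two copies share a class name but are distinct `Class` objects, so setting the version on one leaves the other untouched. The same property is what lets an X-version SDK and a Y-version SDK coexist in one JVM, and it is also why objects of one copy cannot be passed directly to the other.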
Solving Problem 2:
The agent solves Problem 2 by instrumenting the OTel SDK itself. The instrumentation is organized into several modules, built around the OTel concepts documented here:
- Baggage: https://opentelemetry.io/docs/concepts/signals/baggage/
- Propagators: https://opentelemetry.io/docs/concepts/context-propagation/
- Context & Span: https://opentelemetry.io/docs/concepts/signals/traces/#spans
- Tracer: https://opentelemetry.io/docs/concepts/signals/traces/#tracer-provider
First, let's walk through how a span is created with the OTel SDK:
- Initialize the TracerProvider and the Propagators
- Create a Tracer from the TracerProvider and the Propagators

    import io.opentelemetry.api.OpenTelemetry;
    import io.opentelemetry.api.common.Attributes;
    import io.opentelemetry.api.trace.Tracer;
    import io.opentelemetry.api.trace.propagation.W3CTraceContextPropagator;
    import io.opentelemetry.context.propagation.ContextPropagators;
    import io.opentelemetry.exporter.otlp.http.trace.OtlpHttpSpanExporter;
    import io.opentelemetry.sdk.OpenTelemetrySdk;
    import io.opentelemetry.sdk.resources.Resource;
    import io.opentelemetry.sdk.trace.SdkTracerProvider;
    import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;
    import io.opentelemetry.semconv.resource.attributes.ResourceAttributes;

    public class OpenTelemetrySupport {

        private static final Tracer tracer;

        static {
            // Obtain the OpenTelemetry Tracer
            Resource resource = Resource.getDefault()
                    .merge(Resource.create(Attributes.of(
                            ResourceAttributes.SERVICE_NAME, "",
                            ResourceAttributes.SERVICE_VERSION, "",
                            ResourceAttributes.DEPLOYMENT_ENVIRONMENT, "",
                            ResourceAttributes.HOST_NAME, "${host-name}" // replace ${host-name} with your host name
                    )));

            // Export spans in batches over OTLP/HTTP
            SdkTracerProvider sdkTracerProvider = SdkTracerProvider.builder()
                    .addSpanProcessor(BatchSpanProcessor.builder(OtlpHttpSpanExporter.builder()
                            .setEndpoint("http://tracing-analysis-dc-hz-internal.aliyuncs.com/adapt_ggxw4lnjuz@7323a5caae30263_ggxw4lnjuz@53df7ad2afe8301/api/otlp/traces")
                            .build()).build())
                    .setResource(resource)
                    .build();

            // Register the SDK globally with W3C Trace Context propagation
            OpenTelemetry openTelemetry = OpenTelemetrySdk.builder()
                    .setTracerProvider(sdkTracerProvider)
                    .setPropagators(ContextPropagators.create(W3CTraceContextPropagator.getInstance()))
                    .buildAndRegisterGlobal();

            tracer = openTelemetry.getTracer("OpenTelemetry Tracer", "1.0.0");
        }

        public static Tracer getTracer() {
            return tracer;
        }
    }

- Use the Tracer to build a Span, start it with startSpan(), and finish it with end() so that it is reported

    import io.opentelemetry.api.trace.Span;
    import io.opentelemetry.api.trace.StatusCode;
    import io.opentelemetry.context.Scope;

    public class Main {

        public static void parentMethod() {
            Span span = OpenTelemetrySupport.getTracer().spanBuilder("parent span").startSpan();
            // makeCurrent() puts the span into the current Context so children can find it
            try (Scope scope = span.makeCurrent()) {
                span.setAttribute("good", "job");
                childMethod();
            } catch (Throwable t) {
                span.setStatus(StatusCode.ERROR, "handle parent span error");
            } finally {
                span.end();
            }
        }

        public static void childMethod() {
            // The parent is taken from the current Context set by parentMethod()
            Span span = OpenTelemetrySupport.getTracer().spanBuilder("child span").startSpan();
            try (Scope scope = span.makeCurrent()) {
                span.setAttribute("hello", "world");
            } catch (Throwable t) {
                span.setStatus(StatusCode.ERROR, "handle child span error");
            } finally {
                span.end();
            }
        }

        public static void main(String[] args) {
            parentMethod();
        }
    }

The compatibility approach is simple: at the key points where the user creates spans, the agent uses wrapper classes to proxy all of the operations above. Taking span creation as an example, the instrumentation code looks like this:

    /*
     * Copyright The OpenTelemetry Authors
     * SPDX-License-Identifier: Apache-2.0
     */

    package io.opentelemetry.javaagent.instrumentation.opentelemetryapi;

    import static net.bytebuddy.matcher.ElementMatchers.isMethod;
    import static net.bytebuddy.matcher.ElementMatchers.isStatic;
    import static net.bytebuddy.matcher.ElementMatchers.named;

    import application.io.opentelemetry.api.trace.Span;
    import application.io.opentelemetry.api.trace.SpanContext;
    import io.opentelemetry.javaagent.extension.instrumentation.TypeInstrumentation;
    import io.opentelemetry.javaagent.extension.instrumentation.TypeTransformer;
    import io.opentelemetry.javaagent.instrumentation.opentelemetryapi.trace.Bridging;
    import net.bytebuddy.asm.Advice;
    import net.bytebuddy.description.type.TypeDescription;
    import net.bytebuddy.matcher.ElementMatcher;

    public class SpanInstrumentation implements TypeInstrumentation {
      @Override
      public ElementMatcher<TypeDescription> typeMatcher() {
        return named("application.io.opentelemetry.api.trace.PropagatedSpan");
      }

      @Override
      public void transform(TypeTransformer transformer) {
        transformer.applyAdviceToMethod(
            isMethod().and(isStatic()).and(named("create")),
            SpanInstrumentation.class.getName() + "$CreateAdvice");
      }

      @SuppressWarnings("unused")
      public static class CreateAdvice {

        // We replace the return value completely so don't need to call the method.
        @Advice.OnMethodEnter(skipOn = Advice.OnDefaultValue.class)
        public static boolean methodEnter() {
          return false;
        }

        @Advice.OnMethodExit
        public static void methodExit(
            @Advice.Argument(0) SpanContext applicationSpanContext,
            @Advice.Return(readOnly = false) Span applicationSpan) {
          applicationSpan =
              Bridging.toApplication(
                  io.opentelemetry.api.trace.Span.wrap(Bridging.toAgent(applicationSpanContext)));
        }
      }
    }

It first converts the SpanContext from the user's OTel SDK into the SpanContext of the SDK inside the agent:

    // Bridging.toAgent: copy the IDs, flags, and trace state into an agent-side SpanContext
    public static io.opentelemetry.api.trace.SpanContext toAgent(SpanContext applicationContext) {
      if (applicationContext.isRemote()) {
        return io.opentelemetry.api.trace.SpanContext.createFromRemoteParent(
            applicationContext.getTraceId(),
            applicationContext.getSpanId(),
            BridgedTraceFlags.toAgent(applicationContext.getTraceFlags()),
            toAgent(applicationContext.getTraceState()));
      } else {
        return io.opentelemetry.api.trace.SpanContext.create(
            applicationContext.getTraceId(),
            applicationContext.getSpanId(),
            BridgedTraceFlags.toAgent(applicationContext.getTraceFlags()),
            toAgent(applicationContext.getTraceState()));
      }
    }

Next, the agent-side SpanContext is used to create a Span of the agent's SDK (through Span.wrap), and that Span is then wrapped in a proxy that presents it as a Span of the user's SDK:

    // Bridging.toApplication: wrap a valid agent span so it can be handed to application code
    public static Span toApplication(io.opentelemetry.api.trace.Span agentSpan) {
      if (!agentSpan.getSpanContext().isValid()) {
        // no need to wrap
        return Span.getInvalid();
      } else {
        return new ApplicationSpan(agentSpan);
      }
    }

    class ApplicationSpan implements Span {

      private final io.opentelemetry.api.trace.Span agentSpan;

      ApplicationSpan(io.opentelemetry.api.trace.Span agentSpan) {
        this.agentSpan = agentSpan;
      }

      io.opentelemetry.api.trace.Span getAgentSpan() {
        return agentSpan;
      }

      @Override
      @CanIgnoreReturnValue
      public Span setAttribute(String key, String value) {
        agentSpan.setAttribute(key, value);
        return this;
      }

      @Override
      @CanIgnoreReturnValue
      public Span setAttribute(String key, long value) {
        agentSpan.setAttribute(key, value);
        return this;
      }

      @Override
      @CanIgnoreReturnValue
      public Span setAttribute(String key, double value) {
        agentSpan.setAttribute(key, value);
        return this;
      }

      @Override
      @CanIgnoreReturnValue
      public Span setAttribute(String key, boolean value) {
        agentSpan.setAttribute(key, value);
        return this;
      }

      @Override
      @CanIgnoreReturnValue
      public <T> Span setAttribute(AttributeKey<T> applicationKey, T value) {
        @SuppressWarnings("unchecked")
        io.opentelemetry.api.common.AttributeKey<T> agentKey = Bridging.toAgent(applicationKey);
        if (agentKey != null) {
          agentSpan.setAttribute(agentKey, value);
        }
        return this;
      }

      @Override
      @CanIgnoreReturnValue
      public Span addEvent(String name) {
        agentSpan.addEvent(name);
        return this;
      }

      @Override
      @CanIgnoreReturnValue
      public Span addEvent(String name, long timestamp, TimeUnit unit) {
        agentSpan.addEvent(name, timestamp, unit);
        return this;
      }

      @Override
      @CanIgnoreReturnValue
      public Span addEvent(String name, Attributes applicationAttributes) {
        agentSpan.addEvent(name, Bridging.toAgent(applicationAttributes));
        return this;
      }

      @Override
      @CanIgnoreReturnValue
      public Span addEvent(
          String name, Attributes applicationAttributes, long timestamp, TimeUnit unit) {
        agentSpan.addEvent(name, Bridging.toAgent(applicationAttributes), timestamp, unit);
        return this;
      }

      @Override
      @CanIgnoreReturnValue
      public Span setStatus(StatusCode status) {
        agentSpan.setStatus(Bridging.toAgent(status));
        return this;
      }

      @Override
      @CanIgnoreReturnValue
      public Span setStatus(StatusCode status, String description) {
        agentSpan.setStatus(Bridging.toAgent(status), description);
        return this;
      }

      @Override
      @CanIgnoreReturnValue
      public Span recordException(Throwable throwable) {
        agentSpan.recordException(throwable);
        return this;
      }

      @Override
      @CanIgnoreReturnValue
      public Span recordException(Throwable throwable, Attributes attributes) {
        agentSpan.recordException(throwable, Bridging.toAgent(attributes));
        return this;
      }

      @Override
      @CanIgnoreReturnValue
      public Span updateName(String name) {
        agentSpan.updateName(name);
        return this;
      }

      @Override
      public void end() {
        agentSpan.end();
      }

      @Override
      public void end(long timestamp, TimeUnit unit) {
        agentSpan.end(timestamp, unit);
      }

      @Override
      public SpanContext getSpanContext() {
        return Bridging.toApplication(agentSpan.getSpanContext());
      }

      @Override
      public boolean isRecording() {
        return agentSpan.isRecording();
      }

      @Override
      public boolean equals(@Nullable Object obj) {
        if (obj == this) {
          return true;
        }
        if (!(obj instanceof ApplicationSpan)) {
          return false;
        }
        ApplicationSpan other = (ApplicationSpan) obj;
        return agentSpan.equals(other.agentSpan);
      }

      @Override
      public String toString() {
        return "ApplicationSpan{agentSpan=" + agentSpan + '}';
      }

      @Override
      public int hashCode() {
        return agentSpan.hashCode();
      }

      static class Builder implements SpanBuilder {

        private final io.opentelemetry.api.trace.SpanBuilder agentBuilder;

        Builder(io.opentelemetry.api.trace.SpanBuilder agentBuilder) {
          this.agentBuilder = agentBuilder;
        }

        @Override
        @CanIgnoreReturnValue
        public SpanBuilder setParent(Context applicationContext) {
          agentBuilder.setParent(AgentContextStorage.getAgentContext(applicationContext));
          return this;
        }

        @Override
        @CanIgnoreReturnValue
        public SpanBuilder setNoParent() {
          agentBuilder.setNoParent();
          return this;
        }

        @Override
        @CanIgnoreReturnValue
        public SpanBuilder addLink(SpanContext applicationSpanContext) {
          agentBuilder.addLink(Bridging.toAgent(applicationSpanContext));
          return this;
        }

        @Override
        @CanIgnoreReturnValue
        public SpanBuilder addLink(
            SpanContext applicationSpanContext, Attributes applicationAttributes) {
          // Note: as quoted here, applicationAttributes is not forwarded to the agent builder.
          agentBuilder.addLink(Bridging.toAgent(applicationSpanContext));
          return this;
        }

        @Override
        @CanIgnoreReturnValue
        public SpanBuilder setAttribute(String key, String value) {
          agentBuilder.setAttribute(key, value);
          return this;
        }

        @Override
        @CanIgnoreReturnValue
        public SpanBuilder setAttribute(String key, long value) {
          agentBuilder.setAttribute(key, value);
          return this;
        }

        @Override
        @CanIgnoreReturnValue
        public SpanBuilder setAttribute(String key, double value) {
          agentBuilder.setAttribute(key, value);
          return this;
        }

        @Override
        @CanIgnoreReturnValue
        public SpanBuilder setAttribute(String key, boolean value) {
          agentBuilder.setAttribute(key, value);
          return this;
        }

        @Override
        @CanIgnoreReturnValue
        public <T> SpanBuilder setAttribute(AttributeKey<T> applicationKey, T value) {
          @SuppressWarnings("unchecked")
          io.opentelemetry.api.common.AttributeKey<T> agentKey = Bridging.toAgent(applicationKey);
          if (agentKey != null) {
            agentBuilder.setAttribute(agentKey, value);
          }
          return this;
        }

        @Override
        @CanIgnoreReturnValue
        public SpanBuilder setSpanKind(SpanKind applicationSpanKind) {
          io.opentelemetry.api.trace.SpanKind agentSpanKind = toAgentOrNull(applicationSpanKind);
          if (agentSpanKind != null) {
            agentBuilder.setSpanKind(agentSpanKind);
          }
          return this;
        }

        @Override
        @CanIgnoreReturnValue
        public SpanBuilder setStartTimestamp(long startTimestamp, TimeUnit unit) {
          agentBuilder.setStartTimestamp(startTimestamp, unit);
          return this;
        }

        @Override
        public Span startSpan() {
          return new ApplicationSpan(agentBuilder.startSpan());
        }
      }
    }

As shown, the ApplicationSpan proxy implements the Span interface of the OTel API used by the user's code, and every method simply forwards to the agent-side span. The instrumentation also skips the span-creation logic of the user's SDK: methodEnter returns false and the advice is declared with skipOn = Advice.OnDefaultValue.class, so the original create body never runs. Only the agent's logic executes, which avoids conflicts between the user's SDK and the agent. Because the resulting spans are created by the agent's SDK, they follow the agent's reporting path, and a parent passed to setParent is resolved to the agent's Context through AgentContextStorage, so user spans and agent spans can be linked as parent and child.
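The control flow of that instrumentation can be modeled in plain Java. The sketch below is only a simplified model under stated assumptions, not the agent's code: `AppSpan`, `AgentSpan`, `BridgedSpan`, and `instrumentedCreate` are hypothetical names, and the enter/exit advice is simulated with ordinary method calls instead of Byte Buddy.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class BridgeSketch {
    /** Hypothetical application-facing API, standing in for the user's Span interface. */
    interface AppSpan {
        AppSpan setAttribute(String key, String value);

        void end();
    }

    /** Hypothetical agent-side span: the only object that actually records anything. */
    static final class AgentSpan {
        final String spanId;
        final Map<String, String> attributes = new LinkedHashMap<>();
        boolean ended;

        AgentSpan(String spanId) {
            this.spanId = spanId;
        }
    }

    /** Plays the role of ApplicationSpan: implements the application API, forwards every call. */
    static final class BridgedSpan implements AppSpan {
        private final AgentSpan agentSpan;

        BridgedSpan(AgentSpan agentSpan) {
            this.agentSpan = agentSpan;
        }

        AgentSpan getAgentSpan() {
            return agentSpan;
        }

        @Override
        public AppSpan setAttribute(String key, String value) {
            agentSpan.attributes.put(key, value);
            return this;
        }

        @Override
        public void end() {
            agentSpan.ended = true;
        }
    }

    /** The application's own factory. Once instrumented, its body must never run. */
    static AppSpan originalCreate(String spanId) {
        throw new IllegalStateException("original create() should have been skipped");
    }

    /** Models the enter advice: returning false means "skip the original method body". */
    static boolean methodEnter() {
        return false;
    }

    /** Models the exit advice: build the agent-side span and hand back a bridged wrapper. */
    static AppSpan methodExit(String spanId) {
        return new BridgedSpan(new AgentSpan(spanId));
    }

    /** What a call to the instrumented factory effectively does. */
    static AppSpan instrumentedCreate(String spanId) {
        AppSpan result = methodEnter() ? originalCreate(spanId) : null;
        result = methodExit(spanId); // the exit advice overwrites the return value
        return result;
    }

    public static void main(String[] args) {
        AppSpan span = instrumentedCreate("span-1");
        span.setAttribute("good", "job").end();
        AgentSpan recorded = ((BridgedSpan) span).getAgentSpan();
        System.out.println(recorded.spanId + " " + recorded.attributes + " ended=" + recorded.ended);
    }
}
```

The application only ever sees the `AppSpan` interface, yet every attribute and the end of the span are recorded on the agent-side object, and the original factory is never invoked. Keeping the wrapper a pure forwarder means the application API and the agent API can evolve separately as long as the bridge translates between them.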
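Summary
The OTel agent stays compatible with the user's OTel SDK by instrumenting and enhancing that SDK. By wrapping and proxying a few key OTel classes, it bridges the SDK and the agent cleanly.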