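How the OTel Agent Stays Compatible with the OTel SDK
Background
An agent usually instruments well-known open-source middleware and SDKs automatically, so that users can collect spans, metrics, and other telemetry with little effort. Some users, however, have more advanced needs. Seeing only the spans collected by the OTel agent is not enough for them: they want to use the OTel SDK together with the OTel agent to instrument the whole application. This article briefly explains how the OTel agent keeps the two compatible.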
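Key Problems
Using the OTel SDK alongside the OTel agent raises two core problems.
Problem 1: What if the user's SDK version differs from the SDK version inside the agent?
To keep telemetry consistent, the OTel agent itself uses the OTel SDK to generate spans and metrics. If the user depends on version X of the OTel SDK while the agent bundles version Y, their public APIs may not be fully compatible. The agent therefore has to guarantee dependency compatibility: whatever SDK version the user picks, the SDK inside the agent must keep working.
Problem 2: How are spans from the user's SDK linked with spans produced by the agent?
This problem breaks down into two sub-problems:
- A user of the OTel SDK may already have configured an endpoint for span export, and that endpoint may change once the OTel agent is attached. For example, the user used to report to a self-hosted backend and now needs to report to the ARMS backend. How do the spans that the SDK used to send to the self-hosted backend get linked inside ARMS?
- How does a span created through the user's OTel SDK get a parent-child relationship with a span created by the agent? The span creation logic of the user's SDK and of the agent's SDK may not be connected, so the agent's SDK may be unaware of spans that exist in the user's SDK. This makes span linking another tricky problem.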
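To make the second sub-problem concrete, the sketch below models what it means for two spans to be linked: they share a trace ID, and the child records the parent's span ID. `SpanIds` and `TraceStitching` are hypothetical names used only for illustration, not OTel types, and the ID values are arbitrary examples.

```java
// Hypothetical, simplified span: only the identifiers that matter for linking.
final class SpanIds {
  final String traceId;
  final String spanId;
  final String parentSpanId; // null for a root span

  SpanIds(String traceId, String spanId, String parentSpanId) {
    this.traceId = traceId;
    this.spanId = spanId;
    this.parentSpanId = parentSpanId;
  }
}

final class TraceStitching {
  // A span is a child of another span when both belong to the same trace
  // and the child's parent span ID equals the other span's span ID.
  static boolean isChildOf(SpanIds child, SpanIds parent) {
    return child.traceId.equals(parent.traceId)
        && parent.spanId.equals(child.parentSpanId);
  }

  public static void main(String[] args) {
    // Span created by the agent's automatic instrumentation.
    SpanIds agentSpan =
        new SpanIds("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", null);

    // User span created by an SDK that cannot see the agent's context:
    // it starts a new trace and has no parent.
    SpanIds isolatedUserSpan =
        new SpanIds("0af7651916cd43dd8448eb211c80319c", "b7ad6b7169203331", null);

    // User span created while the agent's span is visible as the current span.
    SpanIds bridgedUserSpan =
        new SpanIds(agentSpan.traceId, "b7ad6b7169203331", agentSpan.spanId);

    System.out.println(isChildOf(isolatedUserSpan, agentSpan)); // false
    System.out.println(isChildOf(bridgedUserSpan, agentSpan));  // true
  }
}
```

The first case is what happens when the two SDKs do not share context; the second is the outcome the agent has to produce.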
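How the OTel Agent Implements This
Solving problem 1:
The OTel agent isolates the user's SDK from the SDK inside the agent through class loaders and related mechanisms, which this article does not cover in detail. In short, there are two copies of the SDK, one for the user and one for the agent, and they do not interfere with each other.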
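As a rough illustration of the class-loader idea, the sketch below loads the same class through two loaders and shows that each copy keeps its own static state. It is a minimal sketch, not the agent's actual mechanism: `FakeSdk`, `IsolatingLoader`, and `IsolationDemo` are hypothetical names, and the code assumes the compiled `.class` file of `FakeSdk` is readable as a resource from the class path.

```java
import java.io.IOException;
import java.io.InputStream;
import java.lang.reflect.Field;

// Hypothetical stand-in for an SDK class that both the application and the agent ship.
class FakeSdk {
  static String version = "unset";
}

// Defines FakeSdk itself instead of delegating to the parent loader,
// so every IsolatingLoader instance owns a private copy of the class.
class IsolatingLoader extends ClassLoader {
  private static final String ISOLATED = FakeSdk.class.getName();

  IsolatingLoader(ClassLoader parent) {
    super(parent);
  }

  @Override
  protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
    if (!ISOLATED.equals(name)) {
      return super.loadClass(name, resolve); // everything else: normal parent-first delegation
    }
    synchronized (getClassLoadingLock(name)) {
      Class<?> loaded = findLoadedClass(name);
      if (loaded == null) {
        String resource = name.replace('.', '/') + ".class";
        try (InputStream in = getParent().getResourceAsStream(resource)) {
          if (in == null) {
            throw new ClassNotFoundException(name);
          }
          byte[] bytes = in.readAllBytes();
          loaded = defineClass(name, bytes, 0, bytes.length);
        } catch (IOException e) {
          throw new ClassNotFoundException(name, e);
        }
      }
      if (resolve) {
        resolveClass(loaded);
      }
      return loaded;
    }
  }
}

final class IsolationDemo {
  // Returns true when the two loaders hold independent copies of FakeSdk.
  static boolean isolated() {
    try {
      ClassLoader base = IsolationDemo.class.getClassLoader();
      Class<?> appSdk = new IsolatingLoader(base).loadClass(FakeSdk.class.getName());
      Class<?> agentSdk = new IsolatingLoader(base).loadClass(FakeSdk.class.getName());

      Field appVersion = appSdk.getDeclaredField("version");
      Field agentVersion = agentSdk.getDeclaredField("version");
      appVersion.setAccessible(true);
      agentVersion.setAccessible(true);

      appVersion.set(null, "X");   // the "user" copy
      agentVersion.set(null, "Y"); // the "agent" copy

      return appSdk != agentSdk
          && "X".equals(appVersion.get(null))
          && "Y".equals(agentVersion.get(null));
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(isolated());
  }
}
```

Because a class is identified by its name plus its defining class loader, writing to one copy never changes the other, which is the property that lets two SDK versions coexist in one JVM.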
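Solving problem 2:
The agent solves problem 2 by instrumenting the OTel SDK itself. The instrumentation covers several modules of the SDK; the OTel concepts involved are introduced in the following documents: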
- Baggage:https://opentelemetry.io/docs/concepts/signals/baggage/
- Propagators:https://opentelemetry.io/docs/concepts/context-propagation/
- Context & Span:https://opentelemetry.io/docs/concepts/signals/traces/#spans
- Tracer:https://opentelemetry.io/docs/concepts/signals/traces/#tracer-provider
First, let's walk through how a Span is created with the OTel SDK:
- Initialize the TracerProvider and the Propagators
- Create a Tracer from the TracerProvider and the Propagators
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.api.trace.propagation.W3CTraceContextPropagator;
import io.opentelemetry.context.propagation.ContextPropagators;
import io.opentelemetry.exporter.otlp.http.trace.OtlpHttpSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.resources.Resource;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;
import io.opentelemetry.semconv.resource.attributes.ResourceAttributes;

public class OpenTelemetrySupport {

  private static final Tracer tracer;

  static {
    // Describe the service that emits the telemetry.
    Resource resource = Resource.getDefault()
        .merge(Resource.create(Attributes.of(
            ResourceAttributes.SERVICE_NAME, "",           // fill in your service name
            ResourceAttributes.SERVICE_VERSION, "",        // fill in your service version
            ResourceAttributes.DEPLOYMENT_ENVIRONMENT, "", // fill in your deployment environment
            ResourceAttributes.HOST_NAME, "${host-name}"   // replace ${host-name} with your host name
        )));

    // Export spans in batches over OTLP/HTTP to the configured endpoint.
    SdkTracerProvider sdkTracerProvider = SdkTracerProvider.builder()
        .addSpanProcessor(BatchSpanProcessor.builder(OtlpHttpSpanExporter.builder()
            .setEndpoint("http://tracing-analysis-dc-hz-internal.aliyuncs.com/adapt_ggxw4lnjuz@7323a5caae30263_ggxw4lnjuz@53df7ad2afe8301/api/otlp/traces")
            .build()).build())
        .setResource(resource)
        .build();

    // Register the SDK globally with W3C Trace Context propagation.
    OpenTelemetry openTelemetry = OpenTelemetrySdk.builder()
        .setTracerProvider(sdkTracerProvider)
        .setPropagators(ContextPropagators.create(W3CTraceContextPropagator.getInstance()))
        .buildAndRegisterGlobal();

    // Obtain the OpenTelemetry Tracer.
    tracer = openTelemetry.getTracer("OpenTelemetry Tracer", "1.0.0");
  }

  public static Tracer getTracer() {
    return tracer;
  }
}
- Use the Tracer to build a Span, then call startSpan() to start it and end() to finish and report it
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.context.Scope;

public class Main {

  public static void parentMethod() {
    Span span = OpenTelemetrySupport.getTracer().spanBuilder("parent span").startSpan();
    try (Scope scope = span.makeCurrent()) {
      span.setAttribute("good", "job");
      childMethod();
    } catch (Throwable t) {
      span.setStatus(StatusCode.ERROR, "handle parent span error");
    } finally {
      span.end();
    }
  }

  public static void childMethod() {
    Span span = OpenTelemetrySupport.getTracer().spanBuilder("child span").startSpan();
    try (Scope scope = span.makeCurrent()) {
      span.setAttribute("hello", "world");
    } catch (Throwable t) {
      span.setStatus(StatusCode.ERROR, "handle child span error");
    } finally {
      span.end();
    }
  }

  public static void main(String[] args) {
    parentMethod();
  }
}
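The compatibility approach is simple: at the key points where the user creates a Span, wrapper classes proxy all of the operations above. Taking Span creation as an example, the instrumentation code looks like this: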
/*
 * Copyright The OpenTelemetry Authors
 * SPDX-License-Identifier: Apache-2.0
 */

package io.opentelemetry.javaagent.instrumentation.opentelemetryapi;

import static net.bytebuddy.matcher.ElementMatchers.isMethod;
import static net.bytebuddy.matcher.ElementMatchers.isStatic;
import static net.bytebuddy.matcher.ElementMatchers.named;

import application.io.opentelemetry.api.trace.Span;
import application.io.opentelemetry.api.trace.SpanContext;
import io.opentelemetry.javaagent.extension.instrumentation.TypeInstrumentation;
import io.opentelemetry.javaagent.extension.instrumentation.TypeTransformer;
import io.opentelemetry.javaagent.instrumentation.opentelemetryapi.trace.Bridging;
import net.bytebuddy.asm.Advice;
import net.bytebuddy.description.type.TypeDescription;
import net.bytebuddy.matcher.ElementMatcher;

public class SpanInstrumentation implements TypeInstrumentation {
  @Override
  public ElementMatcher<TypeDescription> typeMatcher() {
    return named("application.io.opentelemetry.api.trace.PropagatedSpan");
  }

  @Override
  public void transform(TypeTransformer transformer) {
    transformer.applyAdviceToMethod(
        isMethod().and(isStatic()).and(named("create")),
        SpanInstrumentation.class.getName() + "$CreateAdvice");
  }

  @SuppressWarnings("unused")
  public static class CreateAdvice {

    // We replace the return value completely so don't need to call the method.
    @Advice.OnMethodEnter(skipOn = Advice.OnDefaultValue.class)
    public static boolean methodEnter() {
      return false;
    }

    @Advice.OnMethodExit
    public static void methodExit(
        @Advice.Argument(0) SpanContext applicationSpanContext,
        @Advice.Return(readOnly = false) Span applicationSpan) {
      applicationSpan =
          Bridging.toApplication(
              io.opentelemetry.api.trace.Span.wrap(Bridging.toAgent(applicationSpanContext)));
    }
  }
}
It first converts the SpanContext from the user's OTel SDK into a SpanContext of the agent's SDK:
public static io.opentelemetry.api.trace.SpanContext toAgent(SpanContext applicationContext) {
  if (applicationContext.isRemote()) {
    return io.opentelemetry.api.trace.SpanContext.createFromRemoteParent(
        applicationContext.getTraceId(),
        applicationContext.getSpanId(),
        BridgedTraceFlags.toAgent(applicationContext.getTraceFlags()),
        toAgent(applicationContext.getTraceState()));
  } else {
    return io.opentelemetry.api.trace.SpanContext.create(
        applicationContext.getTraceId(),
        applicationContext.getSpanId(),
        BridgedTraceFlags.toAgent(applicationContext.getTraceFlags()),
        toAgent(applicationContext.getTraceState()));
  }
}
It then wraps this agent-side SpanContext in an agent-side Span, and finally puts a proxy around that Span so that it becomes a Span of the user's SDK:
public static Span toApplication(io.opentelemetry.api.trace.Span agentSpan) {
  if (!agentSpan.getSpanContext().isValid()) {
    // no need to wrap
    return Span.getInvalid();
  } else {
    return new ApplicationSpan(agentSpan);
  }
}
// Excerpt: imports omitted. Unqualified types (Span, SpanContext, Attributes, ...) are the
// application-side ones under application.io.opentelemetry; agent-side types are fully qualified.
class ApplicationSpan implements Span {

  private final io.opentelemetry.api.trace.Span agentSpan;

  ApplicationSpan(io.opentelemetry.api.trace.Span agentSpan) {
    this.agentSpan = agentSpan;
  }

  io.opentelemetry.api.trace.Span getAgentSpan() {
    return agentSpan;
  }

  @Override
  @CanIgnoreReturnValue
  public Span setAttribute(String key, String value) {
    agentSpan.setAttribute(key, value);
    return this;
  }

  @Override
  @CanIgnoreReturnValue
  public Span setAttribute(String key, long value) {
    agentSpan.setAttribute(key, value);
    return this;
  }

  @Override
  @CanIgnoreReturnValue
  public Span setAttribute(String key, double value) {
    agentSpan.setAttribute(key, value);
    return this;
  }

  @Override
  @CanIgnoreReturnValue
  public Span setAttribute(String key, boolean value) {
    agentSpan.setAttribute(key, value);
    return this;
  }

  @Override
  @CanIgnoreReturnValue
  public <T> Span setAttribute(AttributeKey<T> applicationKey, T value) {
    @SuppressWarnings("unchecked")
    io.opentelemetry.api.common.AttributeKey<T> agentKey = Bridging.toAgent(applicationKey);
    if (agentKey != null) {
      agentSpan.setAttribute(agentKey, value);
    }
    return this;
  }

  @Override
  @CanIgnoreReturnValue
  public Span addEvent(String name) {
    agentSpan.addEvent(name);
    return this;
  }

  @Override
  @CanIgnoreReturnValue
  public Span addEvent(String name, long timestamp, TimeUnit unit) {
    agentSpan.addEvent(name, timestamp, unit);
    return this;
  }

  @Override
  @CanIgnoreReturnValue
  public Span addEvent(String name, Attributes applicationAttributes) {
    agentSpan.addEvent(name, Bridging.toAgent(applicationAttributes));
    return this;
  }

  @Override
  @CanIgnoreReturnValue
  public Span addEvent(
      String name, Attributes applicationAttributes, long timestamp, TimeUnit unit) {
    agentSpan.addEvent(name, Bridging.toAgent(applicationAttributes), timestamp, unit);
    return this;
  }

  @Override
  @CanIgnoreReturnValue
  public Span setStatus(StatusCode status) {
    agentSpan.setStatus(Bridging.toAgent(status));
    return this;
  }

  @Override
  @CanIgnoreReturnValue
  public Span setStatus(StatusCode status, String description) {
    agentSpan.setStatus(Bridging.toAgent(status), description);
    return this;
  }

  @Override
  @CanIgnoreReturnValue
  public Span recordException(Throwable throwable) {
    agentSpan.recordException(throwable);
    return this;
  }

  @Override
  @CanIgnoreReturnValue
  public Span recordException(Throwable throwable, Attributes attributes) {
    agentSpan.recordException(throwable, Bridging.toAgent(attributes));
    return this;
  }

  @Override
  @CanIgnoreReturnValue
  public Span updateName(String name) {
    agentSpan.updateName(name);
    return this;
  }

  @Override
  public void end() {
    agentSpan.end();
  }

  @Override
  public void end(long timestamp, TimeUnit unit) {
    agentSpan.end(timestamp, unit);
  }

  @Override
  public SpanContext getSpanContext() {
    return Bridging.toApplication(agentSpan.getSpanContext());
  }

  @Override
  public boolean isRecording() {
    return agentSpan.isRecording();
  }

  @Override
  public boolean equals(@Nullable Object obj) {
    if (obj == this) {
      return true;
    }
    if (!(obj instanceof ApplicationSpan)) {
      return false;
    }
    ApplicationSpan other = (ApplicationSpan) obj;
    return agentSpan.equals(other.agentSpan);
  }

  @Override
  public String toString() {
    return "ApplicationSpan{agentSpan=" + agentSpan + '}';
  }

  @Override
  public int hashCode() {
    return agentSpan.hashCode();
  }

  static class Builder implements SpanBuilder {

    private final io.opentelemetry.api.trace.SpanBuilder agentBuilder;

    Builder(io.opentelemetry.api.trace.SpanBuilder agentBuilder) {
      this.agentBuilder = agentBuilder;
    }

    @Override
    @CanIgnoreReturnValue
    public SpanBuilder setParent(Context applicationContext) {
      agentBuilder.setParent(AgentContextStorage.getAgentContext(applicationContext));
      return this;
    }

    @Override
    @CanIgnoreReturnValue
    public SpanBuilder setNoParent() {
      agentBuilder.setNoParent();
      return this;
    }

    @Override
    @CanIgnoreReturnValue
    public SpanBuilder addLink(SpanContext applicationSpanContext) {
      agentBuilder.addLink(Bridging.toAgent(applicationSpanContext));
      return this;
    }

    @Override
    @CanIgnoreReturnValue
    public SpanBuilder addLink(
        SpanContext applicationSpanContext, Attributes applicationAttributes) {
      // Forward the link attributes as well; the listing as originally shown dropped them.
      agentBuilder.addLink(
          Bridging.toAgent(applicationSpanContext), Bridging.toAgent(applicationAttributes));
      return this;
    }

    @Override
    @CanIgnoreReturnValue
    public SpanBuilder setAttribute(String key, String value) {
      agentBuilder.setAttribute(key, value);
      return this;
    }

    @Override
    @CanIgnoreReturnValue
    public SpanBuilder setAttribute(String key, long value) {
      agentBuilder.setAttribute(key, value);
      return this;
    }

    @Override
    @CanIgnoreReturnValue
    public SpanBuilder setAttribute(String key, double value) {
      agentBuilder.setAttribute(key, value);
      return this;
    }

    @Override
    @CanIgnoreReturnValue
    public SpanBuilder setAttribute(String key, boolean value) {
      agentBuilder.setAttribute(key, value);
      return this;
    }

    @Override
    @CanIgnoreReturnValue
    public <T> SpanBuilder setAttribute(AttributeKey<T> applicationKey, T value) {
      @SuppressWarnings("unchecked")
      io.opentelemetry.api.common.AttributeKey<T> agentKey = Bridging.toAgent(applicationKey);
      if (agentKey != null) {
        agentBuilder.setAttribute(agentKey, value);
      }
      return this;
    }

    @Override
    @CanIgnoreReturnValue
    public SpanBuilder setSpanKind(SpanKind applicationSpanKind) {
      io.opentelemetry.api.trace.SpanKind agentSpanKind = toAgentOrNull(applicationSpanKind);
      if (agentSpanKind != null) {
        agentBuilder.setSpanKind(agentSpanKind);
      }
      return this;
    }

    @Override
    @CanIgnoreReturnValue
    public SpanBuilder setStartTimestamp(long startTimestamp, TimeUnit unit) {
      agentBuilder.setStartTimestamp(startTimestamp, unit);
      return this;
    }

    @Override
    public Span startSpan() {
      return new ApplicationSpan(agentBuilder.startSpan());
    }
  }
}
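As the code shows, the proxy class ApplicationSpan implements the Span interface of the OTel SDK used by the user's code, and every method simply forwards the call to the agent-side span. The instrumentation also skips the original create logic in the user's SDK: methodEnter returns false, and skipOn = Advice.OnDefaultValue.class tells Byte Buddy to skip the method body in that case. As a result only the agent-side logic runs, which avoids conflicts between the user's SDK and the agent.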
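The effect of this bridging on parent-child relationships can be reproduced in a few lines of plain Java. The sketch below is a simplified model, not the agent's code: `AgentSpan`, `AppSpan`, `BridgedSpan`, and `BridgedTracer` are hypothetical stand-ins for the agent-side API, the user-side API, and the proxies between them, and a single thread-local plays the role of the shared context.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the agent-side span (io.opentelemetry.* in the real agent).
final class AgentSpan {
  private static final ThreadLocal<AgentSpan> CURRENT = new ThreadLocal<>();
  private static final AtomicInteger NEXT_ID = new AtomicInteger(1);

  final int id;
  final int parentId; // 0 means "no parent"
  final String name;
  final Map<String, String> attributes = new LinkedHashMap<>();

  private AgentSpan(String name, int parentId) {
    this.id = NEXT_ID.getAndIncrement();
    this.parentId = parentId;
    this.name = name;
  }

  // Starts a span whose parent is whatever span is current on this thread.
  static AgentSpan start(String name) {
    AgentSpan parent = CURRENT.get();
    return new AgentSpan(name, parent == null ? 0 : parent.id);
  }

  void makeCurrent() {
    CURRENT.set(this);
  }

  static void clearCurrent() {
    CURRENT.remove();
  }
}

// Hypothetical stand-in for the user-side span interface (application.io.opentelemetry.*).
interface AppSpan {
  AppSpan setAttribute(String key, String value);
}

// The bridge: implements the user-side interface and forwards every call to the agent span.
final class BridgedSpan implements AppSpan {
  final AgentSpan agentSpan;

  BridgedSpan(AgentSpan agentSpan) {
    this.agentSpan = agentSpan;
  }

  @Override
  public AppSpan setAttribute(String key, String value) {
    agentSpan.attributes.put(key, value);
    return this;
  }
}

// What the user's tracer effectively does once its creation logic is replaced by the bridge.
final class BridgedTracer {
  static AppSpan startSpan(String name) {
    return new BridgedSpan(AgentSpan.start(name));
  }
}

final class BridgeDemo {
  public static void main(String[] args) {
    // Span created by automatic instrumentation, for example for an incoming request.
    AgentSpan server = AgentSpan.start("GET /orders");
    server.makeCurrent();

    // Span created by user code through the user-side API.
    BridgedSpan business =
        (BridgedSpan) BridgedTracer.startSpan("parent span").setAttribute("good", "job");
    AgentSpan.clearCurrent();

    System.out.println(business.agentSpan.parentId == server.id); // true
    System.out.println(business.agentSpan.attributes);            // {good=job}
  }
}
```

Because the user-side span is only a thin proxy, the span that actually exists is the agent's, so it picks up the agent's current span as its parent without the user's SDK having to know about it.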
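Summary
The OTel agent stays compatible with the user's OTel SDK by instrumenting and enhancing that SDK. Wrapping a few key OTel classes in proxies bridges the SDK and the agent cleanly.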