How the Observability Go Agent Enables Non-Intrusive Monitoring of Go Applications

2025-01-20

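Introduction

With the spread of Kubernetes and containerization, Go is widely used not only for cloud-native infrastructure components but also across many kinds of business workloads, and a growing number of new services choose it as their primary language. Thanks to a rich set of web and RPC frameworks (such as Gin, Kratos, and Kitex), Go's microservice ecosystem keeps maturing, and the language underpins many important open-source projects, including OpenTelemetry Collector, etcd, Prometheus, and Istio.

Compared with Java, however, Go's microservice ecosystem is still at a disadvantage. Java can use bytecode enhancement to monitor applications without touching their code, while Go has no mature equivalent. Today most monitoring for Go applications is integrated through an SDK, such as the OTel SDK, which requires developers to add instrumentation by hand. Manual instrumentation has the following problems:

Trace: every call site has to be instrumented, and the trace context must be propagated carefully so that spans are not linked incorrectly.
Metrics: every call has to be counted, while watching out for metric cardinality explosion.
The workload is heavy and intrusive to business code: every new interface needs matching instrumentation.

To solve these problems, the Observability Go Agent was created.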

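Observability Agent Architecture

Java can instrument code non-intrusively through the bytecode enhancement that the JVM provides. Go has no comparable capability, so we inject the instrumentation at compile time instead. The architecture is shown below:

Readers familiar with the Go build process will recognize the rough steps for compiling a Go application:

Create a temporary directory, with a command such as mkdir -p $WORK/b088/.
Resolve dependency information, with a command such as cat >/var/folders/7c/xvg9tyv929d1mqbd44ygh8400000gp/T/go-build2899987616/b104/importcfg << 'EOF' # internal. The importcfg file lists all dependencies.
Run compile to produce the object file xxx.a.
Generate the configuration file that link needs, then run link to turn the object files into an executable.
Move the executable to the current directory and delete the temporary directory.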

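Go also provides the -toolexec flag, which names a program that go build uses to invoke each toolchain tool (such as compile and link), so that program can intervene in the build: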

go build -toolexec xxx
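
To make the -toolexec mechanism concrete, here is a minimal sketch of a wrapper program. With -toolexec, go build runs the wrapper with the real tool's path as the first argument, followed by that tool's own arguments. The names isCompileTool and run are hypothetical and are not taken from the Go Agent; this wrapper only logs compile invocations and forwards everything unchanged.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"strings"
)

// isCompileTool reports whether the tool path handed to the wrapper is the
// Go compiler ("compile", or "compile.exe" on Windows).
func isCompileTool(toolPath string) bool {
	name := filepath.Base(toolPath)
	return strings.TrimSuffix(name, ".exe") == "compile"
}

// run forwards the invocation to the real tool unchanged. A real agent would
// rewrite the argument list at this point (assumption: for example, swapping
// in instrumented source files) before forwarding.
func run(args []string) error {
	if len(args) == 0 {
		return fmt.Errorf("usage: wrapper <tool> [args...]")
	}
	if isCompileTool(args[0]) {
		// Log to stderr so the tool's stdout stays untouched for go build.
		fmt.Fprintf(os.Stderr, "[wrapper] compile invoked with %d args\n", len(args)-1)
	}
	cmd := exec.Command(args[0], args[1:]...)
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	return cmd.Run()
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: wrapper <tool> [args...]")
		return
	}
	if err := run(os.Args[1:]); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

After building this program, it could be used as go build -toolexec /path/to/wrapper, where the path is a placeholder for wherever the binary lives.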

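So how does the Go Agent inject instrumentation at compile time? The following sections cover each part:

Locating Instrumentation Points

A microservice usually contains a lot of code, so how do we find the points where code must be inserted? We use the abstract syntax tree (AST) and analyze the syntax of every .go file. An AST is built in two steps:

The lexer performs lexical analysis on the source file and produces a stream of tokens.
The parser consumes the token stream and builds the AST.

Go ships these libraries in its standard library:

go/scanner: lexical analysis; splits source code into tokens
go/token: token types and related struct definitions
go/ast: definitions of the AST structures
go/parser: syntax analysis; reads the token stream and builds the AST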
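
As a small illustration of the lexical step, the sketch below uses go/scanner and go/token to split a source string into tokens. The helper name tokenize and the file name demo.go are made up for this example.

```go
package main

import (
	"fmt"
	"go/scanner"
	"go/token"
)

// tokenize splits Go source into tokens using go/scanner and returns one
// string per token in the form: <token kind> "<literal text>".
func tokenize(src string) []string {
	fset := token.NewFileSet()
	file := fset.AddFile("demo.go", fset.Base(), len(src))

	var s scanner.Scanner
	// nil error handler and mode 0: errors are only counted, comments skipped.
	s.Init(file, []byte(src), nil, 0)

	var out []string
	for {
		_, tok, lit := s.Scan()
		if tok == token.EOF {
			break
		}
		out = append(out, fmt.Sprintf("%s %q", tok, lit))
	}
	return out
}

func main() {
	for _, t := range tokenize("package hello") {
		fmt.Println(t)
	}
}
```

The scanner reports the package keyword first and the identifier hello second; go/parser consumes the same kind of token stream when it builds the AST.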

The following demo illustrates the AST:

package hello

import "fmt"

func greet() {
    var msg = "Hello World!"
    fmt.Println(msg)
}

The parsed result is shown below:

0  *ast.File {
     1  .  Package: 2:1
     2  .  Name: *ast.Ident {
     3  .  .  NamePos: 2:9
     4  .  .  Name: "hello"
     5  .  }
     6  .  Decls: []ast.Decl (len = 2) {
     7  .  .  0: *ast.GenDecl {
     8  .  .  .  TokPos: 4:1
     9  .  .  .  Tok: import
    10  .  .  .  Lparen: -
    11  .  .  .  Specs: []ast.Spec (len = 1) {
    12  .  .  .  .  0: *ast.ImportSpec {
    13  .  .  .  .  .  Path: *ast.BasicLit {
    14  .  .  .  .  .  .  ValuePos: 4:8
    15  .  .  .  .  .  .  Kind: STRING
    16  .  .  .  .  .  .  Value: "\"fmt\""
    17  .  .  .  .  .  }
    18  .  .  .  .  .  EndPos: -
    19  .  .  .  .  }
    20  .  .  .  }
    21  .  .  .  Rparen: -
    22  .  .  }
    23  .  .  1: *ast.FuncDecl {
    24  .  .  .  Name: *ast.Ident {
    25  .  .  .  .  NamePos: 6:6
    26  .  .  .  .  Name: "greet"
    27  .  .  .  .  Obj: *ast.Object {
    28  .  .  .  .  .  Kind: func
    29  .  .  .  .  .  Name: "greet"
    30  .  .  .  .  .  Decl: *(obj @ 23)
    31  .  .  .  .  }
    32  .  .  .  }
    33  .  .  .  Type: *ast.FuncType {
    34  .  .  .  .  Func: 6:1
    35  .  .  .  .  Params: *ast.FieldList {
    36  .  .  .  .  .  Opening: 6:11
    37  .  .  .  .  .  Closing: 6:12
    38  .  .  .  .  }
    39  .  .  .  }
    40  .  .  .  Body: *ast.BlockStmt {
    41  .  .  .  .  Lbrace: 6:14
    42  .  .  .  .  List: []ast.Stmt (len = 2) {
    43  .  .  .  .  .  0: *ast.DeclStmt {
    44  .  .  .  .  .  .  Decl: *ast.GenDecl {
    45  .  .  .  .  .  .  .  TokPos: 7:5
    46  .  .  .  .  .  .  .  Tok: var
    47  .  .  .  .  .  .  .  Lparen: -
    48  .  .  .  .  .  .  .  Specs: []ast.Spec (len = 1) {
    49  .  .  .  .  .  .  .  .  0: *ast.ValueSpec {
    50  .  .  .  .  .  .  .  .  .  Names: []*ast.Ident (len = 1) {
    51  .  .  .  .  .  .  .  .  .  .  0: *ast.Ident {
    52  .  .  .  .  .  .  .  .  .  .  .  NamePos: 7:9
    53  .  .  .  .  .  .  .  .  .  .  .  Name: "msg"
    54  .  .  .  .  .  .  .  .  .  .  .  Obj: *ast.Object {
    55  .  .  .  .  .  .  .  .  .  .  .  .  Kind: var
    56  .  .  .  .  .  .  .  .  .  .  .  .  Name: "msg"
    57  .  .  .  .  .  .  .  .  .  .  .  .  Decl: *(obj @ 49)
    58  .  .  .  .  .  .  .  .  .  .  .  .  Data: 0
    59  .  .  .  .  .  .  .  .  .  .  .  }
    60  .  .  .  .  .  .  .  .  .  .  }
    61  .  .  .  .  .  .  .  .  .  }
    62  .  .  .  .  .  .  .  .  .  Values: []ast.Expr (len = 1) {
    63  .  .  .  .  .  .  .  .  .  .  0: *ast.BasicLit {
    64  .  .  .  .  .  .  .  .  .  .  .  ValuePos: 7:15
    65  .  .  .  .  .  .  .  .  .  .  .  Kind: STRING
    66  .  .  .  .  .  .  .  .  .  .  .  Value: "\"Hello World!\""
    67  .  .  .  .  .  .  .  .  .  .  }
    68  .  .  .  .  .  .  .  .  .  }
    69  .  .  .  .  .  .  .  .  }
    70  .  .  .  .  .  .  .  }
    71  .  .  .  .  .  .  .  Rparen: -
    72  .  .  .  .  .  .  }
    73  .  .  .  .  .  }
    74  .  .  .  .  .  1: *ast.ExprStmt {
    75  .  .  .  .  .  .  X: *ast.CallExpr {
    76  .  .  .  .  .  .  .  Fun: *ast.SelectorExpr {
    77  .  .  .  .  .  .  .  .  X: *ast.Ident {
    78  .  .  .  .  .  .  .  .  .  NamePos: 8:5
    79  .  .  .  .  .  .  .  .  .  Name: "fmt"
    80  .  .  .  .  .  .  .  .  }
    81  .  .  .  .  .  .  .  .  Sel: *ast.Ident {
    82  .  .  .  .  .  .  .  .  .  NamePos: 8:9
    83  .  .  .  .  .  .  .  .  .  Name: "Println"
    84  .  .  .  .  .  .  .  .  }
    85  .  .  .  .  .  .  .  }
    86  .  .  .  .  .  .  .  Lparen: 8:16
    87  .  .  .  .  .  .  .  Args: []ast.Expr (len = 1) {
    88  .  .  .  .  .  .  .  .  0: *ast.Ident {
    89  .  .  .  .  .  .  .  .  .  NamePos: 8:17
    90  .  .  .  .  .  .  .  .  .  Name: "msg"
    91  .  .  .  .  .  .  .  .  .  Obj: *(obj @ 54)
    92  .  .  .  .  .  .  .  .  }
    93  .  .  .  .  .  .  .  }
    94  .  .  .  .  .  .  .  Ellipsis: -
    95  .  .  .  .  .  .  .  Rparen: 8:20
    96  .  .  .  .  .  .  }
    97  .  .  .  .  .  }
    98  .  .  .  .  }
    99  .  .  .  .  Rbrace: 9:1
   100  .  .  .  }
   101  .  .  }
   102  .  }
   103  .  FileStart: 1:1
   104  .  FileEnd: 9:3
   105  .  Scope: *ast.Scope {
   106  .  .  Objects: map[string]*ast.Object (len = 1) {
   107  .  .  .  "greet": *(obj @ 27)
   108  .  .  }
   109  .  }
   110  .  Imports: []*ast.ImportSpec (len = 1) {
   111  .  .  0: *(obj @ 12)
   112  .  }
   113  .  Unresolved: []*ast.Ident (len = 1) {
   114  .  .  0: *(obj @ 77)
   115  .  }
   116  .  GoVersion: ""
   117  }

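Here ast.Ident represents an identifier (the package name hello in this case), ast.GenDecl represents every declaration other than a function, that is, those introduced by the import, const, var, and type keywords, and ast.FuncDecl represents a function declaration together with its parameters and body.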
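
To show how such a tree can be searched for candidate instrumentation points, the following sketch parses the demo source with go/parser and walks it with ast.Inspect, collecting the name of every *ast.FuncDecl. The helper funcNames is hypothetical and only illustrates the lookup; it is not the Agent's own matching logic.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

const demoSrc = `package hello

import "fmt"

func greet() {
	var msg = "Hello World!"
	fmt.Println(msg)
}
`

// funcNames parses src and returns the names of all function declarations.
func funcNames(src string) ([]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "hello.go", src, 0)
	if err != nil {
		return nil, err
	}
	var names []string
	ast.Inspect(file, func(n ast.Node) bool {
		if fn, ok := n.(*ast.FuncDecl); ok {
			names = append(names, fn.Name.Name)
		}
		return true // keep descending into child nodes
	})
	return names, nil
}

func main() {
	names, err := funcNames(demoSrc)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(names)
}
```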

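Code Injection

The syntax analysis above tells us how the code in the Go service is written. We then modify the resulting syntax tree and add the monitoring logic, such as span creation, to it.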

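The Agent provides a code injection framework whose API is shown below. A rule specifies what to instrument (which SDK, which version range, which function, and which receiver type) and what code runs before and after the instrumented function.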

package api

type InstrumentPriority int

const (
  InstrumentPointDefault   InstrumentPriority = 0
  InstrumentPriorityLow    InstrumentPriority = 0
  InstrumentPriorityMedium InstrumentPriority = 1
  InstrumentPriorityHigh   InstrumentPriority = 2
)

type InstrumentRule struct {
  Version      string             // Version of the rule, e.g. "v1.9.1" ====>[start,end)
  PkgName      string             // Package name, e.g. "gin"
  FullPkgName  string             // Full package name, e.g. "github.com/gin-gonic/gin"
  FuncName     string             // Function name, e.g. "New"
  RecvTypeName string             // Receiver type name, e.g. "*gin.Engine" or "net/http.*Client"
  Priority     InstrumentPriority // Priority of the rule, indicates the order of the rule
  OnEnter      string             // OnEnter callback
  OnExit       string             // OnExit callback
}

type CallContext struct {
  SkipCall bool
  Context  map[string]interface{}
}

For example, the instrumentation rule for go-redis looks like this:

r1 := api.NewRule("github.com/redis/go-redis/v9", "NewFailoverClient", "", "", "afterNewFailOverRedisClient").WithVersion("[9.0.5,9.5.2)")
api.RegisterRule(r1)

Here afterNewFailOverRedisClient is the code we want to insert into the NewFailoverClient function. This API makes it easy to define instrumentation hooks and to extend them.
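
The rule above carries a half-open version range, [9.0.5,9.5.2). How the Agent evaluates such ranges is not shown here, so the following is only a sketch of one way to check whether a dependency version falls inside a [start,end) range. The helpers parseVersion, less, and inRange are hypothetical, and the sketch assumes plain major.minor.patch versions without pre-release suffixes.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion converts "v9.0.5" or "9.0.5" into three integers.
func parseVersion(v string) ([3]int, error) {
	var out [3]int
	parts := strings.Split(strings.TrimPrefix(v, "v"), ".")
	if len(parts) != 3 {
		return out, fmt.Errorf("expected major.minor.patch, got %q", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return out, fmt.Errorf("bad component %q in %q: %w", p, v, err)
		}
		out[i] = n
	}
	return out, nil
}

// less reports whether version a sorts strictly before version b.
func less(a, b [3]int) bool {
	for i := range a {
		if a[i] != b[i] {
			return a[i] < b[i]
		}
	}
	return false
}

// inRange reports whether version lies in the half-open range "[start,end)".
func inRange(version, rng string) (bool, error) {
	if !strings.HasPrefix(rng, "[") || !strings.HasSuffix(rng, ")") {
		return false, fmt.Errorf("range must look like [start,end): %q", rng)
	}
	bounds := strings.Split(rng[1:len(rng)-1], ",")
	if len(bounds) != 2 {
		return false, fmt.Errorf("range needs two bounds: %q", rng)
	}
	v, err := parseVersion(version)
	if err != nil {
		return false, err
	}
	start, err := parseVersion(strings.TrimSpace(bounds[0]))
	if err != nil {
		return false, err
	}
	end, err := parseVersion(strings.TrimSpace(bounds[1]))
	if err != nil {
		return false, err
	}
	// start is inclusive, end is exclusive.
	return !less(v, start) && less(v, end), nil
}

func main() {
	for _, v := range []string{"v9.0.4", "v9.0.5", "v9.5.1", "v9.5.2"} {
		ok, err := inRange(v, "[9.0.5,9.5.2)")
		fmt.Println(v, ok, err)
	}
}
```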

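Hybrid Compilation

Once the instrumentation points are found and the instrumentation code is inserted through the API, the hybrid compilation stage begins: the inserted code is compiled together with the existing code, producing the final binary. The Go Agent is invoked to compile the code as follows: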

CGO_ENABLED=0 GOOS=linux GOARCH=amd64 ./aliyun-go-agent

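Observability Trace/Metrics Capabilities

Having covered the compilation and injection flow, we now describe the Trace and Metrics capabilities that the Go Agent injects. Trace and Metrics are the two most important parts of observability (Profiling support will follow later) and are critical for monitoring application stability.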

Trace

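Trace Instrumentation

Trace means distributed tracing: from a single trace you can find every interface invoked during one call, along with data such as latency, as shown below:

In the Go Agent, tracer.Start() is called at the beginning of each instrumented call: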

ctx, span := tracer.Start(req.Context(), req.URL.Path, opts...)

and span.End() is called when the instrumented call finishes:

span.End()

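Trace Context Propagation

How do we make sure the trace context is not lost between different instrumentation points within the same application? We add a TLS-like context variable to the goroutine. A goroutine is described by the following struct: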

type g struct {
  // Stack parameters.
  // stack describes the actual stack memory: [stack.lo, stack.hi).
  // stackguard0 is the stack pointer compared in the Go stack growth prologue.
  // It is stack.lo+StackGuard normally, but can be StackPreempt to trigger a preemption.
  // stackguard1 is the stack pointer compared in the //go:systemstack stack growth prologue.
  // It is stack.lo+StackGuard on g0 and gsignal stacks.
  // It is ~0 on other goroutine stacks, to trigger a call to morestackc (and crash).
  stack       stack   // offset known to runtime/cgo
  stackguard0 uintptr // offset known to liblink
  stackguard1 uintptr // offset known to liblink

  _panic    *_panic // innermost panic - offset known to liblink
  _defer    *_defer // innermost defer
  m         *m      // current m; offset known to arm liblink
  sched     gobuf
  syscallsp uintptr // if status==Gsyscall, syscallsp = sched.sp to use during gc
  syscallpc uintptr // if status==Gsyscall, syscallpc = sched.pc to use during gc
  stktopsp  uintptr // expected sp at top of stack, to check in traceback
  // param is a generic pointer parameter field used to pass
  // values in particular contexts where other storage for the
  // parameter would be difficult to find. It is currently used
  // in four ways:
  // 1. When a channel operation wakes up a blocked goroutine, it sets param to
  //    point to the sudog of the completed blocking operation.
  // 2. By gcAssistAlloc1 to signal back to its caller that the goroutine completed
  //    the GC cycle. It is unsafe to do so in any other way, because the goroutine's
  //    stack may have moved in the meantime.
  // 3. By debugCallWrap to pass parameters to a new goroutine because allocating a
  //    closure in the runtime is forbidden.
  // 4. When a panic is recovered and control returns to the respective frame,
  //    param may point to a savedOpenDeferState.
  param        unsafe.Pointer
  atomicstatus atomic.Uint32
  stackLock    uint32 // sigprof/scang lock; TODO: fold in to atomicstatus
  goid         uint64
  schedlink    guintptr
  waitsince    int64      // approx time when the g become blocked
  waitreason   waitReason // if status==Gwaiting

  preempt       bool // preemption signal, duplicates stackguard0 = stackpreempt
  preemptStop   bool // transition to _Gpreempted on preemption; otherwise, just deschedule
  preemptShrink bool // shrink stack at synchronous safe point

  // asyncSafePoint is set if g is stopped at an asynchronous
  // safe point. This means there are frames on the stack
  // without precise pointer information.
  asyncSafePoint bool

  paniconfault bool // panic (instead of crash) on unexpected fault address
  gcscandone   bool // g has scanned stack; protected by _Gscan bit in status
  throwsplit   bool // must not split stack
  // activeStackChans indicates that there are unlocked channels
  // pointing into this goroutine's stack. If true, stack
  // copying needs to acquire channel locks to protect these
  // areas of the stack.
  activeStackChans bool
  // parkingOnChan indicates that the goroutine is about to
  // park on a chansend or chanrecv. Used to signal an unsafe point
  // for stack shrinking.
  parkingOnChan atomic.Bool
  // inMarkAssist indicates whether the goroutine is in mark assist.
  // Used by the execution tracer.
  inMarkAssist bool
  coroexit     bool // argument to coroswitch_m

  raceignore    int8  // ignore race detection events
  nocgocallback bool  // whether disable callback from C
  tracking      bool  // whether we're tracking this G for sched latency statistics
  trackingSeq   uint8 // used to decide whether to track this G
  trackingStamp int64 // timestamp of when the G last started being tracked
  runnableTime  int64 // the amount of time spent runnable, cleared when running, only used when tracking
  lockedm       muintptr
  sig           uint32
  writebuf      []byte
  sigcode0      uintptr
  sigcode1      uintptr
  sigpc         uintptr
  parentGoid    uint64          // goid of goroutine that created this goroutine
  gopc          uintptr         // pc of go statement that created this goroutine
  ancestors     *[]ancestorInfo // ancestor information goroutine(s) that created this goroutine (only used if debug.tracebackancestors)
  startpc       uintptr         // pc of goroutine function
  racectx       uintptr
  waiting       *sudog         // sudog structures this g is waiting on (that have a valid elem ptr); in lock order
  cgoCtxt       []uintptr      // cgo traceback context
  labels        unsafe.Pointer // profiler labels
  timer         *timer         // cached timer for time.Sleep
  selectDone    atomic.Uint32  // are we participating in a select and did someone win the race?

  coroarg *coro // argument during coroutine transfers

  // goroutineProfiled indicates the status of this goroutine's stack for the
  // current in-progress goroutine profile
  goroutineProfiled goroutineProfileStateHolder

  // Per-G tracer state.
  trace gTraceState

  // Per-G GC state

  // gcAssistBytes is this G's GC assist credit in terms of
  // bytes allocated. If this is positive, then the G has credit
  // to allocate gcAssistBytes bytes without assisting. If this
  // is negative, then the G must correct this by performing
  // scan work. We track this in bytes to make it fast to update
  // and check for debt in the malloc hot path. The assist ratio
  // determines how this corresponds to scan work debt.
  gcAssistBytes int64
}

Using compile-time injection, we add a field named trace_tls to this struct to store the trace context.

trace_tls SpanContext

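When code calls tracer.Start directly, the instrumentation looks up the trace context in g. When a new goroutine is created, we also instrument the newproc1 function so that the parent goroutine's trace information is passed to the child goroutine. This ensures that all context belonging to a single trace ID is linked together.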
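
The inheritance rule can be modeled in ordinary Go without touching the runtime. The sketch below is a toy simulation only: the g type here is a stand-in with two fields, not runtime.g, and newproc and startSpan are hypothetical functions that mimic what the instrumented newproc1 and tracer.Start are described as doing.

```go
package main

import "fmt"

// SpanContext is a simplified stand-in for the trace context kept per goroutine.
type SpanContext struct {
	TraceID string
	SpanID  string
}

// g is a toy model of a goroutine descriptor carrying the extra trace field.
// It is NOT the real runtime.g.
type g struct {
	goid     uint64
	traceTLS *SpanContext
}

var nextGoid uint64 = 1

// newproc models goroutine creation: the child receives a copy of the
// parent's trace context, so spans started in the child join the same trace.
func newproc(parent *g) *g {
	child := &g{goid: nextGoid}
	nextGoid++
	if parent != nil && parent.traceTLS != nil {
		ctx := *parent.traceTLS // copy, so the child cannot mutate the parent's context
		child.traceTLS = &ctx
	}
	return child
}

// startSpan models a span start that consults the current goroutine: it
// reuses the stored trace ID if there is one, otherwise it begins a new trace.
// For brevity the sketch does not restore the previous context on span end.
func startSpan(cur *g, spanID string) SpanContext {
	traceID := "trace-" + spanID
	if cur.traceTLS != nil {
		traceID = cur.traceTLS.TraceID
	}
	sc := SpanContext{TraceID: traceID, SpanID: spanID}
	cur.traceTLS = &sc
	return sc
}

func main() {
	root := newproc(nil)
	parentSpan := startSpan(root, "a")

	child := newproc(root) // stands in for a `go` statement in the parent
	childSpan := startSpan(child, "b")

	fmt.Println(parentSpan.TraceID == childSpan.TraceID)
}
```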

Metrics

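Metrics are collected much like traces: each instrumentation point records data such as call count, duration, errors, and slow requests. To avoid the performance problems caused by metric cardinality explosion, we reduce the number of metrics through convergence, using the following two convergers:

Regular converger: transforms input directly into output according to configured rules, for example normalizing URLs or SQL statements.
Limit converger: caps the total size of the value domain after convergence. Internally it maintains, in various ways, an allowlist with an upper bound Limit. Typically, convergence is triggered when the allowlist is full and the value to converge is not in it.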
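
The two convergers can be sketched as follows. This is an illustration under assumptions, not the Agent's implementation: convergeURL stands in for a regular converger that replaces all-digit path segments with a placeholder, and limitConverger stands in for a limit converger that admits values until its allowlist is full and maps everything else to a single overflow value. All names and the placeholder strings are made up for this example.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// convergeURL replaces purely numeric path segments with "{id}" so that
// /users/1 and /users/2 collapse into one metric label value.
func convergeURL(path string) string {
	segs := strings.Split(path, "/")
	for i, s := range segs {
		if s != "" && strings.Trim(s, "0123456789") == "" {
			segs[i] = "{id}"
		}
	}
	return strings.Join(segs, "/")
}

// overflowValue is the label used once the allowlist is full.
const overflowValue = "{other}"

// limitConverger bounds the number of distinct values it lets through.
type limitConverger struct {
	mu      sync.Mutex
	limit   int
	allowed map[string]struct{}
}

func newLimitConverger(limit int) *limitConverger {
	return &limitConverger{limit: limit, allowed: make(map[string]struct{})}
}

// Converge returns v unchanged if it is already allowlisted or there is still
// room to admit it; otherwise it returns overflowValue.
func (c *limitConverger) Converge(v string) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.allowed[v]; ok {
		return v
	}
	if len(c.allowed) < c.limit {
		c.allowed[v] = struct{}{}
		return v
	}
	return overflowValue
}

func main() {
	lc := newLimitConverger(2)
	for _, p := range []string{"/users/1", "/users/2", "/orders/7/items", "/health"} {
		fmt.Println(p, "->", lc.Converge(convergeURL(p)))
	}
}
```

Chaining the two keeps common routes distinguishable while guaranteeing an upper bound on the number of label values.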

Go Agent Plugin

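The Agent supports 20 common microservice frameworks, middleware SDKs, and more, and is compatible with the OTel SDK.

	plugin	repository	minimum version	maximum version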
1	net/http	https://pkg.go.dev/net/http	v1.18	v1.21
2	go-restful	https://github.com/emicklei/go-restful	v3.7.0	v3.12.0
3	fasthttp	https://github.com/valyala/fasthttp	v1.50.0	v1.54.0
4	go-zero	https://github.com/zeromicro/go-zero	v1.5.0	v1.6.5
5	echo	https://github.com/labstack/echo	v4.11.4	v4.12.0
6	gin	https://github.com/gin-gonic/gin	v1.8.0	v1.9.0
7	mux	https://github.com/gorilla/mux	v1.8.1
8	dubbo	https://github.com/apache/dubbo-go	v3.0.1	v3.1.0
9	kratos	https://github.com/go-kratos/kratos	v2.5.2	v2.7.3
10	go-micro	https://github.com/go-micro/go-micro	v4.9.0	v4.11.0
11	grpc	https://github.com/grpc/grpc-go	v1.55.0	v1.64.0
12	go-redis	https://github.com/redis/go-redis	v9.0.3	v9.0.5
13	rocketmq-client-go	https://github.com/apache/rocketmq-client-go	v2.1.0	v2.1.2
14	amqp	https://github.com/rabbitmq/amqp091-go	v1.9.0	v1.10.0
15	Go standard library database/sql	https://pkg.go.dev/database/sql	v1.18	v1.21
16	go-sql-driver	https://github.com/go-sql-driver/mysql	v1.4.0	v1.7.1
17	mongo	https://github.com/mongodb/mongo-go-driver	v1.11.1	v1.11.7
18	gorm	https://github.com/go-gorm/gorm	v1.22.0	v1.25.1
19	otel sdk	https://github.com/open-telemetry/opentelemetry-go	v1.6.0	v1.26.0
20	kitex	https://github.com/cloudwego/kitex	v0.9.0	v0.10.0

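Go Agent Compatibility

OTel SDK Compatibility

We support the OTel SDK from v1.6.0 through v1.26.0. Code that already uses the OTel SDK for instrumentation keeps working; in the example below, tracer.Start creates a user-defined span: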

for {
    tracer := otel.GetTracerProvider().Tracer("")
    ctx, span := tracer.Start(context.Background(), "Client/User defined span")
    for i := 0; i < 3; i++ {
      otel.GetTextMapPropagator()
      //req, err := http.NewRequestWithContext(ctx, "GET", "http://localhost:9000/http-service1", nil)
      req, err := http.NewRequestWithContext(ctx, "GET", "http://otel-server:9000/http-service1", nil)
      if err != nil {
        fmt.Println(err.Error())
        continue
      }
      client := &http.Client{}
      resp, err := client.Do(req)
      if err != nil {
        fmt.Println(err.Error())
        continue
      }
      b, err := io.ReadAll(resp.Body)
      resp.Body.Close() // close right away; a defer inside this endless loop would never run
      if err != nil {
        fmt.Println(err.Error())
        continue
      }
      fmt.Println(string(b))
      fmt.Println(resp.Header)
      time.Sleep(time.Millisecond * 10)
    }
    sc := span.SpanContext()
    fmt.Println(sc.TraceID())
    fmt.Println(sc.SpanID())
    span.SetAttributes(attribute.String("client", "client-with-ot"))
    span.SetAttributes(attribute.Bool("user.defined", true))
    span.End()
    time.Sleep(time.Millisecond * 10)
  }