Source file src/runtime/trace.go
// Copyright 2023 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

// # Go execution tracer
//
// The tracer captures a wide range of execution events like goroutine
// creation/blocking/unblocking, syscall enter/exit/block, GC-related events,
// changes of heap size, processor start/stop, etc., and writes them to a buffer
// in a compact form. A nanosecond-precision timestamp and a stack trace are
// captured for most events.
//
// ## Design
//
// The basic idea behind the execution tracer is to have per-M buffers that
// trace data may be written into. Each M maintains a write flag indicating whether
// its trace buffer is currently in use.
//
// Tracing is initiated by StartTrace and proceeds in "generations," with each
// generation ended by a call to traceAdvance, which moves tracing on to the next
// generation. Generations are a global synchronization point for trace data,
// and we proceed to a new generation by advancing trace.gen. Each M reads
// trace.gen under its own write flag to determine which generation it is writing
// trace data for. To this end, each M has 2 slots for buffers: one slot for the
// previous generation, one slot for the current one. It uses the generation it
// observed (tl.gen) to select which buffer slot to write to. Simultaneously,
// traceAdvance uses the write flag to determine whether every thread is guaranteed
// to observe an updated trace.gen. Once it is sure, it may then safely flush any
// buffers that are left over from the previous generation, since it knows the Ms
// will not mutate them.
//
// Flushed buffers are processed by the ReadTrace function, which is called by
// the trace reader goroutine. The first goroutine to call ReadTrace is designated
// as the trace reader goroutine until tracing completes. (There may only be one at
// a time.)
//
// Once all buffers are flushed, any extra post-processing is complete, and the
// flushed buffers have been processed by the trace reader goroutine, the trace
// emits an EndOfGeneration event to mark the global synchronization point in the
// trace.
//
// All other trace features, including CPU profile samples, stack information,
// string tables, etc., revolve around this generation system, and typically
// appear in pairs: one for the previous generation, and one for the current one.
// Like the per-M buffers, which of the two is written to is selected using trace.gen,
// and anything managed this way must similarly be mutated only in traceAdvance or
// under the M's write flag.
//
// Trace events themselves are simple. They consist of a single byte for the event type,
// followed by zero or more LEB128-encoded unsigned varints. They are decoded using
// a pre-determined table for each trace version: internal/trace/tracev2.specs.
//
// To avoid relying on timestamps for correctness and validation, each G and P has
// a sequence counter that is written into trace events to encode a partial order.
// The sequence counters reset on each generation. Ms do not need sequence counters
// because they are the source of truth for execution: trace events, and even whole
// buffers, are guaranteed to appear in order in the trace data stream, simply because
// that's the order the thread emitted them in.
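//
// As a purely illustrative sketch of that wire format (the event type byte and
// argument values below are made up, not taken from internal/trace/tracev2.specs):
// an event with type byte 0x14 and two unsigned arguments, 1 and 300, would be
// written out as
//
//	0x14      // event type
//	0x01      // LEB128(1)
//	0xac 0x02 // LEB128(300): low 7 bits 0x2c with the continuation bit set, then 0x02
//
// with any further arguments for that event encoded the same way.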
//
// See traceruntime.go for the API the tracer exposes to the runtime for emitting events.
//
// In each generation, we ensure that we enumerate all goroutines, such that each
// generation's data is fully self-contained. This makes features like the flight
// recorder easy to implement. To this end, we guarantee that every live goroutine is
// listed at least once by emitting a status event for the goroutine, indicating its
// starting state. These status events are emitted based on context, generally based
// on the event that's about to be emitted.
//
// The traceEventWriter type encapsulates these details, and is the backbone of
// the API exposed in traceruntime.go, though there are deviations where necessary.
//
// This is the overall design, but as always, there are many details. Beyond this,
// look to the invariants and select corner cases below, and to the code itself, for the
// source of truth.
//
// See https://go.dev/issue/60773 for a link to a more complete design with rationale,
// though parts of it are out-of-date.
//
// ## Invariants
//
// 1. An M that has a trace buffer MUST be on either the allm or sched.freem lists.
//
// Otherwise, traceAdvance might miss an M with a buffer that needs to be flushed.
//
// 2. Trace buffers MUST only be mutated in traceAdvance or under a traceAcquire/traceRelease pair.
//
// Otherwise, traceAdvance may race with Ms writing trace data when trying to flush buffers.
//
// 3. traceAdvance MUST NOT return until all of the current generation's buffers are flushed.
//
// Otherwise, callers cannot rely on all the data they need being available (for example, for
// the flight recorder).
//
// 4. P and goroutine state transition events MUST be emitted by an M that owns the ability
// to perform that transition.
//
// What this means is that the M must either be the owner of the P, the owner of the goroutine,
// or the owner of a non-running goroutine's _Gscan bit. There are a lot of bad things that can
// happen if this invariant isn't maintained, mostly around generating inconsistencies in the
// trace due to racy emission of events.
//
// 5. Acquisition of a P (pidleget or takeP/gcstopP) MUST NOT be performed under a traceAcquire/traceRelease pair.
//
// Notably, it's important that traceAcquire/traceRelease not cover a state in which the
// goroutine or P is not yet owned. For example, if traceAcquire is held across both wirep and
// pidleget, then we could end up emitting an event in the wrong generation. Suppose T1
// traceAcquires in generation 1, a generation transition happens, T2 emits a ProcStop and
// executes pidleput in generation 2, and finally T1 calls pidleget and emits ProcStart.
// The ProcStart must follow the ProcStop in the trace to make any sense, but the ProcStop
// would appear in a later generation than the ProcStart.
//
// 6. Goroutine state transitions, with the exception of transitions into _Grunning, MUST be
// performed under the traceAcquire/traceRelease pair where the event is emitted.
//
// Otherwise, traceAdvance may observe a goroutine state that is inconsistent with the
// events being emitted. traceAdvance inspects all goroutines' states in order to emit
// a status event for any goroutine that did not have an event emitted for it already.
// If the generation then advances in between that observation and the event being emitted,
// then the trace will contain a status that doesn't line up with the event. For example,
// if the event is emitted after the state transition _Gwaiting -> _Grunnable, then
// traceAdvance may observe the goroutine in _Grunnable, emit a status event, advance the
// generation, and the following generation then contains a GoUnblock event. The trace parser
// will get confused because it sees that goroutine in _Grunnable in the previous generation
// trying to be transitioned from _Gwaiting into _Grunnable in the following one. Something
// similar happens if the trace event is emitted before the state transition, so that does
// not help either.
//
// Transitions to _Grunning do not have the same problem because traceAdvance is unable to
// observe running goroutines directly. It must stop them, or wait for them to emit an event.
// Note that it cannot even stop them with asynchronous preemption in any "bad" window between
// the state transition to _Grunning and the event emission, because async preemption cannot
// stop goroutines in the runtime.
//
// 7. Goroutine state transitions into _Grunning MUST emit an event for the transition after
// the state transition.
//
// This follows from invariants (4) and (5) and the explanation of (6).
// The relevant part of the previous invariant is that in order for the tracer to be unable to
// stop a goroutine, it must be in _Grunning and in the runtime. So to close any windows between
// event emission and the state transition, the event emission must happen *after* the transition
// to _Grunning.
//
// ## Select corner cases
//
// ### CGO calls / system calls
//
// CGO calls and system calls are mostly straightforward, except for P stealing. For historical
// reasons, there is a trace-level P state called ProcSyscall, which used to model _Psyscall
// (now _Psyscall_unused). This state is used to indicate in the trace that a P
// is eligible for stealing as part of the parser's ordering logic.
//
// Another quirk of this corner case is the ProcSyscallAbandoned trace-level P state, which
// is used only in status events to indicate a relaxation of verification requirements. It
// tells the parser that if it can't find the corresponding thread that the P was stolen from
// in the state it expects, it should accept the trace anyway. This is also
// historical. When _Psyscall still existed, one would steal the P and then emit a ProcSteal
// event, and there was no ordering between the ProcSteal and the subsequent GoSyscallEndBlocked.
// One clearly happened before the other, but since P stealing was a single atomic, there was
// no way to enforce the order. The GoSyscallEndBlocked thread could move on and end up in any
// state, and the GoSyscallEndBlocked could be in a completely different generation from the
// ProcSteal. Today this is no longer possible, as the ProcSteal is always ordered before
// the GoSyscallEndBlocked event in the runtime.
//
// Both ProcSyscall and ProcSyscallAbandoned are likely no longer necessary.
//
// ### CGO callbacks
//
// When a C thread calls into Go, the execution tracer models that as the creation of a new
// goroutine. When the thread exits back into C, that is modeled as the destruction of that
// goroutine. These are the GoCreateSyscall and GoDestroySyscall events, which represent the
// creation and destruction of a goroutine with its starting and ending states being _Gsyscall.
//
// This model is simple to reason about but contradicts the runtime implementation, which
// doesn't do this directly for performance reasons. The runtime implementation instead caches
// a G on the M created for the C thread. On Linux this M is then cached in the thread's TLS,
// and on other systems the M is put on a global list on exit from Go. We need to do some
// extra work to make sure that this is modeled correctly in the tracer. For example,
// a C thread exiting Go may leave a P hanging off of its M (whether that M is kept in TLS
// or placed back on a list). In order to correctly model goroutine creation and destruction,
// we must behave as if the P was at some point stolen by the runtime, if the C thread
// reenters Go with the same M (and thus, same P) once more.

package runtime

import (
	"internal/runtime/atomic"
	"internal/trace/tracev2"
	"unsafe"
)

// Trace state.

// trace is the global tracing context.
var trace struct {
	// trace.lock must only be acquired on the system stack where
	// stack splits cannot happen while it is held.
	lock mutex

	// Trace buffer management.
	//
	// First we check the empty list for any free buffers. If not, buffers
	// are allocated directly from the OS. Once they're filled up and/or
	// flushed, they end up on the full queue for trace.gen%2.
	//
	// The trace reader takes buffers off the full list one-by-one and
	// places them into reading until they're finished being read from.
	// Then they're placed onto the empty list.
	//
	// Protected by trace.lock.
	reading       *traceBuf // buffer currently handed off to user
	empty         *traceBuf // stack of empty buffers
	full          [2]traceBufQueue
	workAvailable atomic.Bool

	// State for the trace reader goroutine.
	//
	// Protected by trace.lock.
	readerGen              atomic.Uintptr // the generation the reader is currently reading for
	flushedGen             atomic.Uintptr // the last completed generation
	headerWritten          bool           // whether ReadTrace has emitted the trace header
	endOfGenerationWritten bool           // whether ReadTrace has emitted the end-of-generation signal

	// doneSema is used to synchronize the reader and traceAdvance. Specifically,
	// it notifies traceAdvance that the reader is done with a generation.
	// Both semaphores are 0 by default (so, acquires block). traceAdvance
	// attempts to acquire for gen%2 after flushing the last buffers for gen.
	// Meanwhile the reader releases the sema for gen%2 when it has finished
	// processing gen.
	doneSema [2]uint32

	// Trace data tables for deduplicating data going into the trace.
	// There are 2 of each: one for gen%2, one for 1-gen%2.
	stackTab  [2]traceStackTable  // maps stack traces to unique ids
	stringTab [2]traceStringTable // maps strings to unique ids
	typeTab   [2]traceTypeTable   // maps type pointers to unique ids

	// cpuLogRead accepts CPU profile samples from the signal handler where
	// they're generated. There are two profBufs here: one for gen%2, one for
	// 1-gen%2. These profBufs use a three-word header to hold the IDs of the P, G,
	// and M (respectively) that were active at the time of the sample.
	// Because profBuf uses a record with all zeros in its header to indicate overflow,
	// we make sure to make the P field always non-zero: the ID of a real P will
	// start at bit 1, and bit 0 will be set. Samples that arrive while no P is
	// running (such as near syscalls) will set the first header field to 0b10.
	// This careful handling of the first header field allows us to store the ID of
	// the active G directly in the second field, even though that will be 0
	// when sampling g0.
	//
	// Initialization and teardown of these fields is protected by traceAdvanceSema.
	cpuLogRead  [2]*profBuf
	signalLock  atomic.Uint32              // protects use of the following member, only usable in signal handlers
	cpuLogWrite [2]atomic.Pointer[profBuf] // copy of cpuLogRead for use in signal handlers, set without signalLock
	cpuSleep    *wakeableSleep
	cpuLogDone  <-chan struct{}
	cpuBuf      [2]*traceBuf

	reader atomic.Pointer[g] // goroutine that called ReadTrace, or nil

	// Fast mappings from enumerations to string IDs that are prepopulated
	// in the trace.
	markWorkerLabels [2][len(gcMarkWorkerModeStrings)]traceArg
	goStopReasons    [2][len(traceGoStopReasonStrings)]traceArg
	goBlockReasons   [2][len(traceBlockReasonStrings)]traceArg

	// enabled indicates whether tracing is enabled, but it is only an optimization,
	// NOT the source of truth on whether tracing is enabled. Tracing is only truly
	// enabled if gen != 0. This is used as an optimistic fast path check.
	//
	// Transitioning this value from true -> false is easy (once gen is 0)
	// because it's OK for enabled to have a stale "true" value. traceAcquire will
	// always double-check gen.
	//
	// Transitioning this value from false -> true is harder. We need to make sure
	// this is observable as true strictly before gen != 0. To maintain this invariant
	// we only make this transition with the world stopped and use the store to gen
	// as a publication barrier.
	enabled bool

	// enabledWithAllocFree is set if debug.traceallocfree is != 0 when tracing begins.
	// It follows the same synchronization protocol as enabled.
	enabledWithAllocFree bool

	// Trace generation counter.
	gen            atomic.Uintptr
	lastNonZeroGen uintptr // last non-zero value of gen

	// shutdown is set when we are waiting for the trace reader to finish after setting gen to 0.
	//
	// Writes protected by trace.lock.
	shutdown atomic.Bool

	// Number of goroutines in the syscall-exit slow path.
	exitingSyscall atomic.Int32

	// seqGC is the sequence counter for GC begin/end.
	//
	// Mutated only during stop-the-world.
	seqGC uint64

	// minPageHeapAddr is the minimum address of the page heap when tracing started.
	minPageHeapAddr uint64

	// debugMalloc is the value of debug.malloc before tracing began.
	debugMalloc bool
}

// Trace public API.

var (
	traceAdvanceSema  uint32 = 1
	traceShutdownSema uint32 = 1
)

// StartTrace enables tracing for the current process.
// While tracing, the data will be buffered and available via [ReadTrace].
// StartTrace returns an error if tracing is already enabled.
// Most clients should use the [runtime/trace] package or the [testing] package's
// -test.trace flag instead of calling StartTrace directly.
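//
// As a rough usage sketch of this API (this mirrors what the [runtime/trace] package
// does on the caller's behalf; the writer w below is hypothetical):
//
//	if err := StartTrace(); err != nil {
//		return err // tracing was already enabled
//	}
//	go func() {
//		for {
//			data := ReadTrace() // blocks until data is available
//			if data == nil {
//				return // tracing has stopped and all data has been read
//			}
//			w.Write(data) // copy out the data before the next ReadTrace call
//		}
//	}()
//	// ... run the workload to be traced ...
//	StopTrace()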
func StartTrace() error {
	if traceEnabled() || traceShuttingDown() {
		return errorString("tracing is already enabled")
	}
	// Block until cleanup of the last trace is done.
	semacquire(&traceShutdownSema)
	semrelease(&traceShutdownSema)

	// Hold traceAdvanceSema across trace start, since we'll want it on
	// the other side of tracing being enabled globally.
	semacquire(&traceAdvanceSema)

	// Initialize CPU profile -> trace ingestion.
	traceInitReadCPU()

	// Compute the first generation for this StartTrace.
	//
	// Note: we start from the last non-zero generation rather than 1 so we
	// can avoid resetting all the arrays indexed by gen%2 or gen%3. There's
	// more than one of each per m, p, and goroutine.
	firstGen := traceNextGen(trace.lastNonZeroGen)

	// Reset GC sequencer.
	trace.seqGC = 1

	// Reset trace reader state.
	trace.headerWritten = false
	trace.readerGen.Store(firstGen)
	trace.flushedGen.Store(0)

	// Register some basic strings in the string tables.
	traceRegisterLabelsAndReasons(firstGen)

	// N.B. This may block for quite a while to get a frequency estimate. Do it
	// here to minimize the time that the world is stopped.
	frequency := traceClockUnitsPerSecond()

	// Stop the world.
	//
	// What we need to successfully begin tracing is to make sure that the next time
	// *any goroutine* hits a traceAcquire, it sees that the trace is enabled.
	//
	// Stopping the world gets us most of the way there, since it makes sure that goroutines
	// stop executing. There is however one exception: goroutines without Ps concurrently
	// exiting a syscall. We handle this by making sure that, after we update trace.gen,
	// there isn't a single goroutine calling traceAcquire on the syscall slow path, by checking
	// trace.exitingSyscall. See the comment on the check below for more details.
	//
	// Note also that stopping the world is necessary to make sure sweep-related events are
	// coherent. Since the world is stopped and sweeps are non-preemptible, we can never start
	// the world and see an unpaired sweep 'end' event. Other parts of the tracer rely on this.
	stw := stopTheWorld(stwStartTrace)

	// Prevent sysmon from running any code that could generate events.
	lock(&sched.sysmonlock)

	// Grab the minimum page heap address. All Ps are stopped, so it's safe to read this since
	// nothing can allocate heap memory.
	trace.minPageHeapAddr = uint64(mheap_.pages.inUse.ranges[0].base.addr())

	// Reset mSyscallID on all Ps while we have them stationary and the trace is disabled.
	for _, pp := range allp {
		pp.trace.mSyscallID = -1
	}

	// Start tracing.
	//
	// Set trace.enabled. This is *very* subtle. We need to maintain the invariant that if
	// trace.gen != 0, then trace.enabled is always observed as true. Simultaneously, for
	// performance, we need trace.enabled to be read without any synchronization.
	//
	// We ensure this is safe by stopping the world, which acts as a global barrier on almost
	// every M, and by explicitly synchronizing with any other Ms that could be running concurrently
	// with us. Today, there are only two such cases:
	// - sysmon, which we synchronize with by acquiring sysmonlock.
	// - goroutines exiting syscalls, which we synchronize with via trace.exitingSyscall.
	//
	// After trace.gen is updated, other Ms may start creating trace buffers and emitting
	// data into them.
	trace.enabled = true
	if debug.traceallocfree.Load() != 0 {
		// Enable memory events since the GODEBUG is set.
		trace.debugMalloc = debug.malloc
		trace.enabledWithAllocFree = true
		debug.malloc = true
	}
	trace.gen.Store(firstGen)

	// Wait for exitingSyscall to drain.
	//
	// It may not monotonically decrease to zero, but in the limit it will always become
	// zero because the world is stopped and there are no available Ps for syscall-exited
	// goroutines to run on.
	//
	// Because we set gen before checking this, and because exitingSyscall is always incremented
	// *before* traceAcquire (which checks gen), we can be certain that once exitingSyscall is zero,
	// any goroutine that goes on to exit a syscall from then on *must* observe the new gen as
	// well as trace.enabled being set to true.
	//
	// The critical section on each goroutine here is going to be quite short, so the likelihood
	// that we observe a zero value is high.
	for trace.exitingSyscall.Load() != 0 {
		osyield()
	}

	// Record some initial pieces of information.
	//
	// N.B. This will also emit a status event for this goroutine.
	tl := traceAcquire()
	traceSyncBatch(firstGen, frequency) // Get this as early in the trace as possible. See comment in traceAdvance.
	tl.Gomaxprocs(gomaxprocs)           // Get this as early in the trace as possible. See comment in traceAdvance.
	tl.STWStart(stwStartTrace)          // We didn't trace this above, so trace it now.

	// Record the fact that a GC is active, if applicable.
	if gcphase == _GCmark || gcphase == _GCmarktermination {
		tl.GCActive()
	}

	// Dump a snapshot of memory, if enabled.
	if trace.enabledWithAllocFree {
		traceSnapshotMemory(firstGen)
	}

	// Record the heap goal so we have it at the very beginning of the trace.
	tl.HeapGoal()
	traceRelease(tl)

	unlock(&sched.sysmonlock)
	startTheWorld(stw)

	traceStartReadCPU()
	traceAdvancer.start()

	semrelease(&traceAdvanceSema)
	return nil
}

// StopTrace stops tracing, if it was previously enabled.
// StopTrace only returns after all the reads for the trace have completed.
func StopTrace() {
	traceAdvance(true)
}

// traceAdvance moves tracing to the next generation, and cleans up the current generation,
// ensuring that it's flushed out before returning. If stopTrace is true, it disables tracing
// altogether instead of advancing to the next generation.
//
// traceAdvanceSema must not be held.
//
// traceAdvance is called by runtime/trace and golang.org/x/exp/trace using linkname.
//
//go:linkname traceAdvance
func traceAdvance(stopTrace bool) {
	semacquire(&traceAdvanceSema)

	// Get the gen that we're advancing from. In this function we don't really care much
	// about the generation we're advancing _into_ since we'll do all the cleanup in this
	// generation for the next advancement.
	gen := trace.gen.Load()
	if gen == 0 {
		// We may end up here if traceAdvance is called concurrently with StopTrace.
		semrelease(&traceAdvanceSema)
		return
	}

	// Collect all the untraced Gs.
	type untracedG struct {
		gp           *g
		goid         uint64
		mid          int64
		stackID      uint64
		status       uint32
		waitreason   waitReason
		inMarkAssist bool
	}
	var untracedGs []untracedG
	forEachGRace(func(gp *g) {
		// Make absolutely sure all Gs are ready for the next
		// generation. We need to do this even for dead Gs because
		// they may come alive with a new identity, and their status
		// traced bookkeeping might end up being stale.
		// We may miss totally new goroutines, but they'll always
		// have clean bookkeeping.
		gp.trace.readyNextGen(gen)
		// If the status was traced, there's nothing else to do.
		if gp.trace.statusWasTraced(gen) {
			return
		}
		// Scribble down information about this goroutine.
		ug := untracedG{gp: gp, mid: -1}
		systemstack(func() {
			me := getg().m.curg
			// We don't have to handle this G status transition because we
			// already eliminated ourselves from consideration above.
			casGToWaitingForSuspendG(me, _Grunning, waitReasonTraceGoroutineStatus)
			// We need to suspend and take ownership of the G to safely read its
			// goid. Note that we can't actually emit the event at this point
			// because we might stop the G in a window where it's unsafe to write
			// events based on the G's status. We need the global trace buffer flush
			// coming up to make sure we're not racing with the G.
			//
			// It should be very unlikely that we try to preempt a running G here.
			// The only situation in which we might is if we're racing with a G
			// that's running for the first time in this generation. Therefore,
			// this should be relatively fast.
			s := suspendG(gp)
			if !s.dead {
				ug.goid = s.g.goid
				if s.g.m != nil {
					ug.mid = int64(s.g.m.procid)
				}
				ug.status = readgstatus(s.g) &^ _Gscan
				ug.waitreason = s.g.waitreason
				ug.inMarkAssist = s.g.inMarkAssist
				ug.stackID = traceStack(0, gp, &trace.stackTab[gen%2])
			}
			resumeG(s)
			casgstatus(me, _Gwaiting, _Grunning)
		})
		if ug.goid != 0 {
			untracedGs = append(untracedGs, ug)
		}
	})

	if !stopTrace {
		// Re-register runtime goroutine labels and stop/block reasons.
		traceRegisterLabelsAndReasons(traceNextGen(gen))
	}

	// N.B. This may block for quite a while to get a frequency estimate. Do it
	// here to minimize the time that we prevent the world from stopping.
	frequency := traceClockUnitsPerSecond()

	// Prevent the world from stopping.
	//
	// This is necessary to ensure the consistency of the STW events. If we're feeling
	// adventurous we could lift this restriction and add an STWActive event, but the
	// cost of maintaining this consistency is low.
	//
	// This is also a good time to preempt all the Ps and ensure they have had a status traced.
	semacquire(&worldsema)

	// Go over each P and emit a status event for it if necessary.
	//
	// TODO(mknyszek): forEachP is very heavyweight. We could do better by integrating
	// the statusWasTraced check into it, to avoid preempting unnecessarily.
	forEachP(waitReasonTraceProcStatus, func(pp *p) {
		tl := traceAcquire()
		if !pp.trace.statusWasTraced(tl.gen) {
			tl.writer().writeProcStatusForP(pp, false).end()
		}
		traceRelease(tl)
	})

	// While we're still holding worldsema (preventing a STW and thus a
	// change in the number of Ps), reset the status on dead Ps.
	// They just appear as idle.
	//
	// TODO(mknyszek): Consider explicitly emitting ProcCreate and ProcDestroy
	// events to indicate whether a P exists, rather than just making its
	// existence implicit.
	for _, pp := range allp[len(allp):cap(allp)] {
		pp.trace.readyNextGen(gen)
	}

	// Prevent preemption to make sure we're not interrupted.
	//
	// We want to get through the rest as soon as possible.
	mp := acquirem()

	// Advance the generation or stop the trace.
	trace.lastNonZeroGen = gen
	if stopTrace {
		systemstack(func() {
			// Ordering is important here. Set shutdown first, then disable tracing,
			// so that conditions like (traceEnabled() || traceShuttingDown()) have
			// no opportunity to be false. Hold the trace lock so this update appears
			// atomic to the trace reader.
			lock(&trace.lock)
			trace.shutdown.Store(true)
			trace.gen.Store(0)
			unlock(&trace.lock)

			// Clear trace.enabled. It is totally OK for this value to be stale,
			// because traceAcquire will always double-check gen.
			trace.enabled = false
		})
	} else {
		trace.gen.Store(traceNextGen(gen))
	}

	// Emit a sync batch, which contains a ClockSnapshot. Also emit a ProcsChange
	// event so we have one on record for each generation. Let's emit it as soon
	// as possible so that downstream tools can rely on the value being there
	// fairly early in a generation.
	//
	// It's important that we do this before allowing stop-the-worlds again,
	// because the procs count could change.
	if !stopTrace {
		tl := traceAcquire()
		traceSyncBatch(tl.gen, frequency)
		tl.Gomaxprocs(gomaxprocs)
		traceRelease(tl)
	}

	// Emit a GCActive event in the new generation if necessary.
	//
	// It's important that we do this before allowing stop-the-worlds again,
	// because that could emit global GC-related events.
	if !stopTrace && (gcphase == _GCmark || gcphase == _GCmarktermination) {
		tl := traceAcquire()
		tl.GCActive()
		traceRelease(tl)
	}

	// Preemption is OK again after this. If the world stops, or anything else happens, it's fine.
	// We're just cleaning up the last generation after this point.
	//
	// We also don't care if the GC starts again after this, for the same reasons.
	releasem(mp)
	semrelease(&worldsema)

	// Snapshot allm and freem.
	//
	// Snapshotting after the generation counter update is sufficient.
	// Because an m must be on either allm or sched.freem if it has an active trace
	// buffer, new threads added to allm after this point must necessarily observe
	// the new generation number (sched.lock acts as a barrier).
	//
	// Threads that exit before this point and are on neither list explicitly
	// flush their own buffers in traceThreadDestroy.
	//
	// Snapshotting freem is necessary because Ms can continue to emit events
	// while they're still on that list. Removal from sched.freem is serialized with
	// this snapshot, so either we'll capture an m on sched.freem and race with
	// the removal to flush its buffers (resolved by traceThreadDestroy acquiring
	// the thread's write flag, which one of us must win, so at least its old-gen buffer
	// will be flushed in time for the new generation), or it will have flushed its
	// buffers before we snapshotted it to begin with.
	lock(&sched.lock)
	mToFlush := allm
	for mp := mToFlush; mp != nil; mp = mp.alllink {
		mp.trace.link = mp.alllink
	}
	for mp := sched.freem; mp != nil; mp = mp.freelink {
		mp.trace.link = mToFlush
		mToFlush = mp
	}
	unlock(&sched.lock)

	// Iterate over our snapshot, flushing every buffer until we're done.
	//
	// Because trace writers read the generation while the write flag is
	// held, we can be certain that when there are no writers there are
	// also no stale generation values left. Therefore, it's safe to flush
	// any buffers that remain in that generation's slot.
	const debugDeadlock = false
	systemstack(func() {
		// Track iterations for some rudimentary deadlock detection.
		i := 0
		detectedDeadlock := false

		for mToFlush != nil {
			prev := &mToFlush
			for mp := *prev; mp != nil; {
				if mp.trace.writing.Load() {
					// The M is writing. Come back to it later.
					prev = &mp.trace.link
					mp = mp.trace.link
					continue
				}
				// Flush the trace buffer.
				//
				// trace.lock is needed for traceBufFlush, but also to synchronize
				// with traceThreadDestroy, which flushes both buffers unconditionally.
				lock(&trace.lock)
				for exp, buf := range mp.trace.buf[gen%2] {
					if buf != nil {
						traceBufFlush(buf, gen)
						mp.trace.buf[gen%2][exp] = nil
					}
				}
				unlock(&trace.lock)

				// Remove the m from the flush list.
				*prev = mp.trace.link
				mp.trace.link = nil
				mp = *prev
			}
			// Yield only if we're going to be going around the loop again.
			if mToFlush != nil {
				osyield()
			}

			if debugDeadlock {
				// Try to detect a deadlock. We probably shouldn't loop here
				// this many times.
				if i > 100000 && !detectedDeadlock {
					detectedDeadlock = true
					println("runtime: failing to flush")
					for mp := mToFlush; mp != nil; mp = mp.trace.link {
						print("runtime: m=", mp.id, "\n")
					}
				}
				i++
			}
		}
	})

	// At this point, the old generation is fully flushed, minus stack and string
	// tables, CPU samples, and goroutines that haven't run at all during the last
	// generation.

	// Check to see if any Gs still haven't had events written out for them.
	statusWriter := unsafeTraceWriter(gen, nil)
	for _, ug := range untracedGs {
		if ug.gp.trace.statusWasTraced(gen) {
			// It was traced, we don't need to do anything.
			continue
		}
		// It still wasn't traced. Because we ensured all Ms stopped writing trace
		// events to the last generation, that must mean the G never had its status
		// traced in gen between when we recorded it and now. If that's true, the goid
		// and status we recorded then are exactly what we want right now.
		status := goStatusToTraceGoStatus(ug.status, ug.waitreason)
		statusWriter = statusWriter.writeGoStatus(ug.goid, ug.mid, status, ug.inMarkAssist, ug.stackID)
	}
	statusWriter.flush().end()

	// Read everything out of the last gen's CPU profile buffer.
	traceReadCPU(gen)

	// Flush CPU samples, stacks, and strings for the last generation. This is safe,
	// because we're now certain no M is writing to the last generation.
	//
	// Ordering is important here. traceCPUFlush may generate new stacks, and dumping
	// stacks may generate new strings.
	traceCPUFlush(gen)
	trace.stackTab[gen%2].dump(gen)
	trace.typeTab[gen%2].dump(gen)
	trace.stringTab[gen%2].reset(gen)

	// That's it. This generation is done producing buffers.
	systemstack(func() {
		lock(&trace.lock)
		trace.flushedGen.Store(gen)
		unlock(&trace.lock)
	})

	if stopTrace {
		// Acquire the shutdown sema to begin the shutdown process.
		semacquire(&traceShutdownSema)

		// Finish off CPU profile reading.
		traceStopReadCPU()

		// Reset debug.malloc if necessary. Note that this is set in a racy
		// way; that's OK. Some mallocs may still enter into the debug.malloc
		// block, but they won't generate events because tracing is disabled.
		// That is, it's OK if mallocs read a stale debug.malloc or
		// trace.enabledWithAllocFree value.
		if trace.enabledWithAllocFree {
			trace.enabledWithAllocFree = false
			debug.malloc = trace.debugMalloc
		}
	}

	// Block until the trace reader has finished processing the last generation.
	semacquire(&trace.doneSema[gen%2])
	if raceenabled {
		raceacquire(unsafe.Pointer(&trace.doneSema[gen%2]))
	}

	// Double-check that things look as we expect after advancing and perform some
	// final cleanup if the trace has fully stopped.
	systemstack(func() {
		lock(&trace.lock)
		if !trace.full[gen%2].empty() {
			throw("trace: non-empty full trace buffer for done generation")
		}
		if stopTrace {
			if !trace.full[1-(gen%2)].empty() {
				throw("trace: non-empty full trace buffer for next generation")
			}
			if trace.reading != nil || trace.reader.Load() != nil {
				throw("trace: reading after shutdown")
			}
			// Free all the empty buffers.
			for trace.empty != nil {
				buf := trace.empty
				trace.empty = buf.link
				sysFree(unsafe.Pointer(buf), unsafe.Sizeof(*buf), &memstats.other_sys)
			}
			// Clear trace.shutdown and other flags.
			trace.headerWritten = false
			trace.shutdown.Store(false)
		}
		unlock(&trace.lock)
	})

	if stopTrace {
		// Clear the sweep state on every P for the next time tracing is enabled.
		//
		// It may be stale in the next trace because we may have ended tracing in
		// the middle of a sweep on a P.
		//
		// It's fine not to call forEachP here because tracing is disabled and we
		// know at this point that nothing is calling into the tracer, but we do
		// need to look at dead Ps too just because GOMAXPROCS could have been called
		// at any point since we stopped tracing, and we have to ensure there's no
		// bad state on dead Ps too. Prevent a STW and a concurrent GOMAXPROCS that
		// might mutate allp by making ourselves briefly non-preemptible.
		mp := acquirem()
		for _, pp := range allp[:cap(allp)] {
			pp.trace.inSweep = false
			pp.trace.maySweep = false
			pp.trace.swept = 0
			pp.trace.reclaimed = 0
		}
		releasem(mp)
	}

	// Release the advance semaphore. If stopTrace is true we're still holding onto
	// traceShutdownSema.
	//
	// Do a direct handoff. Don't let one caller of traceAdvance starve
	// other calls to traceAdvance.
	semrelease1(&traceAdvanceSema, true, 0)

	if stopTrace {
		// Stop the traceAdvancer. We can't be holding traceAdvanceSema here because
		// we'll deadlock (we're blocked on the advancer goroutine exiting, but it
		// may be currently trying to acquire traceAdvanceSema).
		traceAdvancer.stop()
		semrelease(&traceShutdownSema)
	}
}

func traceNextGen(gen uintptr) uintptr {
	if gen == ^uintptr(0) {
		// gen is used both %2 and %3 and we want both patterns to continue when we loop around.
		// ^uint32(0) and ^uint64(0) are both odd and multiples of 3. Therefore the next generation
		// we want is even and one more than a multiple of 3. The smallest such number is 4.
		return 4
	}
	return gen + 1
}

// traceRegisterLabelsAndReasons re-registers mark worker labels and
// goroutine stop/block reasons in the string table for the provided
// generation. Note: the provided generation must not have started yet.
func traceRegisterLabelsAndReasons(gen uintptr) {
	for i, label := range gcMarkWorkerModeStrings[:] {
		trace.markWorkerLabels[gen%2][i] = traceArg(trace.stringTab[gen%2].put(gen, label))
	}
	for i, str := range traceBlockReasonStrings[:] {
		trace.goBlockReasons[gen%2][i] = traceArg(trace.stringTab[gen%2].put(gen, str))
	}
	for i, str := range traceGoStopReasonStrings[:] {
		trace.goStopReasons[gen%2][i] = traceArg(trace.stringTab[gen%2].put(gen, str))
	}
}

// ReadTrace returns the next chunk of binary tracing data, blocking until data
// is available. If tracing is turned off and all the data accumulated while it
// was on has been returned, ReadTrace returns nil. The caller must copy the
// returned data before calling ReadTrace again.
// ReadTrace must be called from one goroutine at a time.
func ReadTrace() (buf []byte) {
top:
	var park bool
	systemstack(func() {
		buf, park = readTrace0()
	})
	if park {
		gopark(func(gp *g, _ unsafe.Pointer) bool {
			if !trace.reader.CompareAndSwapNoWB(nil, gp) {
				// We're racing with another reader.
				// Wake up and handle this case.
				return false
			}

			if g2 := traceReader(); gp == g2 {
				// New data arrived between unlocking
				// and the CAS and we won the wake-up
				// race, so wake up directly.
				return false
			} else if g2 != nil {
				printlock()
				println("runtime: got trace reader", g2, g2.goid)
				throw("unexpected trace reader")
			}

			return true
		}, nil, waitReasonTraceReaderBlocked, traceBlockSystemGoroutine, 2)
		goto top
	}
	return buf
}

// readTrace0 is ReadTrace's continuation on g0. This must run on the
// system stack because it acquires trace.lock.
//
//go:systemstack
func readTrace0() (buf []byte, park bool) {
	if raceenabled {
		// g0 doesn't have a race context. Borrow the user G's.
		if getg().racectx != 0 {
			throw("expected racectx == 0")
		}
		getg().racectx = getg().m.curg.racectx
		// (This defer should get open-coded, which is safe on
		// the system stack.)
		defer func() { getg().racectx = 0 }()
	}

	// This function must not allocate while holding trace.lock:
	// allocation can call into the heap allocator, which will try to emit a trace
	// event while holding the heap lock.
	lock(&trace.lock)

	if trace.reader.Load() != nil {
		// More than one goroutine reads the trace. This is bad.
		// But we'd rather not crash the program because of tracing,
		// because tracing can be enabled at runtime on prod servers.
		unlock(&trace.lock)
		println("runtime: ReadTrace called from multiple goroutines simultaneously")
		return nil, false
	}
	// Recycle the old buffer.
	if buf := trace.reading; buf != nil {
		buf.link = trace.empty
		trace.empty = buf
		trace.reading = nil
	}
	// Write the trace header.
	if !trace.headerWritten {
		trace.headerWritten = true
		unlock(&trace.lock)
		return []byte("go 1.26 trace\x00\x00\x00"), false
	}

	// Read the next buffer.

	if trace.readerGen.Load() == 0 {
		trace.readerGen.Store(1)
	}
	var gen uintptr
	for {
		assertLockHeld(&trace.lock)
		gen = trace.readerGen.Load()

		// Check to see if we need to block for more data in this generation
		// or if we need to move our generation forward.
		if !trace.full[gen%2].empty() {
			break
		}
		// Most of the time readerGen is one generation ahead of flushedGen, as the
		// current generation is being read from. Then, once the last buffer for
		// readerGen's generation is flushed, flushedGen will rise to meet it. At that
		// point, the tracer is waiting on the reader to finish flushing the last
		// generation so that it can continue to advance.
		if trace.flushedGen.Load() == gen {
			// Write out the internal in-band end-of-generation signal.
			if !trace.endOfGenerationWritten {
				trace.endOfGenerationWritten = true
				unlock(&trace.lock)
				return []byte{byte(tracev2.EvEndOfGeneration)}, false
			}

			// Reset the flag.
			trace.endOfGenerationWritten = false

			// Handle shutdown.
			if trace.shutdown.Load() {
				unlock(&trace.lock)

				// Wake up anyone waiting for us to be done with this generation.
				//
				// Do this after reading trace.shutdown, because the thread we're
				// waking up is going to clear trace.shutdown.
				if raceenabled {
					// Model synchronization on trace.doneSema, which the race
					// detector does not see. This is required to avoid false
					// race reports on the writer passed to trace.Start.
					racerelease(unsafe.Pointer(&trace.doneSema[gen%2]))
				}
				semrelease(&trace.doneSema[gen%2])

				// We're shutting down, and the last generation is fully
				// read. We're done.
				return nil, false
			}
			// Handle advancing to the next generation.

			// The previous gen has had all of its buffers flushed, and
			// there's nothing else for us to read. Advance the generation
			// we're reading from and try again.
			trace.readerGen.Store(trace.gen.Load())
			unlock(&trace.lock)

			// Wake up anyone waiting for us to be done with this generation.
			//
			// Do this after reading gen to make sure we can't have the trace
			// advance until we've read it.
			if raceenabled {
				// See the comment above in the shutdown case.
				racerelease(unsafe.Pointer(&trace.doneSema[gen%2]))
			}
			semrelease(&trace.doneSema[gen%2])

			// Reacquire the lock and go back to the top of the loop.
			lock(&trace.lock)
			continue
		}
		// Wait for new data.
		//
		// We don't simply use a note because the scheduler
		// executes this goroutine directly when it wakes up
		// (also, a note would consume an M).
		//
		// Before we drop the lock, clear the workAvailable flag. Work can
		// only be queued with trace.lock held, so this is at least true until
		// we drop the lock.
		trace.workAvailable.Store(false)
		unlock(&trace.lock)
		return nil, true
	}
	// Pull a buffer.
	tbuf := trace.full[gen%2].pop()
	trace.reading = tbuf
	unlock(&trace.lock)
	return tbuf.arr[:tbuf.pos], false
}

// traceReader returns the trace reader that should be woken up, if any.
// Callers should first check that (traceEnabled() || traceShuttingDown()) is true.
//
// This must run on the system stack because it acquires trace.lock.
//
//go:systemstack
func traceReader() *g {
	gp := traceReaderAvailable()
	if gp == nil || !trace.reader.CompareAndSwapNoWB(gp, nil) {
		return nil
	}
	return gp
}

// traceReaderAvailable returns the trace reader if it is not currently
// scheduled and should be. Callers should first check that
// (traceEnabled() || traceShuttingDown()) is true.
func traceReaderAvailable() *g {
	// There are two conditions under which we definitely want to schedule
	// the reader:
	// - The reader is lagging behind in finishing off the last generation.
	//   In this case, trace buffers could even be empty, but the trace
	//   advancer will be waiting on the reader, so we have to make sure
	//   to schedule the reader ASAP.
	// - The reader has pending work to process for its reader generation
	//   (assuming readerGen is not lagging behind). Note that we also want
	//   to be careful *not* to schedule the reader if there's no work to do.
	//
	// We also want to be careful not to schedule the reader if there's no
	// reason to.
	if trace.flushedGen.Load() == trace.readerGen.Load() || trace.workAvailable.Load() {
		return trace.reader.Load()
	}
	return nil
}

// Trace advancer goroutine.
var traceAdvancer traceAdvancerState

type traceAdvancerState struct {
	timer *wakeableSleep
	done  chan struct{}
}

// start starts a new traceAdvancer.
func (s *traceAdvancerState) start() {
	// Start a goroutine to periodically advance the trace generation.
	s.done = make(chan struct{})
	s.timer = newWakeableSleep()
	go func() {
		for traceEnabled() {
			// Set a timer to wake us up.
			s.timer.sleep(int64(debug.traceadvanceperiod))

			// Try to advance the trace.
			traceAdvance(false)
		}
		s.done <- struct{}{}
	}()
}

// stop stops a traceAdvancer and blocks until it exits.
func (s *traceAdvancerState) stop() {
	s.timer.wake()
	<-s.done
	close(s.done)
	s.timer.close()
}

// defaultTraceAdvancePeriod is the approximate period between
// new generations.
const defaultTraceAdvancePeriod = 1e9 // 1 second.

// wakeableSleep manages a wakeable goroutine sleep.
//
// Users of this type must call newWakeableSleep before first use and
// close to free up resources. Once close is called, a new wakeableSleep
// must be created before another use.
type wakeableSleep struct {
	timer *timer

	// lock protects access to wakeup, but not send/recv on it.
	lock   mutex
	wakeup chan struct{}
}

// newWakeableSleep initializes a new wakeableSleep and returns it.
func newWakeableSleep() *wakeableSleep {
	s := new(wakeableSleep)
	lockInit(&s.lock, lockRankWakeableSleep)
	s.wakeup = make(chan struct{}, 1)
	s.timer = new(timer)
	f := func(s any, _ uintptr, _ int64) {
		s.(*wakeableSleep).wake()
	}
	s.timer.init(f, s)
	return s
}

// sleep sleeps for the provided duration in nanoseconds or until
// another goroutine calls wake.
//
// Must not be called by more than one goroutine at a time and
// must not be called concurrently with close.
func (s *wakeableSleep) sleep(ns int64) {
	s.timer.reset(nanotime()+ns, 0)
	lock(&s.lock)
	if raceenabled {
		raceacquire(unsafe.Pointer(&s.lock))
	}
	wakeup := s.wakeup
	if raceenabled {
		racerelease(unsafe.Pointer(&s.lock))
	}
	unlock(&s.lock)
	<-wakeup
	s.timer.stop()
}

// wake awakens any goroutine sleeping on the timer.
//
// Safe for concurrent use with all other methods.
func (s *wakeableSleep) wake() {
	// Grab the wakeup channel, which may be nil if we're
	// racing with close.
	lock(&s.lock)
	if raceenabled {
		raceacquire(unsafe.Pointer(&s.lock))
	}
	if s.wakeup != nil {
		// Non-blocking send.
		//
		// Others may also write to this channel and we don't
		// want to block on the receiver waking up. This also
		// effectively batches together wakeup notifications.
		select {
		case s.wakeup <- struct{}{}:
		default:
		}
	}
	if raceenabled {
		racerelease(unsafe.Pointer(&s.lock))
	}
	unlock(&s.lock)
}

// close wakes any goroutine sleeping on the timer and prevents
// further sleeping on it.
//
// Once close is called, the wakeableSleep must no longer be used.
//
// It must only be called once no goroutine is sleeping on the
// timer *and* nothing else will call wake concurrently.
func (s *wakeableSleep) close() {
	// Set wakeup to nil so that a late timer ends up being a no-op.
	lock(&s.lock)
	if raceenabled {
		raceacquire(unsafe.Pointer(&s.lock))
	}
	wakeup := s.wakeup
	s.wakeup = nil

	// Close the channel.
	close(wakeup)

	if raceenabled {
		racerelease(unsafe.Pointer(&s.lock))
	}
	unlock(&s.lock)
	return
}
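// For reference, a minimal lifecycle sketch for wakeableSleep. The loop condition and
// the work function below are hypothetical stand-ins; traceAdvancerState.start and stop
// above are the real users in this file.
//
//	s := newWakeableSleep()
//	done := make(chan struct{})
//	go func() {
//		for keepRunning() { // hypothetical condition
//			s.sleep(1e9)     // one second, in nanoseconds, unless wake is called sooner
//			doPeriodicWork() // hypothetical work
//		}
//		done <- struct{}{}
//	}()
//	// ...
//	s.wake()  // cut any in-progress sleep short
//	<-done    // now nothing is sleeping on s or about to call wake
//	s.close()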