
Go - Scheduler

2018-07-20    Source: 編程學習網


These notes are a little disorganized; they mostly follow the order in which I read the code. This isn't a book, so it is what it is.

PS: If you are new to Go, I don't recommend diving into the scheduler code right after learning the language; personally I find this part of the codebase quite messy.

Scheduling

Basic data structures

The runtime's data structure for a goroutine:

// stack describes a Go execution stack; its bounds are [lo, hi]
// in terms of a traditional memory layout, Go stacks are actually allocated
// in what C would call the heap, which is why a goroutine stack can grow
// beyond the ulimit -s stack size (up to 1GB)
type stack struct {
    lo uintptr
    hi uintptr
}
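
You can observe this from plain user code: deep recursion that would overflow a typical 8MB ulimit -s thread stack runs fine on a goroutine, because the runtime grows the stack by copying, up to a 1GB cap on 64-bit platforms. A minimal sketch (the 1024-byte pad is only there to force real stack usage):

package main

import "fmt"

var sink byte

func depth(n int) int {
    var pad [1024]byte
    sink = pad[n%1024] // keep pad live so each frame really occupies stack
    if n == 0 {
        return 0
    }
    return depth(n-1) + 1
}

func main() {
    // roughly 100MB of goroutine stack, far beyond a typical ulimit -s
    fmt.Println(depth(100000))
}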

// the saved execution context of a g
type gobuf struct {
    sp   uintptr    // the sp register
    pc   uintptr    // the pc register
    g    guintptr   // pointer back to the g
    ctxt unsafe.Pointer // seems to be used to assist the GC
    ret  sys.Uintreg
    lr   uintptr    // the link register, only used on arm; ignore it
    bp   uintptr    // only present with GOEXPERIMENT=framepointer enabled
}


type g struct {
    // 簡(jiǎn)單數(shù)據(jù)結(jié)構(gòu),lo 和 hi 成員描述了棧的下界和上界內(nèi)存地址
    stack       stack
    // the stack-growth prologue of a function compares the sp register against stackguard0;
    // if sp is smaller than stackguard0 (stacks grow toward lower addresses),
    // stack copying and scheduling are triggered
    // normally stackguard0 = stack.lo + StackGuard
    // but when a reschedule is needed, stackguard0 is set to StackPreempt
    // to trigger preemption
    stackguard0 uintptr
    // stackguard1 is what the C stack-growth prologue compares against
    // on the g0 and gsignal stacks its value is stack.lo + StackGuard
    // on other stacks it is ~0 (all bits set) so any check triggers a morestack call (and a crash)
    stackguard1 uintptr

    _panic         *_panic
    _defer         *_defer
    m              *m             // the m currently bound to this g
    sched          gobuf          // the goroutine's saved execution context
    syscallsp      uintptr        // if status==Gsyscall, syscallsp = sched.sp to use during gc
    syscallpc      uintptr        // if status==Gsyscall, syscallpc = sched.pc to use during gc
    stktopsp       uintptr        // expected sp at top of stack, to check in traceback
    param          unsafe.Pointer // parameter passed in on wakeup
    atomicstatus   uint32
    stackLock      uint32 // sigprof/scang lock; TODO: fold in to atomicstatus
    goid           int64  // goroutine id
    waitsince      int64  // approximate time when the g became blocked
    waitreason     string // if status==Gwaiting
    schedlink      guintptr
    preempt        bool     // preemption flag; when true, stackguard0 equals stackpreempt
    throwsplit     bool     // must not split stack
    raceignore     int8     // ignore race detection events
    sysblocktraced bool     // StartTrace has emitted EvGoInSyscall about this goroutine
    sysexitticks   int64    // cputicks when syscall returned, used for tracing
    traceseq       uint64   // trace event sequencer
    tracelastp     puintptr // last P emitted an event for this goroutine
    lockedm        muintptr // if LockOSThread was called, this g is bound to that m
    sig            uint32
    writebuf       []byte
    sigcode0       uintptr
    sigcode1       uintptr
    sigpc          uintptr
    gopc           uintptr // pc of the go statement that created this goroutine
    startpc        uintptr // pc of the goroutine's entry function
    racectx        uintptr
    waiting        *sudog         // sudog structures this g is waiting on (that have a valid elem ptr); in lock order
    cgoCtxt        []uintptr      // cgo traceback context
    labels         unsafe.Pointer // profiler labels
    timer          *timer         // cached timer for time.Sleep
    selectDone     uint32         // whether this g is participating in a select, and whether someone has already won the wakeup race
}

When a g blocks or otherwise needs to wait, it is wrapped into a sudog. A single g may be wrapped into multiple sudogs hanging on different wait queues:

// sudog represents a g on a wait list, e.g. waiting to send to or receive from a channel
// sudog is needed because the relationship between g's and synchronization objects
// is many-to-many: one g may be on several wait lists, so one g may be wrapped
// into several sudogs, and several g's may be waiting on the same synchronization
// object, so one object may have many sudogs
// sudogs are allocated from a dedicated pool; acquireSudog and releaseSudog
// allocate and release them
type sudog struct {

    // the following fields are protected by the hchan.lock of the channel
    // this g is blocked on; shrinkstack depends on
    // this for sudogs involved in channel ops.
    g *g

    // isSelect indicates that the g is participating in a select,
    // so g.selectDone must be CAS'd to win the wakeup race
    isSelect bool
    next     *sudog
    prev     *sudog
    elem     unsafe.Pointer // data element (may point to stack)

    // the following fields are never accessed concurrently
    // for channels, waitlink is only accessed by g
    // for semaphores, all fields (including the ones above) are only accessed
    // while holding the semaRoot lock
    acquiretime int64
    releasetime int64
    ticket      uint32
    parent      *sudog // semaRoot binary tree
    waitlink    *sudog // g.waiting list or semaRoot
    waittail    *sudog // semaRoot
    c           *hchan // channel
}
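
The many-to-many relationship is visible from plain user code: while a goroutine is blocked in a select, it is enqueued as one sudog on each channel's wait queue, and isSelect/g.selectDone arbitrate which wakeup wins. A small example of a g that is wrapped into two sudogs at once:

package main

import "fmt"

func main() {
    a, b := make(chan int), make(chan int)
    go func() { a <- 1 }()
    // while blocked here, the selecting g sits on both a's and b's
    // wait queues as two separate sudogs
    select {
    case v := <-a:
        fmt.Println("a:", v)
    case v := <-b:
        fmt.Println("b:", v)
    }
}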

The runtime structure for a thread. It corresponds to a pthread, which in turn corresponds to a unique kernel thread (task_struct):

type m struct {
    g0      *g     // the goroutine used to run scheduling code
    morebuf gobuf  // gobuf arg to morestack
    divmod  uint32 // div/mod denominator for arm - known to liblink

    // Fields not known to debuggers.
    procid        uint64       // for debuggers, but offset not hard-coded
    gsignal       *g           // signal-handling g
    goSigStack    gsignalStack // Go-allocated signal handling stack
    sigmask       sigset       // storage for saved signal mask
    tls           [6]uintptr   // thread-local storage (for x86 extern register)
    mstartfn      func()
    curg          *g       // the user goroutine currently running
    caughtsig     guintptr // goroutine running during fatal signal
    p             puintptr // attached p for executing go code (nil if not executing go code)
    nextp         puintptr
    id            int64
    mallocing     int32
    throwing      int32
    preemptoff    string // if not empty, keep curg always running on this m
    locks         int32
    softfloat     int32
    dying         int32
    profilehz     int32
    helpgc        int32
    spinning      bool // the m is out of work and actively looking for some
    blocked       bool // the m is blocked on a note
    inwb          bool // the m is executing a write barrier
    newSigstack   bool // minit on C thread called sigaltstack
    printlock     int8
    incgo         bool   // the m is executing a cgo call
    freeWait      uint32 // if == 0, safe to free g0 and delete m (atomic)
    fastrand      [2]uint32
    needextram    bool
    traceback     uint8
    ncgocall      uint64      // total number of cgo calls
    ncgo          int32       // number of cgo calls currently in progress
    cgoCallersUse uint32      // if non-zero, cgoCallers in use temporarily
    cgoCallers    *cgoCallers // cgo traceback if crashing in cgo call
    park          note
    alllink       *m // on allm
    schedlink     muintptr
    mcache        *mcache
    lockedg       guintptr
    createstack   [32]uintptr    // stack that created this thread.
    freglo        [16]uint32     // d[i] lsb and f[i]
    freghi        [16]uint32     // d[i] msb and f[i+16]
    fflag         uint32         // floating point compare flags
    lockedExt     uint32         // tracking for external LockOSThread
    lockedInt     uint32         // tracking for internal lockOSThread
    nextwaitm     muintptr       // next m waiting for a lock
    waitunlockf   unsafe.Pointer // todo go func(*g, unsafe.pointer) bool
    waitlock      unsafe.Pointer
    waittraceev   byte
    waittraceskip int
    startingtrace bool
    syscalltick   uint32
    thread        uintptr // thread handle
    freelink      *m      // on sched.freem

    // these are here because they are too large to be on the stack
    // of low-level NOSPLIT functions.
    libcall   libcall
    libcallpc uintptr // for cpu profiler
    libcallsp uintptr
    libcallg  guintptr
    syscall   libcall // stores syscall parameters on Windows

    mOS
}

An abstract data structure, best thought of as an abstraction of a processor; it represents the context needed to execute tasks, and an m must acquire a p before it can run:

type p struct {
    lock mutex

    id          int32
    status      uint32 // one of pidle/prunning/...
    link        puintptr
    schedtick   uint32     // incremented on every call to schedule
    syscalltick uint32     // incremented on every system call
    sysmontick  sysmontick // last tick observed by sysmon
    m           muintptr   // back-pointer to the associated m (nil if the p is idle)
    mcache      *mcache
    racectx     uintptr

    deferpool    [5][]*_defer // pool of available defer structs of different sizes (see panic.go)
    deferpoolbuf [5][32]*_defer

    // Cache of goroutine ids, amortizes accesses to runtime·sched.goidgen.
    goidcache    uint64
    goidcacheend uint64

    // queue of runnable goroutines, accessed without locks
    runqhead uint32
    runqtail uint32
    runq     [256]guintptr
    // if non-empty, runnext is a runnable G that was made ready
    // by the currently running G, and it has higher priority than
    // the G's in runq
    // if the current G still has time left in its slice, the runnext G should run next,
    // and it inherits the remainder of the current G's time slice
    // If a set of goroutines is locked in a
    // communicate-and-wait pattern, this schedules that set as a
    // unit and eliminates the (potentially large) scheduling
    // latency that otherwise arises from adding the ready'd
    // goroutines to the end of the run queue.
    runnext guintptr

    // Available G's (status == Gdead)
    gfree    *g
    gfreecnt int32

    sudogcache []*sudog
    sudogbuf   [128]*sudog

    tracebuf traceBufPtr

    // traceSweep indicates the sweep events should be traced.
    // This is used to defer the sweep start event until a span
    // has actually been swept.
    traceSweep bool
    // traceSwept and traceReclaimed track the number of bytes
    // swept and reclaimed by sweeping in the current sweep loop.
    traceSwept, traceReclaimed uintptr

    palloc persistentAlloc // per-P to avoid mutex

    // Per-P GC state
    gcAssistTime         int64 // Nanoseconds in assistAlloc
    gcFractionalMarkTime int64 // Nanoseconds in fractional mark worker
    gcBgMarkWorker       guintptr
    gcMarkWorkerMode     gcMarkWorkerMode

    // start time of the current mark worker, in nanoseconds
    gcMarkWorkerStartTime int64

    // gcw is this P's GC work buffer cache. The work buffer is
    // filled by write barriers, drained by mutator assists, and
    // disposed on certain GC state transitions.
    gcw gcWork

    // wbBuf is this P's GC write barrier buffer.
    //
    // TODO: Consider caching this in the running G.
    wbBuf wbBuf

    runSafePointFn uint32 // if 1, run sched.safePointFn at next safe point

    pad [sys.CacheLineSize]byte
}

The global scheduler; there is exactly one instance of schedt in the whole program:

type schedt struct {
    // the two fields below must be accessed atomically; they are kept at the top
    // of the struct so that they are aligned on 32-bit systems
    goidgen  uint64
    lastpoll uint64

    lock mutex

    // remember to call checkdead whenever any of nmidle,
    // nmidlelocked, nmsys or nmfreed is modified

    midle        muintptr // idle m's waiting for work
    nmidle       int32    // number of idle m's waiting for work
    nmidlelocked int32    // number of locked m's waiting for work
    mnext        int64    // number of m's created so far; also the ID of the next m to create
    maxmcount    int32    // maximum number of m's allowed
    nmsys        int32    // number of system m's not counted for deadlock
    nmfreed      int64    // cumulative number of freed m's

    ngsys uint32 // number of system goroutines; updated atomically

    pidle      puintptr // idle p's
    npidle     uint32
    nmspinning uint32 // See "Worker thread parking/unparking" comment in proc.go.

    // global queue of runnable g's
    runqhead guintptr
    runqtail guintptr
    runqsize int32

    // global cache of dead G's
    gflock       mutex
    gfreeStack   *g
    gfreeNoStack *g
    ngfree       int32

    // central cache of sudog structs
    sudoglock  mutex
    sudogcache *sudog

    // central pool of available defer structs of various sizes
    deferlock mutex
    deferpool [5]*_defer

    // m's that have had m.exited set are on the freem list, waiting to be freed;
    // the list is linked through m.freelink
    freem *m

    gcwaiting  uint32 // gc is waiting to run
    stopwait   int32
    stopnote   note
    sysmonwait uint32
    sysmonnote note

    // safepointFn should be called on each P at the next GC
    // safepoint if p.runSafePointFn is set.
    safePointFn   func(*p)
    safePointWait int32
    safePointNote note

    profilehz int32 // cpu profiling rate

    procresizetime int64 // nanotime() of the last change to gomaxprocs
    totaltime      int64 // ∫gomaxprocs dt up to procresizetime
}

The relationship between g, p and m

Go implements the so-called M:N model. All goroutines running user code can be considered equals. Setting aside g0 and gsignal, scheduling boils down to binding an m to a p and then looping the scheduling function (runtime.schedule) on that m, looking for a runnable g to execute. The diagram below shows the possible sources of g once an m is bound to a p:

+--------------+
                                                |    binded    +-------------------------------------+
                                                +-------+------+                                     |
+------------------------------------+                  |                                            v                         +------------------------------------+
|                                    |                  |                         +------------------------------------+       |                                    |
|             +------------------+   |                  |                         |                                    |       |            +------------------+    |
|             | Local Run Queue  |   |                  |                         |             +------------------+   |       |            | Global Run Queue |    |
|   other P   +-+-+-+-+-+-+-+-+--+   |                  |                         |             | Local Run Queue  |   |       |  schedt    +--+-+-+-+-+-+-+---+    |
|               |G|G|G|G|G|G|G|      |                  |                         |    P        +-+-+-+-+-+-+-+-+--+   |       |               |G|G|G|G|G|G|        |
|               +-+-+-+-+-+-+-+      |                  |                         |               |G|G|G|G|G|G|G|      |       |               +-+-+-+-+-+-+        |
|                ^                   |                  |                         |               +-+-+-+-+-+-+-+      |       |                ^                   |
+----------------+-------------------+                  |                         |                ^                   |       +----------------+-------------------+
                 |                                      |                         +----------------+-------------------+                        |
                 |                                      |                                          |                                            |
                 |                                      |                                          |                                            |
                 |                                      |                                          |                                            |
                 |                                      |                                          |                                            |
                 |                                      |                                          |                                            |
                 |                                      |                                          |                                            |
                 |                                      |                                          |                                            |
                 |                                      |                                          |                                            |
                 |                                      |                                          |                                            |
                 |                                      |                                          |                                            |
                 |                                      |                                          |                                            |
                 |                                      v                                          |                                            |
          +------+-------+                             .-.      +----------------+                 |                                            |
          |    steal     +----------------------------( M )-----+    runqget     +-----------------+                                            |
          +--------------+                             `-'      +----------------+                                                              |
                                                        |                                                                                       |
                                                        |                                                                           +-----------+-----+
                                                        +---------------------------------------------------------------------------+   globrunqget   |
                                                        |                                                                           +-----------------+
                                                        |
                                                        |
                                                        |
                                                        |
                                                        |
                                                        |
                                             +----------+--------+
                                             |   get netpoll g   |
                                             +----------+--------+
                                                        |
                                                        |
                                                        |
                                                        |
                                                        |
                                         +--------------+--------------------+
                                         |              |                    |
                                         |              |                    |
                                         |   netpoll    v                    |
                                         |             +-+-+-+-+             |
                                         |             |G|G|G|G|             |
                                         |             +-+-+-+-+             |
                                         |                                   |
                                         +-----------------------------------+

This diagram shows the rough relationship between g, p and m. m is the execution entity, corresponding to an OS thread. As you can see, an m takes runnable g's from its bound p's local queue, from the global queue in sched, and from netpoll; if it still can't find any, it goes and steals from other p's.

How p is initialized

At program startup, the following are called in order:

graph TD
runtime.schedinit -->  runtime.procresize

procresize initializes the global p array and threads these p's into the pidle list of the global scheduler sched:

for i := nprocs - 1; i >= 0; i-- {
    p := allp[i]

    // ...
    // set the p's status
    p.status = _Pidle
    // at initialization every p's runq is empty, so this branch is always taken
    if runqempty(p) {
        // put the p on the global scheduler's pidle list
        pidleput(p)
    } else {
        // ...
    }
}

pidleput is also fairly simple; there is not much to say:

func pidleput(_p_ *p) {
    if !runqempty(_p_) {
        throw("pidleput: P has non-empty run queue")
    }
    // 簡(jiǎn)單的鏈表操作
    _p_.link = sched.pidle
    sched.pidle.set(_p_)

    // pidle count + 1
    atomic.Xadd(&sched.npidle, 1)
}

All p's are fully initialized at program startup, unless runtime.GOMAXPROCS is called manually.

func GOMAXPROCS(n int) int {
    lock(&sched.lock)
    ret := int(gomaxprocs)
    unlock(&sched.lock)
    if n <= 0 || n == ret {
        return ret
    }

    stopTheWorld("GOMAXPROCS")

    // newprocs will be processed by startTheWorld
    newprocs = int32(n)

    startTheWorld()
    return ret
}

startTheWorld then calls procresize.
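
A quick illustration of the public API: GOMAXPROCS returns the previous value, and a non-positive argument only queries without resizing:

package main

import (
    "fmt"
    "runtime"
)

func main() {
    prev := runtime.GOMAXPROCS(2) // stops the world; procresize runs in startTheWorld
    fmt.Println("previous:", prev)
    fmt.Println("current:", runtime.GOMAXPROCS(0)) // 0 means: just query
}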

How g is created

In user code you usually write:

go func() {
    // do the stuff
}()

This is actually translated into runtime.newproc; the special syntax is just sugar. If you wanted to implement something similar in another language, you would only need to implement what the compiler translates it into. The flow:

graph TD
runtime.newproc --> runtime.newproc1

What newproc does is fairly simple:

func newproc(siz int32, fn *funcval) {
    // add does pointer arithmetic: skip past the function pointer
    // to find the start of the arguments on the stack
    argp := add(unsafe.Pointer(&fn), sys.PtrSize)
    pc := getcallerpc()
    systemstack(func() {
        newproc1(fn, (*uint8)(argp), siz, pc)
    })
}

// funcval is a variable-size struct whose first member is the function pointer,
// which is why the add above skips over fn
type funcval struct {
    fn uintptr
    // variable-size, fn-specific data here
}
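
Because newproc copies the argument bytes at the go statement itself, the familiar semantics follow: arguments are evaluated and captured when the goroutine is created, not when it runs. For example:

package main

import (
    "fmt"
    "time"
)

func main() {
    i := 1
    go fmt.Println(i) // newproc copies i's value now: this prints 1
    i = 2
    time.Sleep(50 * time.Millisecond)
}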

getcallerpc and getcallersp appear frequently in the runtime; the comments in the code explain them clearly enough:

// For example:
//
// func f(arg1, arg2, arg3 int) {
//    pc := getcallerpc()
//    sp := getcallersp(unsafe.Pointer(&arg1))
//}
//
// These two lines find the PC and SP immediately following
// the call to f (where f will return).
//

getcallerpc returns the address of the instruction right after the call, i.e. the next instruction to execute when the callee returns.
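
The public analogue of these helpers is runtime.Caller, which reports the pc of the call site; a small sketch:

package main

import (
    "fmt"
    "runtime"
)

func whoCalledMe() {
    pc, file, line, ok := runtime.Caller(1) // 1 = the caller's frame
    if ok {
        fmt.Printf("called from %s (%s:%d)\n", runtime.FuncForPC(pc).Name(), file, line)
    }
}

func main() {
    whoCalledMe()
}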

systemstack is also used a lot in the runtime; it switches the m onto g0 to run various scheduling functions. What g0 is will be covered when we get to m.

The workflow of newproc1 is also fairly simple:

graph TD
newproc1 --> newg
newg[gfget] --> nil{is nil?}
nil -->|yes|C[malg]
nil -->|no|E[init stack]
C --> D[set g status=> idle->dead]
D --> allgadd
E --> G[set g status=> dead-> runnable]
allgadd --> G
G --> runqput

The code, with details we don't care about removed:

func newproc1(fn *funcval, argp *uint8, narg int32, callerpc uintptr) {
    _g_ := getg()

    if fn == nil {
        _g_.m.throwing = -1 // do not dump full stacks
        throw("go of nil func value")
    }
    _g_.m.locks++ // disable preemption because it can be holding p in a local var
    siz := narg
    siz = (siz + 7) &^ 7


    _p_ := _g_.m.p.ptr()
    newg := gfget(_p_)
    if newg == nil {
        newg = malg(_StackMin)
        casgstatus(newg, _Gidle, _Gdead)
        allgadd(newg) // publishes with a g->status of Gdead so GC scanner doesn't look at uninitialized stack.
    }

    totalSize := 4*sys.RegSize + uintptr(siz) + sys.MinFrameSize // extra space in case of reads slightly beyond frame
    totalSize += -totalSize & (sys.SpAlign - 1)                  // align to spAlign
    sp := newg.stack.hi - totalSize
    spArg := sp

    // initialize the g: its gobuf context, its m's curg,
    // and the various registers
    memclrNoHeapPointers(unsafe.Pointer(&newg.sched), unsafe.Sizeof(newg.sched))
    newg.sched.sp = sp
    newg.stktopsp = sp
    newg.sched.pc = funcPC(goexit) + sys.PCQuantum // +PCQuantum so that previous instruction is in same function
    newg.sched.g = guintptr(unsafe.Pointer(newg))
    gostartcallfn(&newg.sched, fn)
    newg.gopc = callerpc
    newg.startpc = fn.fn
    if _g_.m.curg != nil {
        newg.labels = _g_.m.curg.labels
    }

    casgstatus(newg, _Gdead, _Grunnable)

    newg.goid = int64(_p_.goidcache)
    _p_.goidcache++
    runqput(_p_, newg, true)

    if atomic.Load(&sched.npidle) != 0 && atomic.Load(&sched.nmspinning) == 0 && mainStarted {
        wakep()
    }
    _g_.m.locks--
    if _g_.m.locks == 0 && _g_.preempt { // restore the preemption request in case we've cleared it in newstack
        _g_.stackguard0 = stackPreempt
    }
}

So the net effect of go func is a runqput call that puts the g on a run queue. But before enqueueing, a little trick is played:

newg.sched.pc = funcPC(goexit) + sys.PCQuantum // +PCQuantum so that previous instruction is in same function

gostartcallfn

// adjust Gobuf as if it executed a call to fn
// and then did an immediate gosave.
func gostartcallfn(gobuf *gobuf, fv *funcval) {
    var fn unsafe.Pointer
    if fv != nil {
        fn = unsafe.Pointer(fv.fn)
    } else {
        fn = unsafe.Pointer(funcPC(nilfunc))
    }
    gostartcall(gobuf, fn, unsafe.Pointer(fv))
}

// adjust Gobuf as if it executed a call to fn with context ctxt
// and then did an immediate gosave.
func gostartcall(buf *gobuf, fn, ctxt unsafe.Pointer) {
    sp := buf.sp
    if sys.RegSize > sys.PtrSize {
        sp -= sys.PtrSize
        *(*uintptr)(unsafe.Pointer(sp)) = 0
    }
    sp -= sys.PtrSize
    *(*uintptr)(unsafe.Pointer(sp)) = buf.pc // note: buf.pc at this point is actually the pc of goexit
    buf.sp = sp
    buf.pc = uintptr(fn)
    buf.ctxt = ctxt
}

In gostartcall, the goexit address that newproc1 stored in buf.pc is pushed onto the top of the goroutine's stack, and buf.pc is then reset to the goroutine function itself. The point is that whenever any goroutine's function returns via RET, the goexit address saved at sp is popped into the pc register, so every goroutine ends up running runtime.goexit when its function finishes, doing some cleanup before re-entering schedule.

The later walkthrough of m's schedule shows the scheduling loop in more detail.

runqput

Because the g is put on a runq rather than executed directly, when it actually starts running is not something user code can decide. Let's look at runqput:

// runqput tries to put g on the local runnable queue.
// If next is false, runqput adds g to the tail of the runnable queue.
// If next is true, runqput puts g in the _p_.runnext slot.
// If the run queue is full, runnext puts g on the global queue.
// Executed only by the owner P.
func runqput(_p_ *p, gp *g, next bool) {
    if randomizeScheduler && next && fastrand()%2 == 0 {
        next = false
    }

    if next {
    retryNext:
        oldnext := _p_.runnext
        if !_p_.runnext.cas(oldnext, guintptr(unsafe.Pointer(gp))) {
            goto retryNext
        }
        if oldnext == 0 {
            return
        }
        // kick the previous runnext out to the regular runq
        gp = oldnext.ptr()
    }

retry:
    h := atomic.Load(&_p_.runqhead) // load-acquire, synchronize with consumers
    t := _p_.runqtail
    if t-h < uint32(len(_p_.runq)) {
        _p_.runq[t%uint32(len(_p_.runq))].set(gp)
        atomic.Store(&_p_.runqtail, t+1) // store-release, makes the item available for consumption
        return
    }
    if runqputslow(_p_, gp, h, t) {
        return
    }
    // if the queue isn't full, the put above will succeed on retry
    goto retry
}

runqputslow

// the slow path: moves a batch of g's from the local queue (half of it,
// including the current one) to the global queue in one go
// executed only by the g's owner P
func runqputslow(_p_ *p, gp *g, h, t uint32) bool {
    var batch [len(_p_.runq)/2 + 1]*g

    // first grab a batch of g's from the local queue
    n := t - h
    n = n / 2
    if n != uint32(len(_p_.runq)/2) {
        throw("runqputslow: queue is not full")
    }
    for i := uint32(0); i < n; i++ {
        batch[i] = _p_.runq[(h+i)%uint32(len(_p_.runq))].ptr()
    }
    if !atomic.Cas(&_p_.runqhead, h, h+n) { // cas-release, commits consume
        return false
    }
    batch[n] = gp

    if randomizeScheduler {
        for i := uint32(1); i <= n; i++ {
            j := fastrandn(i + 1)
            batch[i], batch[j] = batch[j], batch[i]
        }
    }

    // link these goroutines into a list
    for i := uint32(0); i < n; i++ {
        batch[i].schedlink.set(batch[i+1])
    }

    // put the list on the global queue
    lock(&sched.lock)
    globrunqputbatch(batch[0], batch[n], int32(n+1))
    unlock(&sched.lock)
    return true
}

Touching the global sched requires taking the global sched.lock; contention on a global lock is expensive, which is why this path is called slow. When p and g interact within an m, the context is always single-threaded, so locks can often be avoided.
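
The local runq is the classic fixed-size ring buffer with an atomic head and tail: only the owner P appends, so the tail needs no CAS, while stealers CAS the head. A minimal sketch of that idea (illustrative only, not the runtime's actual code):

package main

import (
    "fmt"
    "sync/atomic"
)

type ringQueue struct {
    head uint32 // advanced by consumers/stealers via CAS
    tail uint32 // written only by the owner
    buf  [8]int
}

func (q *ringQueue) put(v int) bool {
    h := atomic.LoadUint32(&q.head)
    t := q.tail
    if t-h < uint32(len(q.buf)) {
        q.buf[t%uint32(len(q.buf))] = v
        atomic.StoreUint32(&q.tail, t+1) // publish the new item
        return true
    }
    return false // full: the runtime would fall back to runqputslow here
}

func (q *ringQueue) get() (int, bool) {
    for {
        h := atomic.LoadUint32(&q.head)
        t := atomic.LoadUint32(&q.tail)
        if t == h {
            return 0, false
        }
        v := q.buf[h%uint32(len(q.buf))]
        if atomic.CompareAndSwapUint32(&q.head, h, h+1) {
            return v, true
        }
    }
}

func main() {
    var q ringQueue
    for i := 0; i < 10; i++ {
        if !q.put(i) {
            fmt.Println("queue full at", i)
            break
        }
    }
    for {
        v, ok := q.get()
        if !ok {
            break
        }
        fmt.Print(v, " ")
    }
    fmt.Println()
}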

How m works

There are three kinds of threads in the runtime: the main thread, the thread that runs sysmon, and ordinary user threads. The main thread is represented in the runtime by the global variable runtime.m0. User threads are just normal threads that bind to a p and run the tasks in g's. Although there are three kinds, the first two have exactly one instance each in the whole runtime; only user threads have many instances.

The main thread m0

The main thread runs runtime.main; the flow is linear, with no branching:

graph TD
runtime.main --> A[init max stack size]
A --> B[systemstack execute -> newm -> sysmon]
B --> runtime.lockOsThread
runtime.lockOsThread --> runtime.init
runtime.init --> runtime.gcenable
runtime.gcenable --> main.init
main.init --> main.main

The sysmon thread

sysmon is started from runtime.main, but note that it does not run on m0, because of:

systemstack(func() {
    newm(sysmon, nil)
})

This creates a new m, but one unlike ordinary threads: it can run without binding a p, detached from the whole scheduling system.

Internally, sysmon is an infinite loop, mainly responsible for the following:

  1. checkdead: checks whether all goroutines are deadlocked; if so, calls runtime.throw directly and forces an exit. This is done only once, at startup

  2. injects the results returned by netpoll into the global sched run queue

  3. retakes p's that have been blocked in syscalls for a long time, and preempts g's that have been running for too long

  4. releases span memory that has sat idle for more than 5 minutes

Flow chart:

graph TD
sysmon --> usleep
usleep --> checkdead
checkdead --> |every 10ms|C[netpollinited && lastpoll != 0]
C --> |yes|netpoll
netpoll --> injectglist
injectglist --> retake
C --> |no|retake
retake --> A[check forcegc needed]
A --> B[scavenge heap once in a while]
B --> usleep

// sysmon runs without needing to bind a P, so write barriers are not allowed
//
//go:nowritebarrierrec
func sysmon() {
    lock(&sched.lock)
    sched.nmsys++
    checkdead()
    unlock(&sched.lock)

    // if a heap span has gone unused for 5 minutes after a GC,
    // return it to the operating system
    scavengelimit := int64(5 * 60 * 1e9)

    if debug.scavenge > 0 {
        // Scavenge-a-lot for testing.
        forcegcperiod = 10 * 1e6
        scavengelimit = 20 * 1e6
    }

    lastscavenge := nanotime()
    nscavenge := 0

    lasttrace := int64(0)
    idle := 0 // how many cycles in succession we had not wokeup somebody
    delay := uint32(0)
    for {
        if idle == 0 { // start with a 20us sleep
            delay = 20
        } else if idle > 50 { // start doubling the sleep after 1ms...
            delay *= 2
        }
        if delay > 10*1000 { // cap at 10ms
            delay = 10 * 1000
        }
        usleep(delay)
        if debug.schedtrace <= 0 && (sched.gcwaiting != 0 || atomic.Load(&sched.npidle) == uint32(gomaxprocs)) {
            lock(&sched.lock)
            if atomic.Load(&sched.gcwaiting) != 0 || atomic.Load(&sched.npidle) == uint32(gomaxprocs) {
                atomic.Store(&sched.sysmonwait, 1)
                unlock(&sched.lock)
                // Make wake-up period small enough
                // for the sampling to be correct.
                maxsleep := forcegcperiod / 2
                if scavengelimit < forcegcperiod {
                    maxsleep = scavengelimit / 2
                }
                shouldRelax := true
                if osRelaxMinNS > 0 {
                    next := timeSleepUntil()
                    now := nanotime()
                    if next-now < osRelaxMinNS {
                        shouldRelax = false
                    }
                }
                if shouldRelax {
                    osRelax(true)
                }
                notetsleep(&sched.sysmonnote, maxsleep)
                if shouldRelax {
                    osRelax(false)
                }
                lock(&sched.lock)
                atomic.Store(&sched.sysmonwait, 0)
                noteclear(&sched.sysmonnote)
                idle = 0
                delay = 20
            }
            unlock(&sched.lock)
        }
        // trigger libc interceptors if needed
        if *cgo_yield != nil {
            asmcgocall(*cgo_yield, nil)
        }
        // if the network hasn't been polled in 10ms, netpoll once
        lastpoll := int64(atomic.Load64(&sched.lastpoll))
        now := nanotime()
        if netpollinited() && lastpoll != 0 && lastpoll+10*1000*1000 < now {
            atomic.Cas64(&sched.lastpoll, uint64(lastpoll), uint64(now))
            gp := netpoll(false) // non-blocking: returns a list of goroutines
            if gp != nil {
                // Need to decrement number of idle locked M's
                // (pretending that one more is running) before injectglist.
                // Otherwise it can lead to the following situation:
                // injectglist grabs all P's but before it starts M's to run the P's,
                // another M returns from syscall, finishes running its G,
                // observes that there is no work to do and no other running M's
                // and reports deadlock.
                incidlelocked(-1)
                injectglist(gp)
                incidlelocked(1)
            }
        }
        // retake P's blocked in syscall
        // preempt G's that have been running for too long
        if retake(now) != 0 {
            idle = 0
        } else {
            idle++
        }
        // check whether a forced GC is needed (the once-every-two-minutes kind)
        if t := (gcTrigger{kind: gcTriggerTime, now: now}); t.test() && atomic.Load(&forcegc.idle) != 0 {
            lock(&forcegc.lock)
            forcegc.idle = 0
            forcegc.g.schedlink = 0
            injectglist(forcegc.g)
            unlock(&forcegc.lock)
        }
        // scavenge the heap once in a while
        if lastscavenge+scavengelimit/2 < now {
            mheap_.scavenge(int32(nscavenge), uint64(now), uint64(scavengelimit))
            lastscavenge = now
            nscavenge++
        }
        if debug.schedtrace > 0 && lasttrace+int64(debug.schedtrace)*1000000 <= now {
            lasttrace = now
            schedtrace(debug.scheddetail > 0)
        }
    }
}

checkdead

// checkdead checks for deadlock
// the check is based on the number of currently running m's; if it is 0, we are deadlocked
// sched.lock must be held when calling this
func checkdead() {
    // for -buildmode=c-shared or -buildmode=c-archive it's fine for
    // no goroutine to be running, since the program calling this library is presumably running
    if islibrary || isarchive {
        return
    }

    // If we are dying because of a signal caught on an already idle thread,
    // freezetheworld will cause all running threads to block.
    // And runtime will essentially enter into deadlock state,
    // except that there is a thread that will call exit soon.
    if panicking > 0 {
        return
    }

    run := mcount() - sched.nmidle - sched.nmidlelocked - sched.nmsys
    if run > 0 {
        return
    }
    if run < 0 {
        print("runtime: checkdead: nmidle=", sched.nmidle, " nmidlelocked=", sched.nmidlelocked, " mcount=", mcount(), " nmsys=", sched.nmsys, "\n")
        throw("checkdead: inconsistent counts")
    }

    grunning := 0
    lock(&allglock)
    for i := 0; i < len(allgs); i++ {
        gp := allgs[i]
        if isSystemGoroutine(gp) {
            continue
        }
        s := readgstatus(gp)
        switch s &^ _Gscan {
        case _Gwaiting:
            grunning++
        case _Grunnable,
            _Grunning,
            _Gsyscall:
            unlock(&allglock)
            print("runtime: checkdead: find g ", gp.goid, " in status ", s, "\n")
            throw("checkdead: runnable g")
        }
    }
    unlock(&allglock)
    if grunning == 0 { // possible if main goroutine calls runtime·Goexit()
        throw("no goroutines (main called runtime.Goexit) - deadlock!")
    }

    // Maybe jump time forward for playground.
    gp := timejump()
    if gp != nil {
        casgstatus(gp, _Gwaiting, _Grunnable)
        globrunqput(gp)
        _p_ := pidleget()
        if _p_ == nil {
            throw("checkdead: no p for timer")
        }
        mp := mget()
        if mp == nil {
            // There should always be a free M since
            // nothing is running.
            throw("checkdead: no m for timer")
        }
        mp.nextp.set(_p_)
        notewakeup(&mp.park)
        return
    }

    getg().m.throwing = -1 // do not dump full stacks
    throw("all goroutines are asleep - deadlock!")
}
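
The throw at the bottom is the familiar crash message, and it is easy to trigger from user code; once no m is left running user code and nothing is runnable, checkdead fires:

package main

func main() {
    ch := make(chan int)
    <-ch // no sender can ever exist: "fatal error: all goroutines are asleep - deadlock!"
}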

retake

// forcePreemptNS is the time slice given to a G before it is
// preempted.
const forcePreemptNS = 10 * 1000 * 1000 // 10ms

func retake(now int64) uint32 {
    n := 0
    // Prevent allp slice changes. This lock will be completely
    // uncontended unless we're already stopping the world.
    lock(&allpLock)
    // We can't use a range loop over allp because we may
    // temporarily drop the allpLock. Hence, we need to re-fetch
    // allp each time around the loop.
    for i := 0; i < len(allp); i++ {
        _p_ := allp[i]
        if _p_ == nil {
            // this can happen while procresize has changed allp
            // but the new p's have not been created yet
            continue
        }
        pd := &_p_.sysmontick
        s := _p_.status
        if s == _Psyscall {
            // retake the P from syscall if it has been in the syscall for more than one sysmon tick (at least 20us)
            t := int64(_p_.syscalltick)
            if int64(pd.syscalltick) != t {
                pd.syscalltick = uint32(t)
                pd.syscallwhen = now
                continue
            }
            // on the one hand, we don't want to retake the p if there is no other work to do;
            // on the other hand, we eventually retake it anyway, so the sysmon thread doesn't go to sleep
            if runqempty(_p_) && atomic.Load(&sched.nmspinning)+atomic.Load(&sched.npidle) > 0 && pd.syscallwhen+10*1000*1000 > now {
                continue
            }
            // drop allpLock so we can take sched.lock
            unlock(&allpLock)
            // Need to decrement number of idle locked M's
            // (pretending that one more is running) before the CAS.
            // Otherwise the M from which we retake can exit the syscall,
            // increment nmidle and report deadlock.
            incidlelocked(-1)
            if atomic.Cas(&_p_.status, s, _Pidle) {
                if trace.enabled {
                    traceGoSysBlock(_p_)
                    traceProcStop(_p_)
                }
                n++
                _p_.syscalltick++
                handoffp(_p_)
            }
            incidlelocked(1)
            lock(&allpLock)
        } else if s == _Prunning {
            // if the G has been running for too long, preempt it
            t := int64(_p_.schedtick)
            if int64(pd.schedtick) != t {
                pd.schedtick = uint32(t)
                pd.schedwhen = now
                continue
            }
            if pd.schedwhen+forcePreemptNS > now {
                continue
            }
            preemptone(_p_)
        }
    }
    unlock(&allpLock)
    return uint32(n)
}

Ordinary threads

Ordinary threads are the M of our G/P/M model; an M corresponds to an operating system thread.

Thread creation

As we saw when the sysmon thread was created, the function that creates threads is newm.

graph TD
newm --> newm1
newm1 --> newosproc
newosproc --> clone

This eventually reaches clone, the Linux thread-creation syscall. We don't care about the large cgo-related sections of the code; with the cgo logic stripped out, it looks like this:

// newm creates a new m; it will start by running fn, or the schedule function
// fn must be static, not a heap-allocated closure
// m.p may be nil while the m runs, so write barriers are not allowed
//go:nowritebarrierrec
func newm(fn func(), _p_ *p) {
    mp := allocm(_p_, fn)
    mp.nextp.set(_p_)
    mp.sigmask = initSigmask
    newm1(mp)
}

The p passed in is assigned to the m's nextp member. When the m runs schedule, it takes nextp out and performs the real binding (which just means setting nextp to nil, assigning it to m.p, and assigning the m to p.m).

func newm1(mp *m) {
    execLock.rlock() // Prevent process clone.
    newosproc(mp, unsafe.Pointer(mp.g0.stack.hi))
    execLock.runlock()
}
func newosproc(mp *m, stk unsafe.Pointer) {
    // Disable signals during clone, so that the new thread starts
    // with signals disabled. It will enable them in minit.
    var oset sigset
    sigprocmask(_SIG_SETMASK, &sigset_all, &oset)
    ret := clone(cloneFlags, stk, unsafe.Pointer(mp), unsafe.Pointer(mp.g0), unsafe.Pointer(funcPC(mstart)))
    sigprocmask(_SIG_SETMASK, &oset, nil)

    if ret < 0 {
        print("runtime: failed to create new OS thread (have ", mcount(), " already; errno=", -ret, ")\n")
        if ret == -_EAGAIN {
            println("runtime: may need to increase max user processes (ulimit -u)")
        }
        throw("newosproc")
    }
}

Workflow

Idle m's are thrown onto the global scheduler's midle list; when an m is needed, this list is tried first:

//go:nowritebarrierrec
// try to get an m from the midle list
// the global sched must be locked
// may run during STW, so write barriers are not allowed
func mget() *m {
    mp := sched.midle.ptr()
    if mp != nil {
        sched.midle = mp.schedlink
        sched.nmidle--
    }
    return mp
}

If none is available, the newm mentioned earlier is called to create a new thread. Created threads are never destroyed; even when that many m's are no longer needed, they are simply parked on midle.
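
A rough experiment to watch m's being created on demand (assuming the usual behavior that a parked locked m keeps its thread): goroutines that lock themselves to a thread and then block each pin an m, so the runtime has to clone more threads for everyone else. The threadcreate profile counts threads ever created:

package main

import (
    "fmt"
    "runtime"
    "runtime/pprof"
    "time"
)

func main() {
    fmt.Println("threads created at start:", pprof.Lookup("threadcreate").Count())
    for i := 0; i < 8; i++ {
        go func() {
            runtime.LockOSThread() // wire this g to its own m
            time.Sleep(time.Hour)  // park; the m stays dedicated to this g
        }()
    }
    time.Sleep(100 * time.Millisecond)
    fmt.Println("threads created after:", pprof.Lookup("threadcreate").Count())
}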

So when are threads created? Trace the callers of newm:

graph TD
main --> |sysmon|newm
startTheWorld --> startTheWorldWithSema
gcMarkTermination --> startTheWorldWithSema
gcStart--> startTheWorldWithSema
startTheWorldWithSema --> |helpgc|newm
startTheWorldWithSema --> |run p|newm
startm --> mget
mget --> |if no free m|newm
startTemplateThread --> |templateThread|newm
LockOsThread --> startTemplateThread
main --> |iscgo|startTemplateThread
handoffp --> startm
wakep --> startm
injectglist --> startm

Basically, m's are created on demand: if sched.midle has no idle m and one is needed now, a new one is created.

A newly created thread only starts executing once it is bound to a p, and the p may be taken away while it runs. For example, the retake flow above sets the g's stackguard0 to stackPreempt; the next time the g enters newstack, it checks for this preemption mark and, if present, gives up running. This is the so-called cooperative preemption.
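
Cooperative means a g with no preemption point never notices the stackPreempt mark. On Go versions of this article's era (before the asynchronous preemption added in Go 1.14), a tight loop without function calls has no stack-growth prologue and therefore can never be preempted; with GOMAXPROCS=1 this program never prints "done" on such versions:

package main

import (
    "fmt"
    "runtime"
)

func main() {
    runtime.GOMAXPROCS(1)
    go func() {
        for {
        } // no function call, so no prologue ever checks stackguard0
    }()
    runtime.Gosched() // hand the only P to the busy goroutine, never to get it back
    fmt.Println("done")
}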

At its core, what a worker thread runs boils down to just two things: schedule() and findrunnable().

schedule

graph TD
schedule --> A[schedtick%61 == 0]
A --> |yes|globrunqget
A --> |no|runqget
globrunqget --> C[gp == nil]
C --> |no|execute
C --> |yes|runqget
runqget --> B[gp == nil]
B --> |no|execute
B --> |yes|findrunnable
findrunnable --> execute

// schedule performs one round of scheduling: find a runnable goroutine and execute it
// the scheduling function loops and never returns
func schedule() {
    _g_ := getg()

    if _g_.m.locks != 0 {
        throw("schedule: holding locks")
    }

    if _g_.m.lockedg != 0 {
        stoplockedm()
        execute(_g_.m.lockedg.ptr(), false) // Never returns.
    }

    // a g that is making a cgo call cannot be scheduled away,
    // because the cgo call is using the m's g0 stack
    if _g_.m.incgo {
        throw("schedule: in cgo")
    }

top:
    if sched.gcwaiting != 0 {
        gcstopm()
        goto top
    }
    if _g_.m.p.ptr().runSafePointFn != 0 {
        runSafePointFn()
    }

    var gp *g
    var inheritTime bool
    if trace.enabled || trace.shutdown {
        gp = traceReader()
        if gp != nil {
            casgstatus(gp, _Gwaiting, _Grunnable)
            traceGoUnpark(gp, 0)
        }
    }
    if gp == nil && gcBlackenEnabled != 0 {
        gp = gcController.findRunnableGCWorker(_g_.m.p.ptr())
    }
    if gp == nil {
        // check the global runq once every so often to ensure fairness;
        // otherwise two goroutines could keep readying each other
        // and completely monopolize the local runq
        if _g_.m.p.ptr().schedtick%61 == 0 && sched.runqsize > 0 {
            lock(&sched.lock)
            gp = globrunqget(_g_.m.p.ptr(), 1)
            unlock(&sched.lock)
        }
    }
    if gp == nil {
        gp, inheritTime = runqget(_g_.m.p.ptr())
        if gp != nil && _g_.m.spinning {
            throw("schedule: spinning with local work")
        }
    }
    if gp == nil {
        gp, inheritTime = findrunnable() // blocks until a goroutine is found
    }

    // this thread is about to run a goroutine and won't be spinning anymore,
    // so if it was marked as spinning, reset that state now,
    // possibly starting a new spinning M
    if _g_.m.spinning {
        resetspinning()
    }

    if gp.lockedm != 0 {
        // Hands off own p to the locked m,
        // then blocks waiting for a new p.
        startlockedm(gp)
        goto top
    }

    execute(gp, inheritTime)
}

The so-called scheduling loop on an m is really just the following loop:

graph TD
schedule --> execute
execute --> gogo
gogo --> goexit
goexit --> goexit1
goexit1 --> goexit0
goexit0 --> schedule

execute

// Schedules gp to run on the current M.
// If inheritTime is true, gp inherits the remaining time in the
// current time slice. Otherwise, it starts a new time slice.
// Never returns.
//
// Write barriers are allowed because this is called immediately after
// acquiring a P in several places.
//
//go:yeswritebarrierrec
func execute(gp *g, inheritTime bool) {
    _g_ := getg() // this may be the m's g0

    casgstatus(gp, _Grunnable, _Grunning)
    gp.waitsince = 0
    gp.preempt = false
    gp.stackguard0 = gp.stack.lo + _StackGuard
    if !inheritTime {
        _g_.m.p.ptr().schedtick++
    }
    _g_.m.curg = gp // hand the m's current-g slot to gp
    gp.m = _g_.m // point gp back at the m, completing the two-way link

    gogo(&gp.sched)
}

比較簡(jiǎn)單,綁定 g 和 m,然后 gogo 執(zhí)行綁定的 g 中的函數(shù)。

gogo

runtime.gogo is implemented in assembly; its job is to execute the func() of go func(). As you can see, it mainly moves the contents of the g's gobuf into registers, then continues execution from the instruction address stored in gobuf.pc.

// void gogo(Gobuf*)
// restore state from Gobuf; longjmp
TEXT runtime·gogo(SB), NOSPLIT, $16-8
    MOVQ    buf+0(FP), BX        // gobuf
    MOVQ    gobuf_g(BX), DX
    MOVQ    0(DX), CX        // make sure g != nil
    get_tls(CX)
    MOVQ    DX, g(CX)
    MOVQ    gobuf_sp(BX), SP    // restore SP
    MOVQ    gobuf_ret(BX), AX
    MOVQ    gobuf_ctxt(BX), DX
    MOVQ    gobuf_bp(BX), BP
    MOVQ    $0, gobuf_sp(BX)    // clear to help garbage collector
    MOVQ    $0, gobuf_ret(BX)
    MOVQ    $0, gobuf_ctxt(BX)
    MOVQ    $0, gobuf_bp(BX)
    MOVQ    gobuf_pc(BX), BX
    JMP    BX

Of course, there is something here that looks odd compared with ordinary hand-written assembly: in writing like gobuf_sp(BX), standard plan9 assembly would treat gobuf_sp as just a symbol, with no notion of an offset, yet here the name stands in for its offset. What is going on?

This is actually a privilege of the runtime, made possible with the linker's cooperation. Look at the comment at the top of the gobuf struct definition in the runtime:

// The offsets of sp, pc, and g are known to (hard-coded in) libmach.

So that's the story: the linker substitutes the offsets for us.

Goexit

Goexit:

// Goexit terminates the goroutine that calls it. No other goroutine is affected.
// Goexit runs all deferred calls before terminating the goroutine. Because Goexit
// is not a panic, any recover calls in those deferred functions will return nil.
//
// Calling Goexit from the main goroutine terminates that goroutine
// without func main returning. Since func main has not returned,
// the program continues execution of other goroutines.
// If all other goroutines exit, the program crashes.
func Goexit() {
    // Run all deferred functions for the current goroutine.
    // This code is similar to gopanic, see that implementation
    // for detailed comments.
    gp := getg()
    for {
        d := gp._defer
        if d == nil {
            break
        }
        if d.started {
            if d._panic != nil {
                d._panic.aborted = true
                d._panic = nil
            }
            d.fn = nil
            gp._defer = d.link
            freedefer(d)
            continue
        }
        d.started = true
        reflectcall(nil, unsafe.Pointer(d.fn), deferArgs(d), uint32(d.siz), uint32(d.siz))
        if gp._defer != d {
            throw("bad defer entry in Goexit")
        }
        d._panic = nil
        d.fn = nil
        gp._defer = d.link
        freedefer(d)
        // Note: we ignore recovers here because Goexit isn't a panic
    }
    goexit1()
}

// Finishes execution of the current goroutine.
func goexit1() {
    if raceenabled {
        racegoend()
    }
    if trace.enabled {
        traceGoEnd()
    }
    mcall(goexit0)
}
// The top-most function running on a goroutine
// returns to goexit+PCQuantum.
TEXT runtime·goexit(SB),NOSPLIT,$0-0
    BYTE    $0x90    // NOP
    CALL    runtime·goexit1(SB)    // does not return
    // traceback from goexit1 must hit code range of goexit
    BYTE    $0x90    // NOP
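
The user-facing semantics of Goexit match the doc comment above: deferred calls still run, and only the calling goroutine dies. For example:

package main

import (
    "fmt"
    "runtime"
    "time"
)

func worker() {
    defer fmt.Println("deferred calls still run")
    runtime.Goexit() // terminates only this goroutine
    fmt.Println("never reached")
}

func main() {
    go worker()
    time.Sleep(100 * time.Millisecond)
    fmt.Println("main is unaffected")
}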

mcall:

// func mcall(fn func(*g))
// Switch to m->g0's stack, call fn(g).
// Fn must never return. It should gogo(&g->sched)
// to keep running g.
TEXT runtime·mcall(SB), NOSPLIT, $0-8
    MOVQ    fn+0(FP), DI

    get_tls(CX)
    MOVQ    g(CX), AX    // save state in g->sched
    MOVQ    0(SP), BX    // caller's PC
    MOVQ    BX, (g_sched+gobuf_pc)(AX)
    LEAQ    fn+0(FP), BX    // caller's SP
    MOVQ    BX, (g_sched+gobuf_sp)(AX)
    MOVQ    AX, (g_sched+gobuf_g)(AX)
    MOVQ    BP, (g_sched+gobuf_bp)(AX)

    // switch to m->g0 & its stack, call fn
    MOVQ    g(CX), BX
    MOVQ    g_m(BX), BX
    MOVQ    m_g0(BX), SI
    CMPQ    SI, AX    // if g == m->g0 call badmcall
    JNE    3(PC)
    MOVQ    $runtime·badmcall(SB), AX
    JMP    AX
    MOVQ    SI, g(CX)    // g = m->g0
    MOVQ    (g_sched+gobuf_sp)(SI), SP    // sp = m->g0->sched.sp
    PUSHQ    AX
    MOVQ    DI, DX
    MOVQ    0(DI), DI
    CALL    DI
    POPQ    AX
    MOVQ    $runtime·badmcall2(SB), AX
    JMP    AX
    RET

wakep

// Tries to add one more P to execute G's.
// Called when a G is made runnable (newproc, ready).
func wakep() {
    // be conservative about spinning threads
    if !atomic.Cas(&sched.nmspinning, 0, 1) {
        return
    }
    startm(nil, true)
}

// Schedules some M to run the p (creates an M if necessary).
// If p==nil, tries to get an idle P, if no idle P's does nothing.
// May run with m.p==nil, so write barriers are not allowed.
// If spinning is set, the caller has incremented nmspinning and startm will
// either decrement nmspinning or set m.spinning in the newly started M.
//go:nowritebarrierrec
func startm(_p_ *p, spinning bool) {
    lock(&sched.lock)
    if _p_ == nil {
        _p_ = pidleget()
        if _p_ == nil {
             unlock(&sched.lock)
             if spinning {
                 // The caller incremented nmspinning, but there are no idle Ps,
                 // so it's okay to just undo the increment and give up.
                 if int32(atomic.Xadd(&sched.nmspinning, -1)) < 0 {
                     throw("startm: negative nmspinning")
                 }
             }
             return
        }
    }
    mp := mget()
    unlock(&sched.lock)
    if mp == nil {
        var fn func()
        if spinning {
            // The caller incremented nmspinning, so set m.spinning in the new M.
            fn = mspinning
        }
        newm(fn, _p_)
        return
    }
    if mp.spinning {
        throw("startm: m is spinning")
    }
    if mp.nextp != 0 {
        throw("startm: m has p")
    }
    if spinning && !runqempty(_p_) {
        throw("startm: p has runnable gs")
    }
    // The caller incremented nmspinning, so set m.spinning in the new M.
    mp.spinning = spinning
    mp.nextp.set(_p_)
    notewakeup(&mp.park)
}

Parking a goroutine

// Puts the current goroutine into a waiting state and calls unlockf.
// If unlockf returns false, the goroutine is resumed.
// unlockf must not access this G's stack, as it may be moved between
// the call to gopark and the call to unlockf.
func gopark(unlockf func(*g, unsafe.Pointer) bool, lock unsafe.Pointer, reason string, traceEv byte, traceskip int) {
    mp := acquirem()
    gp := mp.curg
    status := readgstatus(gp)
    if status != _Grunning && status != _Gscanrunning {
        throw("gopark: bad g status")
    }
    mp.waitlock = lock
    mp.waitunlockf = *(*unsafe.Pointer)(unsafe.Pointer(&unlockf))
    gp.waitreason = reason
    mp.waittraceev = traceEv
    mp.waittraceskip = traceskip
    releasem(mp)
    // can't do anything that might move the G between Ms here.
    mcall(park_m)
}

func goready(gp *g, traceskip int) {
    systemstack(func() {
        ready(gp, traceskip, true)
    })
}

// Mark gp ready to run.
func ready(gp *g, traceskip int, next bool) {
    if trace.enabled {
        traceGoUnpark(gp, traceskip)
    }

    status := readgstatus(gp)

    // Mark runnable.
    _g_ := getg()
    _g_.m.locks++ // disable preemption because it can be holding p in a local var
    if status&^_Gscan != _Gwaiting {
        dumpgstatus(gp)
        throw("bad g->status in ready")
    }

    // status is Gwaiting or Gscanwaiting, make Grunnable and put on runq
    casgstatus(gp, _Gwaiting, _Grunnable)
    runqput(_g_.m.p.ptr(), gp, next)
    if atomic.Load(&sched.npidle) != 0 && atomic.Load(&sched.nmspinning) == 0 {
        wakep()
    }
    _g_.m.locks--
    if _g_.m.locks == 0 && _g_.preempt { // restore the preemption request in case we've cleared it in newstack
        _g_.stackguard0 = stackPreempt
    }
}
func notesleep(n *note) {
    gp := getg()
    if gp != gp.m.g0 {
        throw("notesleep not on g0")
    }
    ns := int64(-1)
    if *cgo_yield != nil {
        // Sleep for an arbitrary-but-moderate interval to poll libc interceptors.
        ns = 10e6
    }
    for atomic.Load(key32(&n.key)) == 0 {
        gp.m.blocked = true
        futexsleep(key32(&n.key), 0, ns)
        if *cgo_yield != nil {
            asmcgocall(*cgo_yield, nil)
        }
        gp.m.blocked = false
    }
}

// One-time notifications.
func noteclear(n *note) {
    n.key = 0
}

func notewakeup(n *note) {
    old := atomic.Xchg(key32(&n.key), 1)
    if old != 0 {
        print("notewakeup - double wakeup (", old, ")\n")
        throw("notewakeup - double wakeup")
    }
    futexwakeup(key32(&n.key), 1)
}
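
gopark itself is runtime-internal, but its waitreason is what you see in goroutine dumps: a goroutine parked on a channel receive shows up as [chan receive]. A small way to observe it:

package main

import (
    "fmt"
    "runtime"
    "time"
)

func main() {
    ch := make(chan int)
    go func() { <-ch }() // parks via gopark with reason "chan receive"
    time.Sleep(50 * time.Millisecond)
    buf := make([]byte, 1<<16)
    n := runtime.Stack(buf, true) // true: dump all goroutines
    fmt.Printf("%s\n", buf[:n])
}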

findrunnable

findrunnable is rather complex; the flow chart below leaves the GC-related parts out:

graph TD
runqget --> A[gp == nil]
A --> |no|return
A --> |yes|globrunqget
globrunqget --> B[gp == nil]
B --> |no| return
B --> |yes| C[netpollinited && lastpoll != 0]
C --> |yes|netpoll
netpoll --> K[gp == nil]
K --> |no|return
K --> |yes|runqsteal
C --> |no|runqsteal
runqsteal --> D[gp == nil]
D --> |no|return
D --> |yes|E[globrunqget]
E --> F[gp == nil]
F --> |no| return
F --> |yes| G[check all p's runq]
G --> H[runq is empty]
H --> |no|runqget
H --> |yes|I[netpoll]
I --> J[gp == nil]
J --> |no| return
J --> |yes| stopm
stopm --> runqget

// findrunnable finds a runnable goroutine to execute
// it tries to steal g's from other P's, take g's from the global queue, and poll the network
func findrunnable() (gp *g, inheritTime bool) {
    _g_ := getg()

    // The conditions here and in handoffp must agree: if
    // findrunnable would return a G to run, handoffp must start
    // an M.

top:
    _p_ := _g_.m.p.ptr()
    if sched.gcwaiting != 0 {
        gcstopm()
        goto top
    }
    if _p_.runSafePointFn != 0 {
        runSafePointFn()
    }
    if fingwait && fingwake {
        if gp := wakefing(); gp != nil {
            ready(gp, 0, true)
        }
    }
    if *cgo_yield != nil {
        asmcgocall(*cgo_yield, nil)
    }

    // local runq
    if gp, inheritTime := runqget(_p_); gp != nil {
        return gp, inheritTime
    }

    // global runq
    if sched.runqsize != 0 {
        lock(&sched.lock)
        gp := globrunqget(_p_, 0)
        unlock(&sched.lock)
        if gp != nil {
            return gp, false
        }
    }

    // Poll network.
    // netpoll here is an optimization before we resort to work stealing
    // if there are no netpoll waiters, or a thread is blocked in netpoll, we can safely skip it
    // if there is any logical race with a blocked thread (e.g. it has already returned from
    // netpoll but has not set lastpoll yet), this thread will block in the netpoll below anyway
    if netpollinited() && atomic.Load(&netpollWaiters) > 0 && atomic.Load64(&sched.lastpoll) != 0 {
        if gp := netpoll(false); gp != nil { // non-blocking
            // netpoll returns a list of goroutines linked through schedlink
            injectglist(gp.schedlink.ptr())
            casgstatus(gp, _Gwaiting, _Grunnable)
            if trace.enabled {
                traceGoUnpark(gp, 0)
            }
            return gp, false
        }
    }

    // steal g's from other p's
    procs := uint32(gomaxprocs)
    if atomic.Load(&sched.npidle) == procs-1 {
        // GOMAXPROCS=1, or every p other than ours is idle
        // new work could only come from syscalls/cgo calls, the network, or timers;
        // none of those are put on local runqs, so there is nothing to steal
        goto stop
    }
    // block if the number of spinning M's is >= the number of busy P's
    // this prevents excessive CPU consumption when GOMAXPROCS is much
    // larger than 1 but the program's parallelism is low
    if !_g_.m.spinning && 2*atomic.Load(&sched.nmspinning) >= procs-atomic.Load(&sched.npidle) {
        goto stop
    }
    if !_g_.m.spinning {
        _g_.m.spinning = true
        atomic.Xadd(&sched.nmspinning, 1)
    }
    for i := 0; i < 4; i++ {
        for enum := stealOrder.start(fastrand()); !enum.done(); enum.next() {
            if sched.gcwaiting != 0 {
                goto top
            }
            stealRunNextG := i > 2 // first look for ready queues with more than 1 g
            if gp := runqsteal(_p_, allp[enum.position()], stealRunNextG); gp != nil {
                return gp, false
            }
        }
    }

stop:

    // nothing left to do; if we are in the GC mark phase, it is safe to scan
    // and blacken objects, so do idle-time marking rather than giving up the P
    if gcBlackenEnabled != 0 && _p_.gcBgMarkWorker != 0 && gcMarkWorkAvailable(_p_) {
        _p_.gcMarkWorkerMode = gcMarkWorkerIdleMode
        gp := _p_.gcBgMarkWorker.ptr()
        casgstatus(gp, _Gwaiting, _Grunnable)
        if trace.enabled {
            traceGoUnpark(gp, 0)
        }
        return gp, false
    }

    // Before we drop our P, make a snapshot of the allp slice,
    // which can change underfoot once we no longer block
    // safe-points. We don't need to snapshot the contents because
    // everything up to cap(allp) is immutable.
    allpSnapshot := allp

    // hand back the P and block
    lock(&sched.lock)
    if sched.gcwaiting != 0 || _p_.runSafePointFn != 0 {
        unlock(&sched.lock)
        goto top
    }
    if sched.runqsize != 0 {
        gp := globrunqget(_p_, 0)
        unlock(&sched.lock)
        return gp, false
    }
    if releasep() != _p_ {
        throw("findrunnable: wrong p")
    }
    pidleput(_p_)
    unlock(&sched.lock)

    // Delicate dance: thread transitions from spinning to non-spinning state,
    // potentially concurrently with submission of new goroutines. We must
    // drop nmspinning first and then check all per-P queues again (with
    // #StoreLoad memory barrier in between). If we do it the other way around,
    // another thread can submit a goroutine after we've checked all run queues
    // but before we drop nmspinning; as the result nobody will unpark a thread
    // to run the goroutine.
    // If we discover new work below, we need to restore m.spinning as a signal
    // for resetspinning to unpark a new worker thread (because there can be more
    // than one starving goroutine). However, if after discovering new work
    // we also observe no idle Ps, it is OK to just park the current thread:
    // the system is fully loaded so no spinning threads are required.
    // Also see "Worker thread parking/unparking" comment at the top of the file.
    wasSpinning := _g_.m.spinning
    if _g_.m.spinning {
        _g_.m.spinning = false
        if int32(atomic.Xadd(&sched.nmspinning, -1)) < 0 {
            throw("findrunnable: negative nmspinning")
        }
    }

    // check all runqueues once again
    for _, _p_ := range allpSnapshot {
        if !runqempty(_p_) {
            lock(&sched.lock)
            _p_ = pidleget()
            unlock(&sched.lock)
            if _p_ != nil {
                acquirep(_p_)
                if wasSpinning {
                    _g_.m.spinning = true
                    atomic.Xadd(&sched.nmspinning, 1)
                }
                goto top
            }
            break
        }
    }

    // check for idle-time GC work once again
    if gcBlackenEnabled != 0 && gcMarkWorkAvailable(nil) {
        lock(&sched.lock)
        _p_ = pidleget()
        if _p_ != nil && _p_.gcBgMarkWorker == 0 {
            pidleput(_p_)
            _p_ = nil
        }
        unlock(&sched.lock)
        if _p_ != nil {
            acquirep(_p_)
            if wasSpinning {
                _g_.m.spinning = true
                atomic.Xadd(&sched.nmspinning, 1)
            }
            // Go back to idle GC check.
            goto stop
        }
    }

    // poll network
    if netpollinited() && atomic.Load(&netpollWaiters) > 0 && atomic.Xchg64(&sched.lastpoll, 0) != 0 {
        if _g_.m.p != 0 {
            throw("findrunnable: netpoll with p")
        }
        if _g_.m.spinning {
            throw("findrunnable: netpoll with spinning")
        }
        gp := netpoll(true) // block until something is ready
        atomic.Store64(&sched.lastpoll, uint64(nanotime()))
        if gp != nil {
            lock(&sched.lock)
            _p_ = pidleget()
            unlock(&sched.lock)
            if _p_ != nil {
                acquirep(_p_)
                injectglist(gp.schedlink.ptr())
                casgstatus(gp, _Gwaiting, _Grunnable)
                if trace.enabled {
                    traceGoUnpark(gp, 0)
                }
                return gp, false
            }
            injectglist(gp)
        }
    }
    stopm()
    goto top
}

Unbinding m and p

handoffp

graph TD

mexit --> A[is m0?]
A --> |yes|B[handoffp]
A --> |no| C[iterate allm]
C --> |m found|handoffp
C --> |m not found| throw

forEachP --> |p status == syscall| handoffp

stoplockedm --> handoffp

entersyscallblock --> entersyscallblock_handoff
entersyscallblock_handoff --> handoffp

retake --> |p status == syscall| handoffp

Eventually the p is put back onto the global pidle queue:

// Hands off P from syscall or locked M.
// Always runs without a P, so write barriers are not allowed.
//go:nowritebarrierrec
func handoffp(_p_ *p) {
	// handoffp must start an M in any situation where
	// findrunnable would return a G to run on _p_.

	// if it has local work, start it straight away
	if !runqempty(_p_) || sched.runqsize != 0 {
		startm(_p_, false)
		return
	}
	// if it has GC work, start it straight away
	if gcBlackenEnabled != 0 && gcMarkWorkAvailable(_p_) {
		startm(_p_, false)
		return
	}
	// no local work, check that there are no spinning/idle M's,
	// otherwise our help is not required
	if atomic.Load(&sched.nmspinning)+atomic.Load(&sched.npidle) == 0 && atomic.Cas(&sched.nmspinning, 0, 1) { // TODO: fast atomic
		startm(_p_, true)
		return
	}
	lock(&sched.lock)
	if sched.gcwaiting != 0 {
		_p_.status = _Pgcstop
		sched.stopwait--
		if sched.stopwait == 0 {
			notewakeup(&sched.stopnote)
		}
		unlock(&sched.lock)
		return
	}
	if _p_.runSafePointFn != 0 && atomic.Cas(&_p_.runSafePointFn, 1, 0) {
		sched.safePointFn(_p_)
		sched.safePointWait--
		if sched.safePointWait == 0 {
			notewakeup(&sched.safePointNote)
		}
	}
	if sched.runqsize != 0 {
		unlock(&sched.lock)
		startm(_p_, false)
		return
	}
	// If this is the last running P and nobody is polling network,
	// need to wakeup another M to poll network.
	if sched.npidle == uint32(gomaxprocs-1) && atomic.Load64(&sched.lastpoll) != 0 {
		unlock(&sched.lock)
		startm(_p_, false)
		return
	}
	pidleput(_p_)
	unlock(&sched.lock)
}
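
The effect of handoffp can be observed indirectly: when goroutines block their Ms in raw syscalls, sysmon retakes the Ps and hands them off so that new Ms are started. Below is a minimal, Linux-only sketch of our own (not from the runtime source); the scenario is a made-up illustration, and only the standard-library calls (syscall.Pipe, syscall.Read, pprof.Lookup("threadcreate")) are real:

package main

import (
	"fmt"
	"runtime"
	"runtime/pprof"
	"syscall"
	"time"
)

func main() {
	runtime.GOMAXPROCS(1)
	fmt.Println("threads created before:", pprof.Lookup("threadcreate").Count())

	for i := 0; i < 8; i++ {
		var fds [2]int
		if err := syscall.Pipe(fds[:]); err != nil {
			panic(err)
		}
		go func(fd int) {
			buf := make([]byte, 1)
			syscall.Read(fd, buf) // blocks the whole M in a read(2) syscall
		}(fds[0])
	}

	// sysmon sees Ps stuck in _Psyscall, retakes them via handoffp,
	// and new Ms get started so Go code keeps running
	time.Sleep(100 * time.Millisecond)
	fmt.Println("threads created after:", pprof.Lookup("threadcreate").Count())
}

With a single P, the second count should be visibly larger than the first: each blocked M took its thread with it, and handoffp made sure the P kept working elsewhere.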

g state transitions

graph LR
start{newg} --> Gidle
Gidle --> |oneNewExtraM|Gdead
Gidle --> |newproc1|Gdead

Gdead --> |newproc1|Grunnable
Gdead --> |needm|Gsyscall

Gscanrunning --> |scang|Grunning

Grunnable --> |execute|Grunning

Gany --> |casgcopystack|Gcopystack

Gcopystack --> |newstack|Grunning

Gsyscall --> |dropm|Gdead
Gsyscall --> |exitsyscall0|Grunnable
Gsyscall --> |exitsyscall|Grunning

Grunning --> |goschedImpl|Grunnable
Grunning --> |goexit0|Gdead
Grunning --> |newstack|Gcopystack
Grunning --> |reentersyscall|Gsyscall
Grunning --> |entersyscallblock|Gsyscall
Grunning --> |markroot|Gwaiting
Grunning --> |gcAssistAlloc1|Gwaiting
Grunning --> |park_m|Gwaiting
Grunning --> |gcMarkTermination|Gwaiting
Grunning --> |gcBgMarkWorker|Gwaiting
Grunning --> |newstack|Gwaiting

Gwaiting --> |gcMarkTermination|Grunning
Gwaiting --> |gcBgMarkWorker|Grunning
Gwaiting --> |markroot|Grunning
Gwaiting --> |gcAssistAlloc1|Grunning
Gwaiting --> |newstack|Grunning
Gwaiting --> |findRunnableGCWorker|Grunnable
Gwaiting --> |ready|Grunnable
Gwaiting --> |findrunnable|Grunnable
Gwaiting --> |injectglist|Grunnable
Gwaiting --> |schedule|Grunnable
Gwaiting --> |park_m|Grunnable
Gwaiting --> |procresize|Grunnable
Gwaiting --> |checkdead|Grunnable

Gany in the diagram stands for any state. The GC-related transitions are fairly numerous; if you only care about the normal transitions, you can ignore markroot, gcMark and the like for now.
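
The Gwaiting state is easy to see from user code: a goroutine parked on a channel receive (the park_m edge in the graph above) shows its waitreason in a stack dump. A small sketch of our own, using only runtime.Stack from the standard library:

package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	ch := make(chan int)
	go func() {
		<-ch // gopark/park_m: Grunning -> Gwaiting, waitreason "chan receive"
	}()
	time.Sleep(10 * time.Millisecond)

	buf := make([]byte, 1<<16)
	n := runtime.Stack(buf, true) // true: dump all goroutines
	fmt.Printf("%s\n", buf[:n])   // the blocked goroutine prints as "goroutine N [chan receive]"
}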

p state transitions

graph LR

Pidle --> |acquirep1|Prunning

Psyscall --> |retake|Pidle
Psyscall --> |entersyscall_gcwait|Pgcstop
Psyscall --> |exitsyscallfast|Prunning

Pany --> |gcstopm|Pgcstop
Pany --> |forEachP|Pidle
Pany --> |releasep|Pidle
Pany --> |handoffp|Pgcstop
Pany --> |procresize release current p use allp 0|Pidle
Pany --> |procresize when init|Pgcstop
Pany --> |procresize when free old p| Pdead
Pany --> |procresize after resize use current p|Prunning
Pany --> |reentersyscall|Psyscall
Pany --> |stopTheWorldWithSema|Pgcstop

The preemption flow

Functions execute on the goroutine's stack, and that stack may overflow while the function is running. As we saw earlier, a function that uses stack space compares the sp register against stackguard0; since the stack grows from higher to lower memory addresses, sp < stackguard0 means the stack has grown to the point of overflowing.

So where does this comparison happen? It is inserted by the compiler. Let's look at the compiled output of a function; this snippet comes from go-internals:

0x0000 TEXT    "".main(SB), $24-0
  ;; stack-split prologue
  0x0000 MOVQ    (TLS), CX
  0x0009 CMPQ    SP, 16(CX)
  0x000d JLS    58

  0x000f SUBQ    $24, SP
  0x0013 MOVQ    BP, 16(SP)
  0x0018 LEAQ    16(SP), BP
  ;; ...omitted FUNCDATA stuff...
  0x001d MOVQ    $137438953482, AX
  0x0027 MOVQ    AX, (SP)
  ;; ...omitted PCDATA stuff...
  0x002b CALL    "".add(SB)
  0x0030 MOVQ    16(SP), BP
  0x0035 ADDQ    $24, SP
  0x0039 RET

  ;; stack-split epilogue
  0x003a NOP
  ;; ...omitted PCDATA stuff...
  0x003a CALL    runtime.morestack_noctxt(SB)
  0x003f JMP    0

The instructions inserted at the start of the function compare the SP register against stackguard0 in the g struct; JLS (jump if below or same) jumps to offset 58 when SP <= 16(CX).

;; stack-split prologue
  0x0000 MOVQ    (TLS), CX
  0x0009 CMPQ    SP, 16(CX)
  0x000d JLS    58

The CX register here holds the address of the current g, and 16(CX) is the location 16 bytes into the g struct. Looking back at the definition of g, 16 bytes skips exactly the first member, stack (16 bytes), landing on stackguard0.
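
The 16-byte offset is easy to verify with a mirror of the two leading fields (a sketch; the real runtime.g is of course not importable from user code):

package main

import (
	"fmt"
	"unsafe"
)

type stack struct {
	lo uintptr
	hi uintptr
}

type g struct {
	stack       stack
	stackguard0 uintptr
}

func main() {
	var gg g
	// on a 64-bit platform, stack occupies 16 bytes,
	// so stackguard0 sits at offset 16 -- exactly 16(CX)
	fmt.Println(unsafe.Offsetof(gg.stackguard0)) // 16
}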

58 in hexadecimal is 0x3a.

;; stack-split epilogue
  0x003a NOP
  ;; ...omitted PCDATA stuff...
  0x003a CALL    runtime.morestack_noctxt(SB)
  0x003f JMP    0

morestack_noctxt:

// morestack but not preserving ctxt.
TEXT runtime·morestack_noctxt(SB),NOSPLIT,$0
    MOVL    $0, DX
    JMP    runtime·morestack(SB)

morestack:

TEXT runtime·morestack(SB),NOSPLIT,$0-0
    // Cannot grow scheduler stack (m->g0).
    get_tls(CX)
    MOVQ    g(CX), BX
    MOVQ    g_m(BX), BX
    MOVQ    m_g0(BX), SI
    CMPQ    g(CX), SI
    JNE    3(PC)
    CALL    runtime·badmorestackg0(SB)
    INT    $3

    // Cannot grow signal stack (m->gsignal).
    MOVQ    m_gsignal(BX), SI
    CMPQ    g(CX), SI
    JNE    3(PC)
    CALL    runtime·badmorestackgsignal(SB)
    INT    $3

    // Called from f.
    // Set m->morebuf to f's caller.
    MOVQ    8(SP), AX    // f's caller's PC
    MOVQ    AX, (m_morebuf+gobuf_pc)(BX)
    LEAQ    16(SP), AX    // f's caller's SP
    MOVQ    AX, (m_morebuf+gobuf_sp)(BX)
    get_tls(CX)
    MOVQ    g(CX), SI
    MOVQ    SI, (m_morebuf+gobuf_g)(BX)

    // Set g->sched to context in f.
    MOVQ    0(SP), AX // f's PC
    MOVQ    AX, (g_sched+gobuf_pc)(SI)
    MOVQ    SI, (g_sched+gobuf_g)(SI)
    LEAQ    8(SP), AX // f's SP
    MOVQ    AX, (g_sched+gobuf_sp)(SI)
    MOVQ    BP, (g_sched+gobuf_bp)(SI)
    MOVQ    DX, (g_sched+gobuf_ctxt)(SI)

    // Call newstack on m->g0's stack.
    MOVQ    m_g0(BX), BX
    MOVQ    BX, g(CX)
    MOVQ    (g_sched+gobuf_sp)(BX), SP
    CALL    runtime·newstack(SB)
    MOVQ    $0, 0x1003    // crash if newstack returns
    RET

newstack:

// Called from runtime·morestack when more stack is needed.
// Allocate larger stack and relocate to new stack.
// Stack growth is multiplicative, for constant amortized cost.
//
// g->atomicstatus will be Grunning or Gscanrunning upon entry.
// If the GC is trying to stop this g then it will set preemptscan to true.
//
// This must be nowritebarrierrec because it can be called as part of
// stack growth from other nowritebarrierrec functions, but the
// compiler doesn't check this.
//
//go:nowritebarrierrec
func newstack() {
    thisg := getg()
    // TODO: double check all gp. shouldn't be getg().
    if thisg.m.morebuf.g.ptr().stackguard0 == stackFork {
        throw("stack growth after fork")
    }
    if thisg.m.morebuf.g.ptr() != thisg.m.curg {
        print("runtime: newstack called from g=", hex(thisg.m.morebuf.g), "\n"+"\tm=", thisg.m, " m->curg=", thisg.m.curg, " m->g0=", thisg.m.g0, " m->gsignal=", thisg.m.gsignal, "\n")
        morebuf := thisg.m.morebuf
        traceback(morebuf.pc, morebuf.sp, morebuf.lr, morebuf.g.ptr())
        throw("runtime: wrong goroutine in newstack")
    }

    gp := thisg.m.curg

    if thisg.m.curg.throwsplit {
        // Update syscallsp, syscallpc in case traceback uses them.
        morebuf := thisg.m.morebuf
        gp.syscallsp = morebuf.sp
        gp.syscallpc = morebuf.pc
        pcname, pcoff := "(unknown)", uintptr(0)
        f := findfunc(gp.sched.pc)
        if f.valid() {
            pcname = funcname(f)
            pcoff = gp.sched.pc - f.entry
        }
        print("runtime: newstack at ", pcname, "+", hex(pcoff),
            " sp=", hex(gp.sched.sp), " stack=[", hex(gp.stack.lo), ", ", hex(gp.stack.hi), "]\n",
            "\tmorebuf={pc:", hex(morebuf.pc), " sp:", hex(morebuf.sp), " lr:", hex(morebuf.lr), "}\n",
            "\tsched={pc:", hex(gp.sched.pc), " sp:", hex(gp.sched.sp), " lr:", hex(gp.sched.lr), " ctxt:", gp.sched.ctxt, "}\n")

        thisg.m.traceback = 2 // Include runtime frames
        traceback(morebuf.pc, morebuf.sp, morebuf.lr, gp)
        throw("runtime: stack split at bad time")
    }

    morebuf := thisg.m.morebuf
    thisg.m.morebuf.pc = 0
    thisg.m.morebuf.lr = 0
    thisg.m.morebuf.sp = 0
    thisg.m.morebuf.g = 0

    // NOTE: stackguard0 may change underfoot, if another thread
    // is about to try to preempt gp. Read it just once and use that same
    // value now and below.
    preempt := atomic.Loaduintptr(&gp.stackguard0) == stackPreempt

    // Be conservative about where we preempt.
    // We are interested in preempting user Go code, not runtime code.
    // If we're holding locks, mallocing, or preemption is disabled, don't
    // preempt.
    // This check is very early in newstack so that even the status change
    // from Grunning to Gwaiting and back doesn't happen in this case.
    // That status change by itself can be viewed as a small preemption,
    // because the GC might change Gwaiting to Gscanwaiting, and then
    // this goroutine has to wait for the GC to finish before continuing.
    // If the GC is in some way dependent on this goroutine (for example,
    // it needs a lock held by the goroutine), that small preemption turns
    // into a real deadlock.
    if preempt {
        if thisg.m.locks != 0 || thisg.m.mallocing != 0 || thisg.m.preemptoff != "" || thisg.m.p.ptr().status != _Prunning {
            // Let the goroutine keep running for now.
            // gp->preempt is set, so it will be preempted next time.
            gp.stackguard0 = gp.stack.lo + _StackGuard
            gogo(&gp.sched) // never return
        }
    }

    if gp.stack.lo == 0 {
        throw("missing stack in newstack")
    }
    sp := gp.sched.sp
    if sys.ArchFamily == sys.AMD64 || sys.ArchFamily == sys.I386 {
        // The call to morestack cost a word.
        sp -= sys.PtrSize
    }
    if stackDebug >= 1 || sp < gp.stack.lo {
        print("runtime: newstack sp=", hex(sp), " stack=[", hex(gp.stack.lo), ", ", hex(gp.stack.hi), "]\n",
            "\tmorebuf={pc:", hex(morebuf.pc), " sp:", hex(morebuf.sp), " lr:", hex(morebuf.lr), "}\n",
            "\tsched={pc:", hex(gp.sched.pc), " sp:", hex(gp.sched.sp), " lr:", hex(gp.sched.lr), " ctxt:", gp.sched.ctxt, "}\n")
    }
    if sp < gp.stack.lo {
        print("runtime: gp=", gp, ", gp->status=", hex(readgstatus(gp)), "\n ")
        print("runtime: split stack overflow: ", hex(sp), " < ", hex(gp.stack.lo), "\n")
        throw("runtime: split stack overflow")
    }

    if preempt {
        if gp == thisg.m.g0 {
            throw("runtime: preempt g0")
        }
        if thisg.m.p == 0 && thisg.m.locks == 0 {
            throw("runtime: g is running but p is not")
        }
        // Synchronize with scang.
        casgstatus(gp, _Grunning, _Gwaiting)
        if gp.preemptscan {
            for !castogscanstatus(gp, _Gwaiting, _Gscanwaiting) {
                // Likely to be racing with the GC as
                // it sees a _Gwaiting and does the
                // stack scan. If so, gcworkdone will
                // be set and gcphasework will simply
                // return.
            }
            if !gp.gcscandone {
                // gcw is safe because we're on the
                // system stack.
                gcw := &gp.m.p.ptr().gcw
                scanstack(gp, gcw)
                if gcBlackenPromptly {
                    gcw.dispose()
                }
                gp.gcscandone = true
            }
            gp.preemptscan = false
            gp.preempt = false
            casfrom_Gscanstatus(gp, _Gscanwaiting, _Gwaiting)
            // This clears gcscanvalid.
            casgstatus(gp, _Gwaiting, _Grunning)
            gp.stackguard0 = gp.stack.lo + _StackGuard
            gogo(&gp.sched) // never return
        }

        // Act like goroutine called runtime.Gosched.
        casgstatus(gp, _Gwaiting, _Grunning)
        gopreempt_m(gp) // never return
    }

    // Allocate a bigger segment and move the stack.
    oldsize := gp.stack.hi - gp.stack.lo
    newsize := oldsize * 2
    if newsize > maxstacksize {
        print("runtime: goroutine stack exceeds ", maxstacksize, "-byte limit\n")
        throw("stack overflow")
    }

    // The goroutine must be executing in order to call newstack,
    // so it must be Grunning (or Gscanrunning).
    casgstatus(gp, _Grunning, _Gcopystack)

    // The concurrent GC will not scan the stack while we are doing the copy since
    // the gp is in a Gcopystack status.
    copystack(gp, newsize, true)
    if stackDebug >= 1 {
        print("stack grow done\n")
    }
    casgstatus(gp, _Gcopystack, _Grunning)
    gogo(&gp.sched)
}

To summarize the flow:

graph TD
start[entering func] --> cmp[sp < stackguard0]
cmp --> |yes| morestack_noctxt
cmp --> |no|final[execute func]
morestack_noctxt --> morestack
morestack --> newstack
newstack --> preempt
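
To see this path actually fire, it is enough to force a stack grow. A sketch of our own: deep recursion with a sizable frame makes sp dip below stackguard0, so the prologue check repeatedly routes through morestack -> newstack -> copystack (the initial goroutine stack is only 2 KB):

package main

import "fmt"

//go:noinline
func grow(n int) int {
	var pad [1024]byte // roughly 1KB per frame, so a few frames exceed the initial stack
	pad[0] = byte(n)
	if n == 0 {
		return int(pad[0])
	}
	return grow(n-1) + int(pad[0])
}

func main() {
	// ~1000 frames * ~1KB is about 1MB of stack: along the way the stack
	// is doubled and copied several times, invisibly to this code
	fmt.Println(grow(1000))
}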

The preemption itself is always carried out in newstack, but the preempt flag is set at various other places in the Go source:

Let's see where stackPreempt gets assigned to stackguard0:

graph LR

unlock --> |in case cleared in newstack|restorePreempt
ready --> |in case cleared in newstack|restorePreempt
startTheWorldWithSema --> |in case cleared in newstack|restorePreempt
allocm --> |in case cleared in newstack|restorePreempt
exitsyscall --> |in case cleared in newstack|restorePreempt
newproc1--> |in case cleared in newstack|restorePreempt
releasem -->  |in case cleared in newstack|restorePreempt

scang --> setPreempt
reentersyscall --> setPreempt
entersyscallblock --> setPreempt
preemptone--> setPreempt

enlistWorker --> preemptone
retake --> preemptone
preemptall --> preemptone
freezetheworld --> preemptall
stopTheWorldWithSema --> preemptall
forEachP --> preemptall
startpanic_m --> freezetheworld
gcMarkDone --> forEachP

As the graph shows, only the GC and retake actually preempt a g; there is no other entry point. The remaining call sites merely restore a preempt flag that may have been cleared in newstack.

entersyscall and entersyscallblock are special cases here: although both set the preempt flag, that logic is never actually reached, because a syscall executes on the m's g0 stack, and being preempted there would make the runtime throw directly, with no way to recover.
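
Since the preemption check lives only in function prologues, a tight loop that never calls a function offers no preemption point at all on the runtime described here (asynchronous preemption only arrived later, in Go 1.14). A sketch of our own that demonstrates the effect; on such a runtime with a single P, main never gets the CPU back:

package main

import (
	"fmt"
	"runtime"
)

func main() {
	runtime.GOMAXPROCS(1)
	go func() {
		for {
			// no function call -> no stack-growth prologue -> the
			// stackPreempt mark set by preemptone is never checked
		}
	}()
	runtime.Gosched() // hand the only P to the busy loop; it never yields
	fmt.Println("unreachable on runtimes without async preemption")
}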

 

Source: http://xargin.com/go-scheduler/

 
