runtime: moving invariant code out of function and loop gives small gains in GreenTea GC #76212

@archanaravindar

Description

Go version

go version go1.26-devel_12c8d14d94 linux/amd64

Output of go env in your module/workspace:
[archana@dell-r640-007 sweet]$ go env
AR='ar'
CC='gcc'
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_ENABLED='1'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
CXX='g++'
GCCGO='gccgo'
GO111MODULE=''
GOAMD64='v3'
GOARCH='amd64'
GOAUTH='netrc'
GOBIN=''
GOCACHE='/home/archana/.cache/go-build'
GOCACHEPROG=''
GODEBUG=''
GOENV='/home/archana/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFIPS140='off'
GOFLAGS=''
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build3105282545=/tmp/go-build -gno-record-gcc-switches'
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMOD='/home/archana/benchmarks/go.mod'
GOMODCACHE='/home/archana/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/home/archana/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/lib/golang'
GOSUMDB='sum.golang.org'
GOTELEMETRY='local'
GOTELEMETRYDIR='/home/archana/.config/go/telemetry'
GOTMPDIR=''
GOTOOLCHAIN='local'
GOTOOLDIR='/usr/lib/golang/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.24.6 (Red Hat 1.24.6-1.el10_0)'
GOWORK=''
PKG_CONFIG='pkg-config'

In the Green Tea GC, which is now on by default in the version of Go tested,
heapBitsSmallForAddrInline lies on the hot path of scanObjectsSmall because it is invoked within a nested loop.
The first two statements of heapBitsSmallForAddrInline compute values that do not change for a given object size and span,
and are hence loop invariant.
Manually hoisting this code out of the inner loop as follows (hoisted lines are marked with --->) gives gains in some sweet benchmarks across multiple architectures
and does not cause statistically significant regressions in the other sweet benchmarks.
heapBitsSmallForAddrInlineWithHB is a variant of the function that takes the precomputed hbits as a parameter:

func scanObjectsSmall(base, objSize uintptr, elems uint16, gcw *gcWork, scans *gc.ObjMask) {
        nptrs := 0
        for i, bits := range scans {
                if i*(goarch.PtrSize*8) > int(elems) {
                        break
                }
                n := sys.OnesCount64(uint64(bits))
--->                hbitsBase, _ := spanHeapBitsRange(base, gc.PageSize, objSize)
--->                hbits := (*byte)(unsafe.Pointer(hbitsBase))
                for range n {
                        j := sys.TrailingZeros64(uint64(bits))
                        bits &^= 1 << j

                        b := base + uintptr(i*(goarch.PtrSize*8)+j)*objSize
                        ptrBits := heapBitsSmallForAddrInlineWithHB(hbits, base, b, objSize)
...
func heapBitsSmallForAddrInlineWithHB(hbits *byte, spanBase, addr, elemsize uintptr) uintptr {
        // These objects are always small enough that their bitmaps
        // fit in a single word, so just load the word or two we need.
        //
        // Mirrors mspan.writeHeapBitsSmall.
        //
        // We should be using heapBits(), but unfortunately it introduces
        // both bounds checks panics and throw which causes us to exceed
        // the nosplit limit in quite a few cases.
        i := (addr - spanBase) / goarch.PtrSize / ptrBits
        j := (addr - spanBase) / goarch.PtrSize % ptrBits
        bits := elemsize / goarch.PtrSize
        word0 := (*uintptr)(unsafe.Pointer(addb(hbits, goarch.PtrSize*(i+0))))
        word1 := (*uintptr)(unsafe.Pointer(addb(hbits, goarch.PtrSize*(i+1))))

...
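
For readers without the runtime sources at hand, the shape of the change can be sketched as a standalone program. This is only an illustration under stated assumptions: the span type, metaBase, visitPerObject and visitHoisted are hypothetical names, metaBase uses a made-up formula in place of spanHeapBitsRange, and math/bits is used as the public counterpart of the runtime-internal sys package. It shows that hoisting leaves the result unchanged while the invariant work runs once per mark word instead of once per marked object.

```go
package main

import (
	"fmt"
	"math/bits"
)

// span is a hypothetical stand-in for the per-span inputs; it is not a runtime type.
type span struct {
	base, objSize uint64
}

// metaBase stands in for the loop-invariant work (spanHeapBitsRange in the
// issue): its result depends only on the span, never on the object visited.
// The formula is made up for illustration.
func metaBase(s span) uint64 {
	return s.base + 8192 - s.objSize
}

// visitPerObject recomputes metaBase for every marked object.
func visitPerObject(s span, marks []uint64) (sum uint64, calls int) {
	for i, word := range marks {
		n := bits.OnesCount64(word)
		for k := 0; k < n; k++ {
			j := bits.TrailingZeros64(word)
			word &^= 1 << j // clear the lowest set bit
			mb := metaBase(s)
			calls++
			addr := s.base + uint64(i*64+j)*s.objSize
			sum += addr ^ mb
		}
	}
	return sum, calls
}

// visitHoisted computes metaBase once per mark word, before the inner loop.
func visitHoisted(s span, marks []uint64) (sum uint64, calls int) {
	for i, word := range marks {
		n := bits.OnesCount64(word)
		mb := metaBase(s)
		calls++
		for k := 0; k < n; k++ {
			j := bits.TrailingZeros64(word)
			word &^= 1 << j // clear the lowest set bit
			addr := s.base + uint64(i*64+j)*s.objSize
			sum += addr ^ mb
		}
	}
	return sum, calls
}

func main() {
	s := span{base: 0x10000, objSize: 48}
	marks := []uint64{0b1011, 0, 1<<63 | 1}
	sumA, callsA := visitPerObject(s, marks)
	sumB, callsB := visitHoisted(s, marks)
	fmt.Println(sumA == sumB, callsA, callsB) // prints: true 5 3
}
```

Note that the hoisted variant still evaluates metaBase for a mark word with no bits set, so the saving depends on how densely the words are populated.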

Ideally the compiler would hoist this invariant code itself.
Perhaps if heapBitsSmallForAddrInline were inlined, the compiler would be able to
move the invariant code out of the loop?
According to the inlining report, however,
the compiler considers this function too expensive to inline:
./mgcmark_greenteagc.go:1091:6: cannot inline heapBitsSmallForAddrInline: function too complex: cost 166 exceeds budget 80
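
A report of this kind can be reproduced on ordinary code by building with -gcflags=-m=2, which prints the inliner's decision and cost for each function. The file below is a minimal sketch for trying that out; wordIndex and wordBits are hypothetical names that loosely mirror the index arithmetic above, and the costs reported will depend on the toolchain version.

```go
// Build with:  go build -gcflags=-m=2
// The compiler then prints a "can inline" or "cannot inline" line per function.
package main

import "fmt"

// wordBits is a hypothetical stand-in for the runtime's ptrBits on a 64-bit target.
const wordBits = 64

// wordIndex splits a byte offset into a bitmap word index and a bit index.
// It is a small leaf function, so it is expected to fit within the inlining budget.
func wordIndex(off uint64) (word, bit uint64) {
	return off / 8 / wordBits, off / 8 % wordBits
}

func main() {
	w, b := wordIndex(8 * 70)
	fmt.Println(w, b) // prints: 1 6
}
```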

Moving the invariant code out of the outer loop entirely performed worse than the current version,
presumably due to the early loop exit.

A similar pattern occurs in heapBitsSmallForAddr, which is called from typePointersOfUnchecked, but it is not yet clear whether
it is as hot as scanObjectsSmall.

Metadata

Assignees

No one assigned

    Labels

    Implementation: Issues describing a semantics-preserving change to the Go implementation.
    NeedsInvestigation: Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
    Performance
    compiler/runtime: Issues related to the Go compiler and/or runtime.
