Description
Go version
go version go1.26-devel_12c8d14d94 linux/amd64
Output of go env in your module/workspace:
[archana@dell-r640-007 sweet]$ go env
AR='ar'
CC='gcc'
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_ENABLED='1'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
CXX='g++'
GCCGO='gccgo'
GO111MODULE=''
GOAMD64='v3'
GOARCH='amd64'
GOAUTH='netrc'
GOBIN=''
GOCACHE='/home/archana/.cache/go-build'
GOCACHEPROG=''
GODEBUG=''
GOENV='/home/archana/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFIPS140='off'
GOFLAGS=''
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build3105282545=/tmp/go-build -gno-record-gcc-switches'
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMOD='/home/archana/benchmarks/go.mod'
GOMODCACHE='/home/archana/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/home/archana/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/lib/golang'
GOSUMDB='sum.golang.org'
GOTELEMETRY='local'
GOTELEMETRYDIR='/home/archana/.config/go/telemetry'
GOTMPDIR=''
GOTOOLCHAIN='local'
GOTOOLDIR='/usr/lib/golang/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.24.6 (Red Hat 1.24.6-1.el10_0)'
GOWORK=''
PKG_CONFIG='pkg-config'
In the Green Tea GC, which is now on by default in the version of Go tested, heapBitsSmallForAddrInline lies on the hot path of scanObjectsSmall because it is invoked within a nested loop.
The first two statements of heapBitsSmallForAddrInline compute values that do not change for a given object size and span, and are hence loop invariant.
Manually hoisting this code out of the inner loop as follows yields gains in some sweet benchmarks across multiple architectures
and does not cause statistically significant regressions in the other benchmarks within sweet:
func scanObjectsSmall(base, objSize uintptr, elems uint16, gcw *gcWork, scans *gc.ObjMask) {
	nptrs := 0
	for i, bits := range scans {
		if i*(goarch.PtrSize*8) > int(elems) {
			break
		}
		n := sys.OnesCount64(uint64(bits))
		hbitsBase, _ := spanHeapBitsRange(base, gc.PageSize, objSize) // <--- hoisted
		hbits := (*byte)(unsafe.Pointer(hbitsBase))                   // <--- hoisted
		for range n {
			j := sys.TrailingZeros64(uint64(bits))
			bits &^= 1 << j
			b := base + uintptr(i*(goarch.PtrSize*8)+j)*objSize
			ptrBits := heapBitsSmallForAddrInlineWithHB(hbits, base, b, objSize)
			...
func heapBitsSmallForAddrInlineWithHB(hbits *byte, spanBase, addr, elemsize uintptr) uintptr {
	// These objects are always small enough that their bitmaps
	// fit in a single word, so just load the word or two we need.
	//
	// Mirrors mspan.writeHeapBitsSmall.
	//
	// We should be using heapBits(), but unfortunately it introduces
	// both bounds checks panics and throw which causes us to exceed
	// the nosplit limit in quite a few cases.
	i := (addr - spanBase) / goarch.PtrSize / ptrBits
	j := (addr - spanBase) / goarch.PtrSize % ptrBits
	bits := elemsize / goarch.PtrSize
	word0 := (*uintptr)(unsafe.Pointer(addb(hbits, goarch.PtrSize*(i+0))))
	word1 := (*uintptr)(unsafe.Pointer(addb(hbits, goarch.PtrSize*(i+1))))
	...
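The same hoisting pattern can be shown in isolation. This is only a minimal sketch and not runtime code: tableBase, lookupRecompute, lookupWithBase, scanRecompute and scanHoisted are hypothetical names, and tableBase merely stands in for a per-span computation that depends only on the span base and object size. The helpers are marked noinline to mimic a callee the compiler declines to inline.

```go
package main

import (
	"fmt"
	"math/bits"
)

// wordBits is the number of bits in one mask word.
const wordBits = 64

// tableBase is a hypothetical stand-in for a lookup that depends only on
// (base, objSize), so it is invariant across the inner loop.
//
//go:noinline
func tableBase(base, objSize uintptr) uintptr {
	return base + objSize*7
}

// lookupRecompute derives the invariant value on every call.
//
//go:noinline
func lookupRecompute(base, addr, objSize uintptr) uintptr {
	tb := tableBase(base, objSize)
	return tb ^ (addr - base)
}

// lookupWithBase receives the invariant value from the caller instead.
//
//go:noinline
func lookupWithBase(tb, base, addr uintptr) uintptr {
	return tb ^ (addr - base)
}

// scanRecompute calls the helper that recomputes the invariant per object.
func scanRecompute(base, objSize uintptr, masks []uint64) uintptr {
	var sum uintptr
	for i, m := range masks {
		for m != 0 {
			j := bits.TrailingZeros64(m)
			m &^= 1 << j
			addr := base + uintptr(i*wordBits+j)*objSize
			sum += lookupRecompute(base, addr, objSize)
		}
	}
	return sum
}

// scanHoisted computes the invariant once per mask word, before the inner loop.
func scanHoisted(base, objSize uintptr, masks []uint64) uintptr {
	var sum uintptr
	for i, m := range masks {
		tb := tableBase(base, objSize) // hoisted out of the inner loop
		for m != 0 {
			j := bits.TrailingZeros64(m)
			m &^= 1 << j
			addr := base + uintptr(i*wordBits+j)*objSize
			sum += lookupWithBase(tb, base, addr)
		}
	}
	return sum
}

func main() {
	masks := []uint64{0b1011, 0, 1 << 63}
	// Both variants visit the same objects and must agree on the result.
	fmt.Println(scanRecompute(0x1000, 16, masks) == scanHoisted(0x1000, 16, masks)) // prints true
}
```

The sketch only demonstrates that the two forms are equivalent; whether the hoisted form is faster in practice has to be measured, as with the sweet results reported here.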
Ideally, the compiler would have moved this invariant code itself.
Perhaps if heapBitsSmallForAddrInline were inlined, the compiler would have been able to
hoist the invariant code out of the loop?
According to the inlining report, the compiler considers this function too expensive to inline:
./mgcmark_greenteagc.go:1091:6: cannot inline heapBitsSmallForAddrInline: function too complex: cost 166 exceeds budget 80
Moving it out of the outer loop entirely performed worse than the current version,
presumably due to the early loop exit.
A similar pattern occurs in heapBitsSmallForAddr, which is called from typePointersOfUnchecked, but it is not yet clear whether
it will be as hot as scanObjectsSmall.