[POC/WIP] Node-local CPU attribution based on SQL Workload Identifiers #156972
base: master
Conversation
petermattis left a comment
I will have further work on the ResourceAttr struct to make it lock-free and enable richer workloadID generation by packing multiple categories/ids into the uint64.
As we were discussing at lunch, we may need to aggregate by different keys for the same operation. For example, I think we'll want CPU usage by tenant, user, resource group, and application. It's a really big design challenge to do this efficiently; packing all of this into a single 64-bit ID might be hard.

On the other hand, users, resource groups, and tenants get created infrequently. You could imagine having some system table where we register workload IDs which map to the aggregation keys for that workload. So workload ID 1 could map to "user A, tenant B, resource group C". The measurement code just records that workload ID 1 used CPU, but periodically (every few seconds) we expand this out to those aggregation keys.

This approach could cause a combinatorial explosion of workload IDs. I haven't thought about this problem much. Do you have other thoughts or better ideas? This feels exactly like a high cardinality metrics problem.
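The "packing multiple categories/ids into the uint64" idea above can be sketched as fixed-width bit fields. This is only an illustration, not the PR's actual layout: the field names and widths (16 bits each for tenant, user, resource group, and application) are assumptions, and a real scheme would likely need the registration-table indirection described above to avoid the cardinality limits of fixed widths.

```go
package main

import "fmt"

// WorkloadID packs four hypothetical aggregation keys into one uint64.
// Layout (illustrative only): tenant | user | resource group | app,
// 16 bits each, most significant first.
type WorkloadID uint64

// MakeWorkloadID combines the four key components into a single ID.
func MakeWorkloadID(tenant, user, resourceGroup, app uint16) WorkloadID {
	return WorkloadID(uint64(tenant)<<48 | uint64(user)<<32 |
		uint64(resourceGroup)<<16 | uint64(app))
}

// Accessors unpack the individual fields again.
func (id WorkloadID) Tenant() uint16        { return uint16(id >> 48) }
func (id WorkloadID) User() uint16          { return uint16(id >> 32) }
func (id WorkloadID) ResourceGroup() uint16 { return uint16(id >> 16) }
func (id WorkloadID) App() uint16           { return uint16(id) }

func main() {
	id := MakeWorkloadID(7, 42, 3, 9)
	fmt.Println(id.Tenant(), id.User(), id.ResourceGroup(), id.App()) // 7 42 3 9
}
```

The trade-off is visible immediately: 16 bits per dimension caps each at 65,536 active values, which is why a lookup table mapping opaque IDs to key tuples may scale better than direct packing.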
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)
tbg left a comment
Nice, great to see you working on this! I didn't look at the plumbing and didn't scrutinize the collector, since you mentioned it's just a first pass that will need to improve. At this point, I'm mostly interested in Peter's question below, i.e. in how the data model will work and how you envision the numbers being crunched and ultimately made available for operator consumption.
- Should I keep the workloadID plumbing and expand it to be complete, or should we explore a context-based propagation mechanism? The latter will be more challenging to make cheap and "always on".
I'm skeptical of context-based plumbing. As you say it's often inefficient, though with the fastValuesCtx maybe that is less of an argument. Still, my preference would be to plumb the ID explicitly.
As we were discussing at lunch, we may need to aggregate by different keys for the same operation. For example, I think we'll want CPU usage by tenant, user, resource group, and application.
This "lookup" problem already exists even in this prototype, but just isn't handled, right? We may have three nodes, and are maintaining some in-memory counters for a given stmtID fingerprint. If the gateway is n1, but most of the execution ends up being on n3 (maybe the lease is there), n3 won't be able to translate this fingerprint back to the SQL statement - only a node that's ever seen the original fingerprint (pre-hashing) can do that.
But if ALL input data gets hashed into the ID (so fingerprint, app, tenant, etc) at the respective gateway node and gateway nodes maintain the ability to (for some time at least) invert that for the fingerprints they created, then if each node periodically reports the information it does have to a replicated table - say
(reporting_node_id, stmt_fingerprint_or_null, ..., app_or_null, stmt_fingerprint_hash, cpu_nanos)
then we should "always" be able to patch things up after the fact.
For the example above (gateway always n1, execution always on n3) we might have two records in total: the one from n3 has only the hash and a significant CPU time measurement. The one from n1 has the hash but also all of the additional information (original fingerprint, app name, etc) and records very little CPU time. But the two can be combined in an additional aggregation step (by hash, coalescing the other "hash input fields") and then we have the record we want.
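The aggregation step described above (group by hash, sum CPU, coalesce the nullable hash-input fields) can be sketched in a few lines. The record shape mirrors the hypothetical replicated-table row from the comment; field names are illustrative, and in practice this would likely be a SQL aggregation over the reported rows rather than in-memory Go.

```go
package main

import "fmt"

// record models one reported row: nodes that only saw the hash leave
// fingerprint/app empty, while the gateway fills them in.
type record struct {
	fingerprint string // "" when the node never saw the original statement
	app         string
	hash        uint64 // hash of all input data (fingerprint, app, tenant, ...)
	cpuNanos    int64
}

// coalesce keeps the first non-empty value, mimicking SQL's COALESCE.
func coalesce(a, b string) string {
	if a != "" {
		return a
	}
	return b
}

// merge aggregates records by hash, summing CPU time and coalescing
// the other "hash input fields".
func merge(recs []record) map[uint64]record {
	out := map[uint64]record{}
	for _, r := range recs {
		m := out[r.hash]
		m.hash = r.hash
		m.cpuNanos += r.cpuNanos
		m.fingerprint = coalesce(m.fingerprint, r.fingerprint)
		m.app = coalesce(m.app, r.app)
		out[r.hash] = m
	}
	return out
}

func main() {
	// Gateway n1 knows the fingerprint but burned little CPU;
	// n3 did the work but only knows the hash.
	recs := []record{
		{fingerprint: "SELECT * FROM t WHERE k = $1", app: "myapp", hash: 0xabc, cpuNanos: 1000},
		{hash: 0xabc, cpuNanos: 9000000},
	}
	m := merge(recs)
	fmt.Println(m[0xabc].fingerprint, m[0xabc].cpuNanos) // SELECT * FROM t WHERE k = $1 9001000
}
```

The coalescing is safe precisely because every non-empty field was an input to the hash: two records with the same hash cannot disagree on those fields (modulo hash collisions).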
This feels exactly like a high cardinality metrics problem.
The "high cardinality" being something like num_active_tenants x num_active_users x num_active_apps x num_active_fingerprints, right? I'd think that "almost always" the stmt fingerprint already determines the other dimensions (it's rare for many tenants wanting to run identical queries), so I'd expect the sum of active fingerprints to dominate most of the time.
Is there anything that's fundamentally new here compared to today's SQL statistics (it's easy to run a workload that never re-uses fingerprints, for example)? I imagine we'll have some limit on each dimension and some "fair" way to drop data when there's more cardinality than we're willing to handle.
@tbg reviewed 43 of 43 files at r1, 12 of 12 files at r2, 6 of 6 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained
This is a POC and contains 3 pieces of functionality combined together:
- a resourceattr package that exposes the ability to record CPU measurements tagged with a workloadID. These are periodically logged as a sorted table every 10 seconds (see screenshot example)
- plumbing of the workloadID through SQL into the KV BatchRequest to enable resource attribution
- hooking up a ResourceAttr instance to Node -> Store -> Replica, using the workloadID attached to the BatchRequest header to record CPU timings with attribution

Result (left: resource-attributed cpu timings; right: workloadID to SQL statement mapping using sql activity):

Notes to reviewers/readers:
- ConstructStatementFingerprintID in pkg/sql/conn_executor_exec.go
- MeasureReqCPUNanos, which is called in pkg/kv/kvserver/replica_send.go
- ResourceAttr is initialized in NewNode in pkg/server/node.go, which kicks off the periodic logging
- Open question: should I keep the workloadID plumbing and expand it to be complete, or should we explore a context-based propagation mechanism? The latter will be more challenging to make cheap and "always on".
- I will have further work on the ResourceAttr struct to make it lock-free and enable richer workloadID generation by packing multiple categories/ids into the uint64.
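For readers who haven't opened the diff, the recording side described in the notes can be approximated by a toy stand-in. This is not the PR's ResourceAttr: the type and method names below are borrowed only to suggest the shape, and where the PR aims for a lock-free structure, this sketch uses the simplest mutex-guarded baseline.

```go
package main

import (
	"fmt"
	"sync"
)

// ResourceAttrSketch is a toy stand-in for the PR's ResourceAttr: it
// accumulates CPU nanos keyed by workloadID. The real struct aims to be
// lock-free; a mutex-guarded map is the simplest correct baseline.
type ResourceAttrSketch struct {
	mu       sync.Mutex
	cpuNanos map[uint64]int64
}

func NewResourceAttrSketch() *ResourceAttrSketch {
	return &ResourceAttrSketch{cpuNanos: map[uint64]int64{}}
}

// MeasureReqCPUNanos mirrors the shape of the hook called from the
// replica request path: attribute a CPU measurement to a workloadID.
func (ra *ResourceAttrSketch) MeasureReqCPUNanos(workloadID uint64, nanos int64) {
	ra.mu.Lock()
	defer ra.mu.Unlock()
	ra.cpuNanos[workloadID] += nanos
}

// Snapshot returns and resets the counters, as the periodic (every 10s)
// logger would before printing its sorted table.
func (ra *ResourceAttrSketch) Snapshot() map[uint64]int64 {
	ra.mu.Lock()
	defer ra.mu.Unlock()
	out := ra.cpuNanos
	ra.cpuNanos = map[uint64]int64{}
	return out
}

func main() {
	ra := NewResourceAttrSketch()
	ra.MeasureReqCPUNanos(1, 500)
	ra.MeasureReqCPUNanos(1, 250)
	fmt.Println(ra.Snapshot()[1]) // 750
}
```

The snapshot-and-reset pattern is what makes the hot path cheap to reason about: the logger takes ownership of the whole map and the writers start fresh, so no entry is ever read and written concurrently.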