
Conversation

@dhartunian
Collaborator

This is a POC that combines three pieces of functionality:

  1. Adds a resourceattr package that exposes the ability to record CPU measurements tagged with a workloadID. These are logged every 10 seconds as a sorted table (see the screenshot and the API sketch below)
  2. Plumbs a workloadID through SQL into the KV BatchRequest to enable resource attribution
  3. Adds a ResourceAttr instance to Node -> Store -> Replica and uses the workloadID attached to the BatchRequest header to record CPU timings with attribution.

Result (left: resource-attributed cpu timings; right: workloadID to SQL statement mapping using sql activity):
[screenshot: 2025-11-05 16:14:25]
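For readers who want the shape of (1) and (3) without opening the diff, here is a minimal sketch of the recording path. The names mirror the description above, but the signatures are illustrative, not the actual package surface:

```go
// Minimal sketch of the recording path; the real signatures in the PR may differ.
package resourceattr

import (
	"sync"
	"time"
)

// ResourceAttr accumulates CPU time keyed by workloadID. A lock-free
// version is planned; this sketch just uses a mutex.
type ResourceAttr struct {
	mu  sync.Mutex
	cpu map[uint64]time.Duration // workloadID -> cumulative CPU time
}

func NewResourceAttr() *ResourceAttr {
	return &ResourceAttr{cpu: make(map[uint64]time.Duration)}
}

// RecordCPU attributes one CPU measurement to workloadID. Something like
// this is what the replica send path would call, with the workloadID taken
// from the BatchRequest header.
func (r *ResourceAttr) RecordCPU(workloadID uint64, d time.Duration) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.cpu[workloadID] += d
}
```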


Notes to reviewers/readers:

  • The WorkloadID is generated on the SQL side via ConstructStatementFingerprintID in pkg/sql/conn_executor_exec.go
  • The CPU recording happens in MeasureReqCPUNanos, which is called from pkg/kv/kvserver/replica_send.go
  • The ResourceAttr is initialized in NewNode in pkg/server/node.go, which kicks off the periodic logging
  • Should I keep the workloadID plumbing and expand it to be complete, or should we explore a context-based propagation mechanism? The latter will be more challenging to make cheap and "always on".
  • I will have further work on the ResourceAttr struct to make it lock-free and enable richer workloadID generation by packing multiple categories/IDs into the uint64 (a rough packing sketch follows these notes).
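As a concrete illustration of the packing idea in the last bullet, a hypothetical uint64 layout; the 8/56 split is a made-up example, not a decided scheme:

```go
// Hypothetical uint64 layout: high bits carry a category, low bits a
// per-category ID. The 8/56 split is an assumption, not a decision.
package resourceattr

const (
	categoryBits = 8                 // e.g. SQL statement vs. background job vs. system
	idBits       = 64 - categoryBits // remaining bits for the per-category ID
	idMask       = uint64(1)<<idBits - 1
)

// PackWorkloadID combines a category and a per-category ID into one uint64.
func PackWorkloadID(category uint8, id uint64) uint64 {
	return uint64(category)<<idBits | id&idMask
}

// UnpackWorkloadID reverses PackWorkloadID.
func UnpackWorkloadID(w uint64) (category uint8, id uint64) {
	return uint8(w >> idBits), w & idMask
}
```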

@dhartunian added the do-not-merge label (bors won't merge a PR with this label) on Nov 5, 2025

@dhartunian requested a review from tbg on November 6, 2025 18:10
@petermattis (Collaborator) left a comment

> I will have further work on the ResourceAttr struct to make it lock-free and enable richer workloadID generation by packing multiple categories/ids into the uint64.

As we were discussing at lunch, we may need to aggregate by different keys for the same operation. For example, I think we'll want CPU usage by tenant, user, resource group, and application. Really big design challenge for how to do this efficiently: packing this all into a single 64-bit ID might be hard. On the other hand, users, resource groups, and tenants get created infrequently.

You could imagine having some system table where we register workload IDs which map to the aggregation keys for that workload. So workload ID 1 could map to "user A, tenant B, resource group C". The measurement code just records that workload ID 1 used CPU, but periodically (every few seconds) we expand this out to those aggregation keys. This approach could cause a combinatorial explosion of workload IDs. I haven't thought about this problem much. Do you have other thoughts or better ideas? This feels exactly like a high cardinality metrics problem.
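A minimal sketch of that registration scheme, assuming a plain map stands in for the system table and only tenant/user are expanded (all names hypothetical):

```go
// Hot path records only a workloadID; a periodic slow path expands the
// counts into whatever aggregation keys were registered for that ID.
package main

import "fmt"

// AggKeys is what a registered workload ID expands to.
type AggKeys struct {
	Tenant, User, ResourceGroup, App string
}

// registry stands in for the proposed system table.
var registry = map[uint64]AggKeys{
	1: {Tenant: "B", User: "A", ResourceGroup: "C", App: "myapp"},
}

// expand folds per-workloadID CPU nanos into per-dimension totals.
func expand(cpuByWorkload map[uint64]int64) (byTenant, byUser map[string]int64) {
	byTenant, byUser = map[string]int64{}, map[string]int64{}
	for id, nanos := range cpuByWorkload {
		keys, ok := registry[id]
		if !ok {
			continue // unregistered IDs could land in an "unknown" bucket
		}
		byTenant[keys.Tenant] += nanos
		byUser[keys.User] += nanos
	}
	return byTenant, byUser
}

func main() {
	byTenant, byUser := expand(map[uint64]int64{1: 5_000_000})
	fmt.Println(byTenant, byUser) // map[B:5000000] map[A:5000000]
}
```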

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)

@tbg (Member) left a comment

Nice, great to see you working on this! I didn't look at the plumbing and didn't scrutinize the collector, since you mentioned it's just a first pass that will need improvement. At this point, I'm mostly interested in Peter's question below, i.e. in how the data model will work, how you envision the numbers being crunched, and how they're ultimately made available for operator consumption.

> Should I keep the workloadID plumbing and expand it to be complete, or should we explore a context-based propagation mechanism? The latter will be more challenging to make cheap and "always on".

I'm skeptical of context-based plumbing. As you say, it's often inefficient, though with the fastValuesCtx maybe that is less of an argument. Still, my preference would be to plumb the ID explicitly.

> As we were discussing at lunch, we may need to aggregate by different keys for the same operation. For example, I think we'll want CPU usage by tenant, user, resource group, and application

This "lookup" problem already exists even in this prototype, but just isn't handled, right? We may have three nodes, and are maintaining some in-memory counters for a given stmtID fingerprint. If the gateway is n1, but most of the execution ends up being on n3 (maybe the lease is there), n3 won't be able to translate this fingerprint back to the SQL statement - only a node that's ever seen the original fingerprint (pre-hashing) can do that.

But if ALL input data gets hashed into the ID (so fingerprint, app, tenant, etc.) at the respective gateway node, and gateway nodes maintain the ability (for some time at least) to invert that for the fingerprints they created, then if each node periodically reports the information it does have to a replicated table, say

`(reporting_node_id, stmt_fingerprint_or_null, ..., app_or_null, stmt_fingerprint_hash, cpu_nanos)`

then we should "always" be able to patch things up after the fact.

For the example above (gateway always n1, execution always on n3) we might have two records in total: the one from n3 has only the hash and a significant CPU time measurement. The one from n1 has the hash but also all of the additional information (original fingerprint, app name, etc) and records very little CPU time. But the two can be combined in an additional aggregation step (by hash, coalescing the other "hash input fields") and then we have the record we want.
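To make the two steps concrete, a sketch under assumed names; the hash function choice and field set are illustrative, not a proposal for the actual schema:

```go
// Gateway-side hashing plus the later "patch up" step.
package attribution

import "hash/fnv"

// workloadHash mixes all attribution inputs into one 64-bit ID at the
// gateway; remote nodes only ever see the hash.
func workloadHash(fingerprint, app, tenant string) uint64 {
	h := fnv.New64a()
	for _, s := range []string{fingerprint, app, tenant} {
		h.Write([]byte(s))
		h.Write([]byte{0}) // field separator to avoid ambiguity
	}
	return h.Sum64()
}

// record mirrors a row of the replicated table above; nil means the
// reporting node didn't know that field.
type record struct {
	fingerprint, app *string
	hash             uint64
	cpuNanos         int64
}

// coalesce merges records by hash: CPU times add up, and each nullable
// field takes whichever report knew it (typically the gateway's).
func coalesce(recs []record) map[uint64]record {
	out := make(map[uint64]record)
	for _, r := range recs {
		m := out[r.hash]
		m.hash = r.hash
		m.cpuNanos += r.cpuNanos
		if m.fingerprint == nil {
			m.fingerprint = r.fingerprint
		}
		if m.app == nil {
			m.app = r.app
		}
		out[r.hash] = m
	}
	return out
}
```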

> This feels exactly like a high cardinality metrics problem.

The "high cardinality" being something like num_active_tenants x num_active_users x num_active_apps x num_active_fingerprints, right? I'd think that "almost always" the stmt fingerprint already determines the other dimensions (it's rare for many tenants wanting to run identical queries), so I'd expect the sum of active fingerprints to dominate most of the time.
Is there anything that's fundamentally new here compared to today's SQL statistics (it's easy to run a workload that never re-uses fingerprints, for example)? I imagine we'll have some limit on each dimension and some "fair" way to drop data when there's more cardinality than we're willing to handle.
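For concreteness, the simplest shape such a limit could take, similar in spirit to the existing SQL stats fingerprint limits; the constants and catch-all convention here are made up:

```go
// Simplest shape of a per-dimension cap: once the map is full, new IDs
// fold into a reserved catch-all bucket.
package attribution

const (
	maxTrackedWorkloads = 10000
	overflowWorkloadID  = uint64(0) // reserved catch-all ID
)

// recordCapped attributes nanos to id, or to the catch-all bucket when the
// map is at capacity and id isn't already tracked.
func recordCapped(cpu map[uint64]int64, id uint64, nanos int64) {
	if _, ok := cpu[id]; !ok && len(cpu) >= maxTrackedWorkloads {
		id = overflowWorkloadID
	}
	cpu[id] += nanos
}
```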

@tbg reviewed 43 of 43 files at r1, 12 of 12 files at r2, 6 of 6 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained
