[POC/WIP] Node-local CPU attribution based on SQL Workload Identifiers #156972
base: master
Conversation
petermattis left a comment
I will have further work on the ResourceAttr struct to make it lock-free and enable richer workloadID generation by packing multiple categories/ids into the uint64.
As we were discussing at lunch, we may need to aggregate by different keys for the same operation. For example, I think we'll want CPU usage by tenant, user, resource group, and application. It's a really big design challenge to do this efficiently; packing all of this into a single 64-bit ID might be hard.

On the other hand, users, resource groups, and tenants get created infrequently. You could imagine having some system table where we register workload IDs which map to the aggregation keys for that workload. So workload ID 1 could map to "user A, tenant B, resource group C". The measurement code just records that workload ID 1 used CPU, but periodically (every few seconds) we expand this out to those aggregation keys.

This approach could cause a combinatorial explosion of workload IDs. I haven't thought about this problem much. Do you have other thoughts or better ideas? This feels exactly like a high cardinality metrics problem.
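The "packing multiple categories/ids into the uint64" idea above can be sketched as fixed-width bit fields. This is only an illustration, not the PR's actual layout: the field names and widths (16 bits each for tenant, user, resource group, and application) are assumptions, and a real scheme would likely need the registration-table indirection described above to avoid the cardinality limits of fixed widths.

```go
package main

import "fmt"

// WorkloadID packs four hypothetical aggregation keys into one uint64.
// Layout (illustrative only): tenant | user | resource group | app,
// 16 bits each, most significant first.
type WorkloadID uint64

// MakeWorkloadID combines the four key components into a single ID.
func MakeWorkloadID(tenant, user, resourceGroup, app uint16) WorkloadID {
	return WorkloadID(uint64(tenant)<<48 | uint64(user)<<32 |
		uint64(resourceGroup)<<16 | uint64(app))
}

// Accessors unpack the individual fields again.
func (id WorkloadID) Tenant() uint16        { return uint16(id >> 48) }
func (id WorkloadID) User() uint16          { return uint16(id >> 32) }
func (id WorkloadID) ResourceGroup() uint16 { return uint16(id >> 16) }
func (id WorkloadID) App() uint16           { return uint16(id) }

func main() {
	id := MakeWorkloadID(7, 42, 3, 9)
	fmt.Println(id.Tenant(), id.User(), id.ResourceGroup(), id.App()) // 7 42 3 9
}
```

The trade-off is visible immediately: 16 bits per dimension caps each at 65,536 active values, which is why a lookup table mapping opaque IDs to key tuples may scale better than direct packing.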
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)
tbg left a comment
Nice, great to see you working on this! I didn't look at the plumbing and didn't scrutinize the collector, since you mentioned it's just a first pass that will need to improve. At this point, I'm mostly interested in Peter's question below, i.e. in how the data model will work and how you envision the numbers being crunched and ultimately made available for operator consumption.
- Should I keep the workloadID plumbing and expand it to be complete, or should we explore a context-based propagation mechanism? The latter will be more challenging to make cheap and "always on".
I'm skeptical of context-based plumbing. As you say it's often inefficient, though with the fastValuesCtx maybe that is less of an argument. Still, my preference would be to plumb the ID explicitly.
As we were discussing at lunch, we may need to aggregate by different keys for the same operation. For example, I think we'll want CPU usage by tenant, user, resource group, and application.
This "lookup" problem already exists even in this prototype, but just isn't handled, right? We may have three nodes, and are maintaining some in-memory counters for a given stmtID fingerprint. If the gateway is n1, but most of the execution ends up being on n3 (maybe the lease is there), n3 won't be able to translate this fingerprint back to the SQL statement - only a node that's ever seen the original fingerprint (pre-hashing) can do that.
But if ALL input data gets hashed into the ID (so fingerprint, app, tenant, etc) at the respective gateway node and gateway nodes maintain the ability to (for some time at least) invert that for the fingerprints they created, then if each node periodically reports the information it does have to a replicated table - say
(reporting_node_id, stmt_fingerprint_or_null, ..., app_or_null, stmt_fingerprint_hash, cpu_nanos)
then we should "always" be able to patch things up after the fact.
For the example above (gateway always n1, execution always on n3) we might have two records in total: the one from n3 has only the hash and a significant CPU time measurement. The one from n1 has the hash but also all of the additional information (original fingerprint, app name, etc) and records very little CPU time. But the two can be combined in an additional aggregation step (by hash, coalescing the other "hash input fields") and then we have the record we want.
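The aggregation step described above (group by hash, sum CPU, coalesce the nullable hash-input fields) can be sketched in a few lines. The record shape mirrors the hypothetical replicated-table row from the comment; field names are illustrative, and in practice this would likely be a SQL aggregation over the reported rows rather than in-memory Go.

```go
package main

import "fmt"

// record models one reported row: nodes that only saw the hash leave
// fingerprint/app empty, while the gateway fills them in.
type record struct {
	fingerprint string // "" when the node never saw the original statement
	app         string
	hash        uint64 // hash of all input data (fingerprint, app, tenant, ...)
	cpuNanos    int64
}

// coalesce keeps the first non-empty value, mimicking SQL's COALESCE.
func coalesce(a, b string) string {
	if a != "" {
		return a
	}
	return b
}

// merge aggregates records by hash, summing CPU time and coalescing
// the other "hash input fields".
func merge(recs []record) map[uint64]record {
	out := map[uint64]record{}
	for _, r := range recs {
		m := out[r.hash]
		m.hash = r.hash
		m.cpuNanos += r.cpuNanos
		m.fingerprint = coalesce(m.fingerprint, r.fingerprint)
		m.app = coalesce(m.app, r.app)
		out[r.hash] = m
	}
	return out
}

func main() {
	// Gateway n1 knows the fingerprint but burned little CPU;
	// n3 did the work but only knows the hash.
	recs := []record{
		{fingerprint: "SELECT * FROM t WHERE k = $1", app: "myapp", hash: 0xabc, cpuNanos: 1000},
		{hash: 0xabc, cpuNanos: 9000000},
	}
	m := merge(recs)
	fmt.Println(m[0xabc].fingerprint, m[0xabc].cpuNanos) // SELECT * FROM t WHERE k = $1 9001000
}
```

The coalescing is safe precisely because every non-empty field was an input to the hash: two records with the same hash cannot disagree on those fields (modulo hash collisions).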
This feels exactly like a high cardinality metrics problem.
The "high cardinality" being something like num_active_tenants x num_active_users x num_active_apps x num_active_fingerprints, right? I'd think that "almost always" the stmt fingerprint already determines the other dimensions (it's rare for many tenants wanting to run identical queries), so I'd expect the sum of active fingerprints to dominate most of the time.
Is there anything that's fundamentally new here compared to today's SQL statistics (it's easy to run a workload that never re-uses fingerprints, for example)? I imagine we'll have some limit on each dimension and some "fair" way to drop data when there's more cardinality than we're willing to handle.
@tbg reviewed 43 of 43 files at r1, 12 of 12 files at r2, 6 of 6 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained
This is a POC and contains 3 pieces of functionality combined together:
- a resourceattr package that exposes the ability to record CPU measurements tagged with a workloadID. These are periodically logged as a sorted table every 10 seconds (see screenshot example)
- plumbing of the workloadID through SQL into the KV BatchRequest to enable resource attribution
- hooking up a ResourceAttr instance to Node -> Store -> Replica, using the workloadID attached to the BatchRequest header to record CPU timings with attribution

Result (left: resource-attributed cpu timings; right: workloadID to SQL statement mapping using sql activity):

Notes to reviewers/readers:
- ConstructStatementFingerprintID in pkg/sql/conn_executor_exec.go
- MeasureReqCPUNanos, which is called in pkg/kv/kvserver/replica_send.go
- ResourceAttr is initialized in NewNode in pkg/server/node.go, which kicks off the periodic logging
- Open question: should I keep the workloadID plumbing and expand it to be complete, or should we explore a context-based propagation mechanism? The latter will be more challenging to make cheap and "always on".
- I will have further work on the ResourceAttr struct to make it lock-free and enable richer workloadID generation by packing multiple categories/ids into the uint64.
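For readers who haven't opened the diff, the recording side described in the notes can be approximated by a toy stand-in. This is not the PR's ResourceAttr: the type and method names below are borrowed only to suggest the shape, and where the PR aims for a lock-free structure, this sketch uses the simplest mutex-guarded baseline.

```go
package main

import (
	"fmt"
	"sync"
)

// ResourceAttrSketch is a toy stand-in for the PR's ResourceAttr: it
// accumulates CPU nanos keyed by workloadID. The real struct aims to be
// lock-free; a mutex-guarded map is the simplest correct baseline.
type ResourceAttrSketch struct {
	mu       sync.Mutex
	cpuNanos map[uint64]int64
}

func NewResourceAttrSketch() *ResourceAttrSketch {
	return &ResourceAttrSketch{cpuNanos: map[uint64]int64{}}
}

// MeasureReqCPUNanos mirrors the shape of the hook called from the
// replica request path: attribute a CPU measurement to a workloadID.
func (ra *ResourceAttrSketch) MeasureReqCPUNanos(workloadID uint64, nanos int64) {
	ra.mu.Lock()
	defer ra.mu.Unlock()
	ra.cpuNanos[workloadID] += nanos
}

// Snapshot returns and resets the counters, as the periodic (every 10s)
// logger would before printing its sorted table.
func (ra *ResourceAttrSketch) Snapshot() map[uint64]int64 {
	ra.mu.Lock()
	defer ra.mu.Unlock()
	out := ra.cpuNanos
	ra.cpuNanos = map[uint64]int64{}
	return out
}

func main() {
	ra := NewResourceAttrSketch()
	ra.MeasureReqCPUNanos(1, 500)
	ra.MeasureReqCPUNanos(1, 250)
	fmt.Println(ra.Snapshot()[1]) // 750
}
```

The snapshot-and-reset pattern is what makes the hot path cheap to reason about: the logger takes ownership of the whole map and the writers start fresh, so no entry is ever read and written concurrently.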