What happened:
For streaming responses, Prometheus metrics for token counts were recorded with an empty `model_name` label, while the `target_model_name` label was correctly populated. This corrupts observability data, making it impossible to filter metrics by the public-facing model name.
Additionally, the token counting logic itself was brittle: it only parsed the final message in a stream for the `usage` block, so token counts that appeared in an earlier message were missed.
What you expected to happen:
Metrics for streaming responses should be recorded with all labels, including `model_name` and `target_model_name`, correctly populated from the request context. The token counting logic should also be robust and accumulate usage data from all messages in the stream.
How to reproduce it (as minimally and precisely as possible):
This was discovered when refactoring the hermetic integration tests.
- Send a streaming request (e.g., a chat completion request) where the model name is not present in the top-level JSON body (a request sketch follows this list).
- Ensure the request includes the `x-gateway-api-inference-objective-key` header.
- Observe the `inference_objective_input_tokens_bucket` metric after the request completes.
- The metric will be present, but the `model_name` label will be empty (`model_name=""`).
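For concreteness, here is a minimal Go sketch of such a request, assuming the gateway exposes an OpenAI-style chat completions endpoint; the URL, port, and objective name are illustrative assumptions, not taken from this report:

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// Deliberately no top-level "model" field: the public model name is
	// carried only by the objective header.
	body := `{"messages":[{"role":"user","content":"hi"}],"stream":true}`

	// Hypothetical gateway address and path.
	req, err := http.NewRequest(http.MethodPost,
		"http://localhost:8080/v1/chat/completions", strings.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")
	// Hypothetical objective name; this header is what should populate the
	// model_name metric label.
	req.Header.Set("x-gateway-api-inference-objective-key", "my-objective")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Drain the SSE stream so the gateway records its end-of-response metrics.
	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		fmt.Println(sc.Text())
	}
}
```

After the stream completes, scraping the metrics endpoint should show `inference_objective_input_tokens_bucket` series with `model_name=""` but a populated `target_model_name`.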
Anything else we need to know?:
Root Cause Analysis:
Two related issues were discovered:
- The `director.HandleRequest` function incorrectly overwrites `RequestContext.IncomingModelName` (which is correctly set from the objective header) with a value parsed from the request body's `model` field. For requests like chat completions, this field doesn't exist at the top level, causing `IncomingModelName` to be reset to an empty string. This corrupted context persists for the life of the stream and is used when the final response metrics are recorded (see the sketch after this list).
- The token counting logic in `HandleResponseBodyModelStreaming` only checked the final `[DONE]` message for a `usage` block. It did not accumulate token counts from earlier messages in the stream, making it possible to miss metrics entirely (also sketched below).
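The following Go sketch illustrates the shape of both fixes. Apart from the identifiers quoted above (`RequestContext.IncomingModelName`, the body's `model` field, the `usage` block, the `[DONE]` sentinel), every name here is a hypothetical stand-in, not the project's actual code:

```go
package sketch

import "encoding/json"

// RequestContext stands in for the real request context; only the field
// named in this report is shown.
type RequestContext struct {
	IncomingModelName string
}

// Issue 1 fix shape: only let the body's "model" field override the
// header-derived name when it is actually present, instead of
// unconditionally overwriting it with a possibly empty value.
func setIncomingModelName(reqCtx *RequestContext, bodyModel string) {
	if bodyModel != "" {
		reqCtx.IncomingModelName = bodyModel
	}
}

// Usage mirrors a typical OpenAI-style usage block (an assumption).
type Usage struct {
	PromptTokens     int `json:"prompt_tokens"`
	CompletionTokens int `json:"completion_tokens"`
}

// Issue 2 fix shape: accumulate token counts from every streamed message
// that carries a usage block, rather than inspecting only the final
// [DONE] message. msg is one SSE payload with its "data: " prefix removed.
func accumulateUsage(total *Usage, msg []byte) {
	var chunk struct {
		Usage *Usage `json:"usage"`
	}
	// Messages without a usage block, or non-JSON payloads such as the
	// [DONE] sentinel, are skipped.
	if err := json.Unmarshal(msg, &chunk); err != nil || chunk.Usage == nil {
		return
	}
	total.PromptTokens += chunk.Usage.PromptTokens
	total.CompletionTokens += chunk.Usage.CompletionTokens
}
```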
Production Risk / Impact:
The production risk is high: this bug corrupts observability data for common streaming use cases. Metrics for streaming token usage are recorded with an empty `model_name` label, making it impossible to accurately filter, aggregate, or alert on a per-model basis, potentially breaking monitoring, billing, and capacity planning.
Environment:
- Discovered during hermetic integration test refactoring.