[GuideLLM Refactor] Updates and Fixes for benchmark outputs, schemas, and stats calculations #442
base: main
Conversation
Pull Request Overview
This PR performs a significant refactoring to reorganize the codebase's statistics, schemas, and benchmarking components. The main changes consolidate utilities into schemas, remove deprecated presentation modules, and restructure the test suite to align with the new organization.
Key changes:
- Moved statistics utilities from `utils/statistics.py` to `schemas/statistics.py` and updated them to use probability density functions (PDFs) instead of cumulative distribution functions (CDFs); see the sketch after this list
- Relocated pydantic utilities from `utils/pydantic_utils.py` to `schemas/base.py`
- Removed deprecated presentation modules (`injector.py`, `data_models.py`, `builder.py`)
- Completely rewrote the statistics tests to use parametrized fixtures and broader distribution coverage
- Added new console utilities for formatted table printing
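To make the PDF-based approach concrete, here is a minimal sketch (not the PR's actual implementation; `pdf_percentile` and its signature are hypothetical) of computing a percentile from a discrete probability density rather than a prebuilt CDF:

```python
import numpy as np

def pdf_percentile(values: np.ndarray, weights: np.ndarray, q: float) -> float:
    """Percentile from a discrete PDF: sort values, normalize weights into
    probability mass, accumulate it, and return the first value whose
    cumulative mass reaches q."""
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    pdf = weights / weights.sum()   # discrete probability density
    cumulative = np.cumsum(pdf)     # running probability mass
    return float(values[np.searchsorted(cumulative, q)])

# Example: weighted median, where weights might be observation durations
print(pdf_percentile(np.array([1.0, 2.0, 4.0]), np.array([1.0, 1.0, 2.0]), 0.5))  # 2.0
```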
Reviewed Changes
Copilot reviewed 59 out of 61 changed files in this pull request and generated 15 comments.
Summary per file:
| File | Description |
|---|---|
| tests/unit/utils/test_statistics.py | Complete rewrite with parametrized fixtures testing multiple probability distributions |
| tests/unit/utils/test_pydantic_utils.py | Updated import path from utils to schemas |
| tests/unit/presentation/* | Removed deprecated presentation test files |
| tests/unit/mock_benchmark.py | Updated class name from BenchmarkSchedulerStats to BenchmarkSchedulerMetrics |
| src/guidellm/utils/statistics.py | Deleted - moved to schemas/statistics.py |
| src/guidellm/schemas/statistics.py | New file with refactored statistics using PDF-based approach |
| src/guidellm/schemas/base.py | New file consolidating pydantic utilities |
| src/guidellm/utils/console.py | Enhanced with table printing capabilities and improved documentation |
| src/guidellm/utils/functions.py | Added safe_format_number utility (see the sketch after this table) |
| Multiple schema files | Updated imports and restructured benchmark schemas |
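For illustration, a plausible shape for the new `safe_format_number` helper (the actual signature in `src/guidellm/utils/functions.py` may differ; the `precision` and `default` parameters are assumptions):

```python
import math

def safe_format_number(value, precision: int = 2, default: str = "--") -> str:
    """Hypothetical sketch: render a number at fixed precision, falling back
    to a placeholder for None/NaN/inf/unparseable values so table cells
    never raise mid-render."""
    if value is None:
        return default
    try:
        number = float(value)
    except (TypeError, ValueError):
        return default
    if math.isnan(number) or math.isinf(number):
        return default
    return f"{number:.{precision}f}"
```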
Comments suppressed due to low confidence (3)
src/guidellm/data/preprocessors/formatters.py:44
- This class does not call `RequestFormatter.__init__` during initialization (`GenerativeTextCompletionsRequestFormatter.__init__` may be missing a call to the base class `__init__`).
`class GenerativeTextCompletionsRequestFormatter(RequestFormatter):`
src/guidellm/data/preprocessors/formatters.py:118
- This class does not call `RequestFormatter.__init__` during initialization (`GenerativeChatCompletionsRequestFormatter.__init__` may be missing a call to the base class `__init__`).
`class GenerativeChatCompletionsRequestFormatter(RequestFormatter):`
src/guidellm/data/preprocessors/formatters.py:307
- This class does not call `RequestFormatter.__init__` during initialization (`GenerativeAudioTranscriptionRequestFormatter.__init__` may be missing a call to the base class `__init__`).
`class GenerativeAudioTranscriptionRequestFormatter(RequestFormatter):`
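The usual resolution for these warnings is to chain to the base initializer; a minimal sketch (the `RequestFormatter` stub and the `model` field here are illustrative, not the real signatures):

```python
class RequestFormatter:  # stand-in for the real base class
    def __init__(self, **kwargs):
        self.extras = kwargs

class GenerativeTextCompletionsRequestFormatter(RequestFormatter):
    def __init__(self, model: str, **kwargs):
        super().__init__(**kwargs)  # chain to the base class initializer
        self.model = model
```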
Ran this benchmark:

```bash
guidellm benchmark \
  --target http://vllm-standalone-granite-3-2b.llmd.svc.cluster.local \
  --data "prompt_tokens=4096,prompt_tokens_stdev=512,prompt_tokens_min=2048,prompt_tokens_max=6144,output_tokens=512,output_tokens_stdev=128,output_tokens_min=1,output_tokens_max=1024" \
  --max-seconds 10 \
  --profile concurrent \
  --rate 10
```

And got the following error:

```
...
  File "/root/guidellm/src/guidellm/benchmark/benchmarker.py", line 161, in run
    benchmark = benchmark_class.compile(
        accumulator=accumulator,
        scheduler_state=scheduler_state,
    )
  File "/root/guidellm/src/guidellm/benchmark/schemas/generative/benchmark.py", line 134, in compile
    metrics=GenerativeMetrics.compile(accumulator),
            ~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
  File "/root/guidellm/src/guidellm/benchmark/schemas/generative/metrics.py", line 797, in compile
    incomplete = accumulator.incomplete.get_within_range(start_time, end_time)
  File "/root/guidellm/src/guidellm/benchmark/schemas/generative/accumulator.py", line 623, in get_within_range
    if (stats.request_end_time >= start_time)
        ^^^^^^^^^^^^^^^^^^^^^^
  File "/root/guidellm/src/guidellm/schemas/request_stats.py", line 81, in request_end_time
    raise ValueError("resolve_end timings should be set but is None.")
ValueError: resolve_end timings should be set but is None.
```

Edit: Seems to be due to the max-seconds constraint.
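One plausible shape for a fix, sketched here with hypothetical names (only `resolve_end` and `request_end_time` appear in the traceback; everything else is assumed), is to filter out requests that never recorded an end time before comparing timestamps:

```python
def get_within_range(requests, start_time: float, end_time: float):
    # Hypothetical guard: requests cut off by --max-seconds never get a
    # resolve_end timestamp, so skip them instead of letting the
    # request_end_time property raise the ValueError shown above.
    return [
        stats
        for stats in requests
        if stats.resolve_end is not None
        and start_time <= stats.request_end_time <= end_time
    ]
```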
Aside from the max-seconds bug, here are some high-level notes:
- Needs signoff.
- Needs cleanup. To retroactively run pre-commit on only the changes: `pre-commit run --from $(git merge-base main@{u} HEAD) --to HEAD`
- Only glanced over the accumulator and statistics code. Will do a more in-depth review time permitting, but don't block on it.
- Is warmup/cooldown only seconds now, and not sometimes a percent? Setting `--warmup .1` results in a table entry of `Warm Sec: 0.1`. (One possible convention is sketched below.)
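If both behaviors are meant to be supported, one common convention (purely a sketch of the reviewer's expectation, not the PR's actual behavior; `resolve_warmup_seconds` is hypothetical) is to treat values below 1 as a fraction of the run and values of 1 or more as seconds:

```python
def resolve_warmup_seconds(warmup: float, run_seconds: float) -> float:
    """Hypothetical disambiguation: `--warmup .1` would mean 10% of the
    run's duration, while `--warmup 30` would mean 30 seconds."""
    return warmup * run_seconds if warmup < 1 else warmup

print(resolve_warmup_seconds(0.1, 600))  # 60.0 seconds
print(resolve_warmup_seconds(30, 600))   # 30.0 seconds
```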
Force-pushed from 73e7539 to 6c7f133.
jaredoconnell left a comment:
I can confirm that the max-seconds error is fixed. I'll do more testing tomorrow, but it works for me.
The new CLI table outputs are a little busy to look at, but they work. Adding more horizontal padding may help, but it isn't necessary.
jaredoconnell left a comment:
This doesn't have anything obvious that I am going to block on. We just need to fix that test dependency.
LGTM. Either land #440 first for the test fixes, or ignore.
Commits:
- …or the latest state of refactor (Signed-off-by: Mark Kurtz <mark.kurtz@neuralmagic.com>)
- …ons, metrics, and outputs (Signed-off-by: Mark Kurtz <mark.kurtz@neuralmagic.com>)
- Co-authored-by: Samuel Monson <smonson@redhat.com>; Signed-off-by: Mark Kurtz <mark.j.kurtz@gmail.com>; Signed-off-by: Mark Kurtz <mark.kurtz@neuralmagic.com>
- …d during testing and review (Signed-off-by: Mark Kurtz <mark.kurtz@neuralmagic.com>)
- Plus several additional commits, each Signed-off-by: Mark Kurtz <mark.kurtz@neuralmagic.com>
Force-pushed from ab6f2eb to e6af4b2.
Signed-off-by: Mark Kurtz <mark.kurtz@neuralmagic.com>
Signed-off-by: Jared O'Connell <joconnel@redhat.com>
jaredoconnell left a comment:
For the throughput + ramp-up case, is it expected that it jumps from very few requests (one or two per second) to full throughput, rather than scaling linearly? (A linear ramp is sketched below for reference.)
No issues stood out during my most recent code review. We just need to clarify this and fix the tests.
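For reference, the linear scaling the question describes would look roughly like this (a sketch of the expectation only; `rampup_rate` is hypothetical and not GuideLLM's scheduler):

```python
def rampup_rate(elapsed_s: float, rampup_s: float, target_rate: float) -> float:
    """Hypothetical linear ramp: scale the request rate with elapsed time
    until the ramp window closes, then hold at the target rate."""
    if elapsed_s >= rampup_s:
        return target_rate
    return target_rate * (elapsed_s / rampup_s)

print(rampup_rate(5, 20, 10))   # 2.5 requests/sec a quarter of the way in
print(rampup_rate(20, 20, 10))  # 10.0 requests/sec once the ramp completes
```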
Summary
Details
Test Plan
Related Issues
Use of AI
## WRITTEN BY AI ##