[GuideLLM Refactor] Updates and Fixes for benchmark outputs, schemas, and stats calculations #442
base: main
Conversation
Pull Request Overview
This PR performs a significant refactoring to reorganize the codebase's statistics, schemas, and benchmarking components. The main changes consolidate utilities into schemas, remove deprecated presentation modules, and restructure the test suite to align with the new organization.
Key changes:
- Moved statistics utilities from `utils/statistics.py` to `schemas/statistics.py` and updated them to use probability density functions (PDFs) instead of cumulative distribution functions (CDFs); see the sketch after this list
- Relocated pydantic utilities from `utils/pydantic_utils.py` to `schemas/base.py`
- Removed deprecated presentation modules (`injector.py`, `data_models.py`, `builder.py`)
- Completely rewrote the statistics tests to use parametrized fixtures and broader distribution coverage
- Added new console utilities for formatted table printing
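To make the PDF-based approach concrete, here is a minimal sketch (not the PR's actual implementation; `pdf_percentile` and its signature are hypothetical) of computing a percentile from a discrete probability density rather than a prebuilt CDF:

```python
import numpy as np

def pdf_percentile(values: np.ndarray, weights: np.ndarray, q: float) -> float:
    """Percentile from a discrete PDF: sort values, normalize weights into
    probability mass, accumulate it, and return the first value whose
    cumulative mass reaches q."""
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    pdf = weights / weights.sum()   # discrete probability density
    cumulative = np.cumsum(pdf)     # running probability mass
    return float(values[np.searchsorted(cumulative, q)])

# Example: weighted median, where weights might be observation durations
print(pdf_percentile(np.array([1.0, 2.0, 4.0]), np.array([1.0, 1.0, 2.0]), 0.5))  # 2.0
```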
Reviewed Changes
Copilot reviewed 59 out of 61 changed files in this pull request and generated 15 comments.
Summary per file:
| File | Description |
|---|---|
| tests/unit/utils/test_statistics.py | Complete rewrite with parametrized fixtures testing multiple probability distributions |
| tests/unit/utils/test_pydantic_utils.py | Updated import path from utils to schemas |
| tests/unit/presentation/* | Removed deprecated presentation test files |
| tests/unit/mock_benchmark.py | Updated class name from BenchmarkSchedulerStats to BenchmarkSchedulerMetrics |
| src/guidellm/utils/statistics.py | Deleted - moved to schemas/statistics.py |
| src/guidellm/schemas/statistics.py | New file with refactored statistics using PDF-based approach |
| src/guidellm/schemas/base.py | New file consolidating pydantic utilities |
| src/guidellm/utils/console.py | Enhanced with table printing capabilities and improved documentation |
| src/guidellm/utils/functions.py | Added safe_format_number utility (see the sketch after this table) |
| Multiple schema files | Updated imports and restructured benchmark schemas |
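For illustration, a plausible shape for the new `safe_format_number` helper (the actual signature in `src/guidellm/utils/functions.py` may differ; the `precision` and `default` parameters are assumptions):

```python
import math

def safe_format_number(value, precision: int = 2, default: str = "--") -> str:
    """Hypothetical sketch: render a number at fixed precision, falling back
    to a placeholder for None/NaN/inf/unparseable values so table cells
    never raise mid-render."""
    if value is None:
        return default
    try:
        number = float(value)
    except (TypeError, ValueError):
        return default
    if math.isnan(number) or math.isinf(number):
        return default
    return f"{number:.{precision}f}"
```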
Comments suppressed due to low confidence (3)
src/guidellm/data/preprocessors/formatters.py:44
- This class does not call `RequestFormatter.__init__` during initialization (`GenerativeTextCompletionsRequestFormatter.__init__` may be missing a call to the base class `__init__`).
`class GenerativeTextCompletionsRequestFormatter(RequestFormatter):`
src/guidellm/data/preprocessors/formatters.py:118
- This class does not call `RequestFormatter.__init__` during initialization (`GenerativeChatCompletionsRequestFormatter.__init__` may be missing a call to the base class `__init__`).
`class GenerativeChatCompletionsRequestFormatter(RequestFormatter):`
src/guidellm/data/preprocessors/formatters.py:307
- This class does not call `RequestFormatter.__init__` during initialization (`GenerativeAudioTranscriptionRequestFormatter.__init__` may be missing a call to the base class `__init__`).
`class GenerativeAudioTranscriptionRequestFormatter(RequestFormatter):`
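The usual resolution for these warnings is to chain to the base initializer; a minimal sketch (the `RequestFormatter` stub and the `model` field here are illustrative, not the real signatures):

```python
class RequestFormatter:  # stand-in for the real base class
    def __init__(self, **kwargs):
        self.extras = kwargs

class GenerativeTextCompletionsRequestFormatter(RequestFormatter):
    def __init__(self, model: str, **kwargs):
        super().__init__(**kwargs)  # chain to the base class initializer
        self.model = model
```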
Ran this benchmark:

```bash
guidellm benchmark \
  --target http://vllm-standalone-granite-3-2b.llmd.svc.cluster.local \
  --data "prompt_tokens=4096,prompt_tokens_stdev=512,prompt_tokens_min=2048,prompt_tokens_max=6144,output_tokens=512,output_tokens_stdev=128,output_tokens_min=1,output_tokens_max=1024" \
  --max-seconds 10 \
  --profile concurrent \
  --rate 10
```

And got the following error:

```
...
  File "/root/guidellm/src/guidellm/benchmark/benchmarker.py", line 161, in run
    benchmark = benchmark_class.compile(
        accumulator=accumulator,
        scheduler_state=scheduler_state,
    )
  File "/root/guidellm/src/guidellm/benchmark/schemas/generative/benchmark.py", line 134, in compile
    metrics=GenerativeMetrics.compile(accumulator),
            ~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
  File "/root/guidellm/src/guidellm/benchmark/schemas/generative/metrics.py", line 797, in compile
    incomplete = accumulator.incomplete.get_within_range(start_time, end_time)
  File "/root/guidellm/src/guidellm/benchmark/schemas/generative/accumulator.py", line 623, in get_within_range
    if (stats.request_end_time >= start_time)
        ^^^^^^^^^^^^^^^^^^^^^^
  File "/root/guidellm/src/guidellm/schemas/request_stats.py", line 81, in request_end_time
    raise ValueError("resolve_end timings should be set but is None.")
ValueError: resolve_end timings should be set but is None.
```

Edit: Seems to be due to the max-seconds constraint.
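One plausible shape for a fix, sketched here with hypothetical names (only `resolve_end` and `request_end_time` appear in the traceback; everything else is assumed), is to filter out requests that never recorded an end time before comparing timestamps:

```python
def get_within_range(requests, start_time: float, end_time: float):
    # Hypothetical guard: requests cut off by --max-seconds never get a
    # resolve_end timestamp, so skip them instead of letting the
    # request_end_time property raise the ValueError shown above.
    return [
        stats
        for stats in requests
        if stats.resolve_end is not None
        and start_time <= stats.request_end_time <= end_time
    ]
```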
Aside from the max-seconds bug, here are some high-level notes:
- Needs signoff.
- Needs cleanup. To retroactively run pre-commit on only the changes: `pre-commit run --from $(git merge-base main@{u} HEAD) --to HEAD`
- Only glanced over the accumulator and statistics code. Will do a more in-depth review time permitting, but don't block on it.
- Is warmup/cooldown only seconds now, and not sometimes a percent? Setting `--warmup .1` results in a table entry of `Warm Sec: 0.1`. (One possible convention is sketched below.)
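If both behaviors are meant to be supported, one common convention (purely a sketch of the reviewer's expectation, not the PR's actual behavior; `resolve_warmup_seconds` is hypothetical) is to treat values below 1 as a fraction of the run and values of 1 or more as seconds:

```python
def resolve_warmup_seconds(warmup: float, run_seconds: float) -> float:
    """Hypothetical disambiguation: `--warmup .1` would mean 10% of the
    run's duration, while `--warmup 30` would mean 30 seconds."""
    return warmup * run_seconds if warmup < 1 else warmup

print(resolve_warmup_seconds(0.1, 600))  # 60.0 seconds
print(resolve_warmup_seconds(30, 600))   # 30.0 seconds
```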
Force-pushed from 73e7539 to 6c7f133.
jaredoconnell left a comment:
I can confirm that the max-seconds error is fixed. I'll do more testing tomorrow, but it works for me.
The new CLI table outputs are a little busy to look at, but they work. Adding more horizontal padding may help, but it isn't necessary.
jaredoconnell left a comment:
This doesn't have anything obvious that I am going to block on. We just need to fix that test dependency.
LGTM. Either land #440 first for the test fixes, or ignore.
Commits:
- …or the latest state of refactor (Signed-off-by: Mark Kurtz <mark.kurtz@neuralmagic.com>)
- …ons, metrics, and outputs (Signed-off-by: Mark Kurtz <mark.kurtz@neuralmagic.com>)
- Co-authored-by: Samuel Monson <smonson@redhat.com>; Signed-off-by: Mark Kurtz <mark.j.kurtz@gmail.com>; Signed-off-by: Mark Kurtz <mark.kurtz@neuralmagic.com>
- …d during testing and review (Signed-off-by: Mark Kurtz <mark.kurtz@neuralmagic.com>)
- Plus several additional commits, each Signed-off-by: Mark Kurtz <mark.kurtz@neuralmagic.com>
Force-pushed from ab6f2eb to e6af4b2.
Signed-off-by: Mark Kurtz <mark.kurtz@neuralmagic.com>
Signed-off-by: Jared O'Connell <joconnel@redhat.com>
jaredoconnell left a comment:
For the throughput + ramp-up case, is it expected that it jumps from very few requests (one or two per second) to full throughput, rather than scaling linearly? (A linear ramp is sketched below for reference.)
No issues stood out during my most recent code review. We just need to clarify this and fix the tests.
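For reference, the linear scaling the question describes would look roughly like this (a sketch of the expectation only; `rampup_rate` is hypothetical and not GuideLLM's scheduler):

```python
def rampup_rate(elapsed_s: float, rampup_s: float, target_rate: float) -> float:
    """Hypothetical linear ramp: scale the request rate with elapsed time
    until the ramp window closes, then hold at the target rate."""
    if elapsed_s >= rampup_s:
        return target_rate
    return target_rate * (elapsed_s / rampup_s)

print(rampup_rate(5, 20, 10))   # 2.5 requests/sec a quarter of the way in
print(rampup_rate(20, 20, 10))  # 10.0 requests/sec once the ramp completes
```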
Summary
Details
Test Plan
Related Issues
Use of AI
## WRITTEN BY AI ##