kcz/support-for-video-in-benchmark #2975
base: master
Conversation
1. Enable video preprocessing for the Qwen VL model.
   Add `ov::Property<std::vector<ov::Tensor>> videos{"videos"};`
2. Support mixing image and video inputs.
3. The main updates for the Qwen-VL series models:
   - For video: with 2-in-1 merging, if 9 images are input, only 5 images are actually processed.
   - For image: with 2-in-1 merging, each image is only doubled, so if we input 9 images, we still actually process 9 images.
   - Introduce an "`If`" node to merge the video and image preprocessing into one OV subgraph.

**tickets**: CVS-173219
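The frame counts described above can be sketched as follows. This is an illustrative model of the 2-in-1 merging arithmetic from the description, not the actual OpenVINO GenAI implementation; the function name is hypothetical.

```python
# Illustrative sketch (not the actual GenAI code) of the "2-in-1" merging
# counts described above: video frames are merged pairwise, so N inputs
# yield ceil(N / 2) processed frames (9 -> 5), while each image is doubled
# and merged back with its copy, so N input images still yield N.
from math import ceil

def processed_count(n_inputs: int, is_video: bool) -> int:
    if is_video:
        return ceil(n_inputs / 2)  # consecutive frames merged two-into-one
    return n_inputs  # each image is duplicated, then merged with its copy

print(processed_count(9, is_video=True))   # 5
print(processed_count(9, is_video=False))  # 9
```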
---------
Signed-off-by: xipingya <xiping.yan@intel.com>
Signed-off-by: xiping.yan <xiping.yan@intel.com>
Co-authored-by: Wanglei Shen <wanglei.shen@intel.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Chen Peter <peter.chen@intel.com>
Co-authored-by: Artur Paniukov <chgk1101@gmail.com>
Co-authored-by: Roman Kazantsev <roman.kazantsev@intel.com>
Pull Request Overview
This PR adds support for video input in the LLM benchmark tool, enabling benchmarking of visual language models with video data alongside existing image support.
Key Changes:
- Added video processing capability through a new `make_video_tensor` function that reads video files and converts them to frame tensors
- Extended benchmark functions to accept and process video inputs in addition to images
- Updated JSON parsing to handle video file paths in prompt configurations
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| tools/llm_bench/task/visual_language_generation.py | Added video input handling, parameter passing for frame control, and validation to prevent simultaneous media/video specification |
| tools/llm_bench/requirements.txt | Added opencv-python dependency for video processing |
| tools/llm_bench/llm_bench_utils/prompt_utils.py | Implemented video frame extraction and decimation logic with new make_video_tensor function |
| tools/llm_bench/llm_bench_utils/parse_json_data.py | Refactored JSON parsing with shared validation logic and added video key support |
| tools/llm_bench/benchmark.py | Added command-line argument for video frame control |
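A minimal sketch of what the frame-extraction-and-decimation logic in `prompt_utils.py` might look like. The `decym_frames` name comes from the diff, but the selection strategy here (roughly `n` evenly spaced frames for a positive value, every `|n|`-th frame for a negative one, following the `# or decimation factor if negative` comment in the code) is an assumption, not the merged implementation.

```python
# Hedged sketch of frame decimation for make_video_tensor; the behavior for
# positive vs. negative decym_frames is inferred from the review comments,
# not taken from the merged implementation.
def decimate_frames(frames, decym_frames=None):
    if decym_frames is None or int(decym_frames) == 0:
        return frames  # no decimation requested
    n = int(decym_frames)
    if n > 0:
        # positive: keep roughly n evenly spaced frames
        step = max(len(frames) // n, 1)
        return frames[::step][:n]
    # negative: treat |n| as a decimation factor (keep every |n|-th frame)
    return frames[::-n]

frames = list(range(10))
print(decimate_frames(frames, 5))   # [0, 2, 4, 6, 8]
print(decimate_frames(frames, -3))  # [0, 3, 6, 9]
```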
```python
@print_video_frames_number_and_convert_to_tensor
def make_video_tensor(video_path, decym_frames=None):
    supported_files = set([".mp4"])
```
Copilot AI · Nov 10, 2025
[nitpick] The set literal syntax `{'.mp4'}` is more idiomatic than `set(['.mp4'])` for a single-element set.

```diff
- supported_files = set([".mp4"])
+ supported_files = {".mp4"}
```
```python
supported_files = set([".mp4"])

assert os.path.exists(video_path), f"no input video file: {video_path}"
assert video_path.suffix.lower() in supported_files, "no supported video file"
```
Copilot AI · Nov 10, 2025
Error messages should be more descriptive and grammatically correct. Consider: 'Input video file not found: {video_path}' and 'Unsupported video file format. Supported formats: .mp4'.

```diff
- assert video_path.suffix.lower() in supported_files, "no supported video file"
+ assert video_path.suffix.lower() in supported_files, (
+     f"Unsupported video file format for input: {video_path}. Supported formats: {', '.join(supported_files)}"
+ )
```
```python
new_frame = np.zeros(shape, dtype)

width, height = pil_image.size
log.info(f"Video size: {width}x{height}")
for x in range(0, width):
    for y in range(0, height):
        new_frame[y, x] = frame_rgb[y, x]
```
Copilot AI · Nov 10, 2025
Lines 55-65 create a `new_frame` array and manually copy pixels but then discard it by appending `np.array(pil_image)` instead. This entire block (lines 55-65) serves no purpose and should be removed, keeping only line 66.

```diff
- new_frame = np.zeros(shape, dtype)
- width, height = pil_image.size
- log.info(f"Video size: {width}x{height}")
- for x in range(0, width):
-     for y in range(0, height):
-         new_frame[y, x] = frame_rgb[y, x]
+ width, height = pil_image.size
+ log.info(f"Video size: {width}x{height}")
```
agree, new_frame is not used
ok, done
```python
    raise RuntimeError('== key word "prompt" does not exist ==')
prompt_data = create_base_prompt(json_data)
if ("media" in json_data) and ("video" in json_data):
    raise ValueError("only one key is avaialble from media & video")
```
Copilot AI · Nov 10, 2025
Corrected spelling of 'avaialble' to 'available'.

```diff
- raise ValueError("only one key is avaialble from media & video")
+ raise ValueError("only one key is available from media & video")
```
```python
vlm_file['media'] = model_utils.resolve_media_file_path(vlm_file.get("media"), args['prompt_file'][0])
if args['prompt_file'] is not None and len(args['prompt_file']) > 0 and 'media' in vlm_file:
    if 'video' in vlm_file:
        raise ValueError('media and video cannot be specify in a single prompt file')
```
Copilot AI · Nov 10, 2025
Corrected grammar: 'specify' should be 'specified'.

```diff
- raise ValueError('media and video cannot be specify in a single prompt file')
+ raise ValueError('media and video cannot be specified in a single prompt file')
```
```python
parser.add_argument("--vocoder_path", type=str, default=None,
                    help="Path to vocoder for text to speech scenarios")
parser.add_argument("-vf", "--video_frames", type=int, default=None,
                    help="controler of video frames to process")
```
Copilot AI · Nov 10, 2025
Corrected spelling of 'controler' to 'controller'.
| help="controler of video frames to process") | |
| help="controller of video frames to process") |
```diff
  iter_data_list, pretrain_time, iter_timestamp = CASE_TO_BENCH[model_args['use_case'].task](
-     model_path, framework, args.device, model_args, args.num_iters, memory_data_collector)
+     model_path, framework, args.device, model_args, args.num_iters,
+     memory_data_collector, args.video_frames)
```
Move `video_frames` into `model_args`, like this: https://github.com/openvinotoolkit/openvino.genai/blob/master/tools/llm_bench/llm_bench_utils/model_utils.py#L123
```python
                    help="Path to .bin or .pt file with speaker embeddings for text to speech scenarios")
parser.add_argument("--vocoder_path", type=str, default=None,
                    help="Path to vocoder for text to speech scenarios")
parser.add_argument("-vf", "--video_frames", type=int, default=None,
```
We also need a `--video` argument so that llm_bench can be run with video from the command line.
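What the reviewer asks for might look like the following sketch. Only `-vf/--video_frames` appears in the diff; the `--video` flag and its help text are assumptions here.

```python
# Sketch of the benchmark's argument parser extended with a hypothetical
# --video flag alongside the -vf/--video_frames option from the diff.
import argparse

parser = argparse.ArgumentParser(prog="llm_bench")
parser.add_argument("--video", type=str, default=None,
                    help="Path to an input video file (e.g. an .mp4)")
parser.add_argument("-vf", "--video_frames", type=int, default=None,
                    help="Number of video frames to process")

args = parser.parse_args(["--video", "clip.mp4", "-vf", "8"])
print(args.video, args.video_frames)  # clip.mp4 8
```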
```python
        new_frame[y, x] = frame_rgb[y, x]
    output_frames.append(np.array(pil_image))

if decym_frames is None:
```
```diff
- if decym_frames is None:
+ if decym_frames is None or int(decym_frames) == 0:
+     return output_frames
```
And I suggest checking `decym_frames` at the step of collecting and analyzing the input args, so that `decym_frames` can't be negative or zero.
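Validating at argument-collection time, as suggested, could be a simple check; the function name below is illustrative, not from the PR.

```python
# Illustrative early validation for the frame-count argument, so a negative
# or zero value is rejected before any video file is read.
def validate_video_frames(value):
    if value is not None and int(value) <= 0:
        raise ValueError(f"--video_frames must be a positive integer, got {value}")
    return value

print(validate_video_frames(8))     # 8
print(validate_video_frames(None))  # None
```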
```python
import llm_bench_utils.output_file
import llm_bench_utils.gen_output_data as gen_output_data
import llm_bench_utils.parse_json_data as parse_json_data
import llm_bench_utils.prompt_utils as pu
```
```diff
- import llm_bench_utils.prompt_utils as pu
+ import llm_bench_utils.prompt_utils as prompt_utils
```
```python
cap = cv2.VideoCapture(video_path)

output_frames = []
while True:
```
Do we need to read all frames if `decym_frames` is set?
```python
shape = np.array(pil_image).shape
dtype = np.array(pil_image).dtype
log.info(f"Video shape: {shape}")
```
Please log once; otherwise, if there are 1000 frames in the video, it will be printed 1000 times.
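One way to address this, as a self-contained sketch: plain nested lists stand in for the decoded frames, and the logger name is illustrative.

```python
# Sketch of logging the video shape only for the first frame instead of on
# every loop iteration, addressing the reviewer's note.
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm_bench")

def collect_frames(raw_frames):
    output_frames = []
    for i, frame in enumerate(raw_frames):
        if i == 0:
            # logged once, not once per frame
            log.info(f"Video size: {len(frame[0])}x{len(frame)}")
        output_frames.append(frame)
    return output_frames

dummy = [[[0, 0], [0, 0]] for _ in range(3)]  # three 2x2 "frames"
print(len(collect_frames(dummy)))  # 3
```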
```python
# or decimation factor if negative

decym_frames = int(decym_frames)
if decym_frames > 0:
```
'controler of video frames to process' is not very clear. I expected that it's a number of frames, so we could just take `output_frames[:decym_frames]`. Do we really need that subsampling? If yes, let's clarify in the help text that it keeps every n-th frame.
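The two readings the reviewer contrasts, shown side by side in a small self-contained example:

```python
# "Number of frames" would be a plain prefix slice, while the subsampling
# interpretation keeps every n-th frame instead.
frames = list(range(12))
n = 3

first_n = frames[:n]     # first n frames
every_nth = frames[::n]  # every n-th frame (subsampling)

print(first_n)    # [0, 1, 2]
print(every_nth)  # [0, 3, 6, 9]
```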
```python
if videos:
    kwargs["videos"] = videos
prefix = '[warm-up]' if num == 0 else '[{}]'.format(num)
log.info(f'{prefix}[P{prompt_index}] Input image nums:{len(images)}')
```
Move it under `if images` and `if videos`, and specify the specific type (image/video) in the message.
```python
if input_data.get("video", None):
    entry = Path(input_data["video"])
    video_tensor = pu.make_video_tensor(entry, required_frames)
    videos.append(video_tensor)
```
`np.array` is not supported for GenAI.

```diff
- videos.append(video_tensor)
+ videos.append(ov.Tensor(video_tensor))
```
```python
if videos:
    input_data["videos"] = videos
```
remove it
```python
for bs_index, in_text in enumerate(prompts):
    llm_bench_utils.output_file.output_input_text(in_text, args, model_precision, prompt_index, bs_index, proc_id)
tok_encode_start = time.perf_counter()
input_data = model.preprocess_inputs(text=prompts[0], image=images[0] if images else None, **processor)
```
```diff
- input_data = model.preprocess_inputs(text=prompts[0], image=images[0] if images else None, **processor)
+ input_data = model.preprocess_inputs(text=prompts[0], image=images[0] if images else None, video=videos[0] if videos else None, **processor)
```
…2985) Port openvinotoolkit#2514

1. Enable video preprocessing for the Qwen VL model.
   Add `ov::Property<std::vector<ov::Tensor>> videos{"videos"};`
2. Support mixing image and video inputs.
3. The main updates for the Qwen-VL series models:
   - For video: with 2-in-1 merging, if 9 images are input, only 5 images are actually processed.
   - For image: with 2-in-1 merging, each image is only doubled, so if we input 9 images, we still actually process 9 images.
   - Introduce an "`If`" node to merge the video and image preprocessing into one OV subgraph.

**tickets**: CVS-173219
…nvinotoolkit#2979)

## Description
This PR updates the preprocessor condition for the deprecated `ALTERNATE` enum value in `KVCrushAnchorPointMode` to avoid conflicts with Windows headers and excludes it on all Windows platforms.

CVS-175618

## Checklist:
- [ ] Tests have been updated or added to cover the new code - N/A
- [x] This patch fully addresses the ticket
- [ ] I have made corresponding changes to the documentation - N/A
This reverts commit 96d778b.
…rs (openvinotoolkit#2997)

## Description
This is a port of openvinotoolkit#2979 to the 2025.4 release branch. This PR updates the preprocessor condition for the deprecated `ALTERNATE` enum value in `KVCrushAnchorPointMode` to avoid conflicts with Windows headers and excludes it on all Windows platforms.

CVS-175618

## Checklist:
- [ ] Tests have been updated or added to cover the new code - N/A
- [x] This patch fully addresses the ticket
- [ ] I have made corresponding changes to the documentation - N/A
Description
CVS-173846