kcz/support-for-video-in-benchmark #2975
base: master
Conversation
1. Enable video preprocessing for the Qwen VL model.
   Add `ov::Property<std::vector<ov::Tensor>> videos{"videos"};`
2. Support mixing image and video inputs.
3. The main updates for the Qwen-VL series models:
   - For video: with 2-in-1 merging, if 9 images are input, only 5 images are actually processed.
   - For image: with 2-in-1 merging, each image is only doubled, so if we input 9 images, we still actually process 9 images.
   - Introduce an "`If`" node to merge the video and image preprocessing into one OV subgraph.

**tickets**: CVS-173219
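The frame counts described above can be sketched as follows. This is an illustrative model of the 2-in-1 merging arithmetic from the description, not the actual OpenVINO GenAI implementation; the function name is hypothetical.

```python
# Illustrative sketch (not the actual GenAI code) of the "2-in-1" merging
# counts described above: video frames are merged pairwise, so N inputs
# yield ceil(N / 2) processed frames (9 -> 5), while each image is doubled
# and merged back with its copy, so N input images still yield N.
from math import ceil

def processed_count(n_inputs: int, is_video: bool) -> int:
    if is_video:
        return ceil(n_inputs / 2)  # consecutive frames merged two-into-one
    return n_inputs  # each image is duplicated, then merged with its copy

print(processed_count(9, is_video=True))   # 5
print(processed_count(9, is_video=False))  # 9
```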
---------
Signed-off-by: xipingya <xiping.yan@intel.com>
Signed-off-by: xiping.yan <xiping.yan@intel.com>
Co-authored-by: Wanglei Shen <wanglei.shen@intel.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Chen Peter <peter.chen@intel.com>
Co-authored-by: Artur Paniukov <chgk1101@gmail.com>
Co-authored-by: Roman Kazantsev <roman.kazantsev@intel.com>
Pull Request Overview
This PR adds support for video input in the LLM benchmark tool, enabling benchmarking of visual language models with video data alongside existing image support.
Key Changes:
- Added video processing capability through a new `make_video_tensor` function that reads video files and converts them to frame tensors
- Extended benchmark functions to accept and process video inputs in addition to images
- Updated JSON parsing to handle video file paths in prompt configurations
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| tools/llm_bench/task/visual_language_generation.py | Added video input handling, parameter passing for frame control, and validation to prevent simultaneous media/video specification |
| tools/llm_bench/requirements.txt | Added opencv-python dependency for video processing |
| tools/llm_bench/llm_bench_utils/prompt_utils.py | Implemented video frame extraction and decimation logic with new make_video_tensor function |
| tools/llm_bench/llm_bench_utils/parse_json_data.py | Refactored JSON parsing with shared validation logic and added video key support |
| tools/llm_bench/benchmark.py | Added command-line argument for video frame control |
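A minimal sketch of what the frame-extraction-and-decimation logic in `prompt_utils.py` might look like. The `decym_frames` name comes from the diff, but the selection strategy here (roughly `n` evenly spaced frames for a positive value, every `|n|`-th frame for a negative one, following the `# or decimation factor if negative` comment in the code) is an assumption, not the merged implementation.

```python
# Hedged sketch of frame decimation for make_video_tensor; the behavior for
# positive vs. negative decym_frames is inferred from the review comments,
# not taken from the merged implementation.
def decimate_frames(frames, decym_frames=None):
    if decym_frames is None or int(decym_frames) == 0:
        return frames  # no decimation requested
    n = int(decym_frames)
    if n > 0:
        # positive: keep roughly n evenly spaced frames
        step = max(len(frames) // n, 1)
        return frames[::step][:n]
    # negative: treat |n| as a decimation factor (keep every |n|-th frame)
    return frames[::-n]

frames = list(range(10))
print(decimate_frames(frames, 5))   # [0, 2, 4, 6, 8]
print(decimate_frames(frames, -3))  # [0, 3, 6, 9]
```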
```python
@print_video_frames_number_and_convert_to_tensor
def make_video_tensor(video_path, decym_frames=None):
    supported_files = set([".mp4"])
```
Copilot AI · Nov 10, 2025
[nitpick] The set literal syntax `{'.mp4'}` is more idiomatic than `set(['.mp4'])` for a single-element set.

```diff
- supported_files = set([".mp4"])
+ supported_files = {".mp4"}
```
```python
supported_files = set([".mp4"])

assert os.path.exists(video_path), f"no input video file: {video_path}"
assert video_path.suffix.lower() in supported_files, "no supported video file"
```
Copilot AI · Nov 10, 2025
Error messages should be more descriptive and grammatically correct. Consider: 'Input video file not found: {video_path}' and 'Unsupported video file format. Supported formats: .mp4'.

```diff
- assert video_path.suffix.lower() in supported_files, "no supported video file"
+ assert video_path.suffix.lower() in supported_files, (
+     f"Unsupported video file format for input: {video_path}. Supported formats: {', '.join(supported_files)}"
+ )
```
```python
new_frame = np.zeros(shape, dtype)

width, height = pil_image.size
log.info(f"Video size: {width}x{height}")
for x in range(0, width):
    for y in range(0, height):
        new_frame[y, x] = frame_rgb[y, x]
```
Copilot AI · Nov 10, 2025
Lines 55-65 create a `new_frame` array and manually copy pixels but then discard it by appending `np.array(pil_image)` instead. This entire block (lines 55-65) serves no purpose and should be removed, keeping only line 66.

```diff
- new_frame = np.zeros(shape, dtype)
- width, height = pil_image.size
- log.info(f"Video size: {width}x{height}")
- for x in range(0, width):
-     for y in range(0, height):
-         new_frame[y, x] = frame_rgb[y, x]
+ width, height = pil_image.size
+ log.info(f"Video size: {width}x{height}")
```
agree, new_frame is not used
ok, done
```python
    raise RuntimeError('== key word "prompt" does not exist ==')
prompt_data = create_base_prompt(json_data)
if ("media" in json_data) and ("video" in json_data):
    raise ValueError("only one key is avaialble from media & video")
```
Copilot AI · Nov 10, 2025
Corrected spelling of 'avaialble' to 'available'.

```diff
- raise ValueError("only one key is avaialble from media & video")
+ raise ValueError("only one key is available from media & video")
```
```python
vlm_file['media'] = model_utils.resolve_media_file_path(vlm_file.get("media"), args['prompt_file'][0])
if args['prompt_file'] is not None and len(args['prompt_file']) > 0 and 'media' in vlm_file:
    if 'video' in vlm_file:
        raise ValueError('media and video cannot be specify in a single prompt file')
```
Copilot AI · Nov 10, 2025
Corrected grammar: 'specify' should be 'specified'.

```diff
- raise ValueError('media and video cannot be specify in a single prompt file')
+ raise ValueError('media and video cannot be specified in a single prompt file')
```
```python
parser.add_argument("--vocoder_path", type=str, default=None,
                    help="Path to vocoder for text to speech scenarios")
parser.add_argument("-vf", "--video_frames", type=int, default=None,
                    help="controler of video frames to process")
```
Copilot AI · Nov 10, 2025
Corrected spelling of 'controler' to 'controller'.
| help="controler of video frames to process") | |
| help="controller of video frames to process") |
```diff
  iter_data_list, pretrain_time, iter_timestamp = CASE_TO_BENCH[model_args['use_case'].task](
-     model_path, framework, args.device, model_args, args.num_iters, memory_data_collector)
+     model_path, framework, args.device, model_args, args.num_iters,
+     memory_data_collector, args.video_frames)
```
Move `video_frames` into `model_args`, like this: https://github.com/openvinotoolkit/openvino.genai/blob/master/tools/llm_bench/llm_bench_utils/model_utils.py#L123
```python
                    help="Path to .bin or .pt file with speaker embeddings for text to speech scenarios")
parser.add_argument("--vocoder_path", type=str, default=None,
                    help="Path to vocoder for text to speech scenarios")
parser.add_argument("-vf", "--video_frames", type=int, default=None,
```
We also need a `--video` argument so that llm_bench can be run with video from the command line.
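What the reviewer asks for might look like the following sketch. Only `-vf/--video_frames` appears in the diff; the `--video` flag and its help text are assumptions here.

```python
# Sketch of the benchmark's argument parser extended with a hypothetical
# --video flag alongside the -vf/--video_frames option from the diff.
import argparse

parser = argparse.ArgumentParser(prog="llm_bench")
parser.add_argument("--video", type=str, default=None,
                    help="Path to an input video file (e.g. an .mp4)")
parser.add_argument("-vf", "--video_frames", type=int, default=None,
                    help="Number of video frames to process")

args = parser.parse_args(["--video", "clip.mp4", "-vf", "8"])
print(args.video, args.video_frames)  # clip.mp4 8
```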
```python
        new_frame[y, x] = frame_rgb[y, x]
    output_frames.append(np.array(pil_image))

if decym_frames is None:
```
```diff
- if decym_frames is None:
+ if decym_frames is None or int(decym_frames) == 0:
+     return output_frames
```
And I suggest checking `decym_frames` at the step of collecting and analyzing the input args, so that `decym_frames` can't be negative or zero.
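Validating at argument-collection time, as suggested, could be a simple check; the function name below is illustrative, not from the PR.

```python
# Illustrative early validation for the frame-count argument, so a negative
# or zero value is rejected before any video file is read.
def validate_video_frames(value):
    if value is not None and int(value) <= 0:
        raise ValueError(f"--video_frames must be a positive integer, got {value}")
    return value

print(validate_video_frames(8))     # 8
print(validate_video_frames(None))  # None
```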
```python
import llm_bench_utils.output_file
import llm_bench_utils.gen_output_data as gen_output_data
import llm_bench_utils.parse_json_data as parse_json_data
import llm_bench_utils.prompt_utils as pu
```
```diff
- import llm_bench_utils.prompt_utils as pu
+ import llm_bench_utils.prompt_utils as prompt_utils
```
```python
cap = cv2.VideoCapture(video_path)

output_frames = []
while True:
```
Do we need to read all frames if `decym_frames` is set?
```python
shape = np.array(pil_image).shape
dtype = np.array(pil_image).dtype
log.info(f"Video shape: {shape}")
```
Please log once; otherwise, if there are 1000 frames in the video, it will be printed 1000 times.
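One way to address this, as a self-contained sketch: plain nested lists stand in for the decoded frames, and the logger name is illustrative.

```python
# Sketch of logging the video shape only for the first frame instead of on
# every loop iteration, addressing the reviewer's note.
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm_bench")

def collect_frames(raw_frames):
    output_frames = []
    for i, frame in enumerate(raw_frames):
        if i == 0:
            # logged once, not once per frame
            log.info(f"Video size: {len(frame[0])}x{len(frame)}")
        output_frames.append(frame)
    return output_frames

dummy = [[[0, 0], [0, 0]] for _ in range(3)]  # three 2x2 "frames"
print(len(collect_frames(dummy)))  # 3
```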
```python
# or decimation factor if negative

decym_frames = int(decym_frames)
if decym_frames > 0:
```
'controler of video frames to process' is not very clear. I expected that it's a number of frames, so we could just take `output_frames[:decym_frames]`. Do we really need that subsampling? If yes, let's clarify in the help text that it keeps every n-th frame.
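The two readings the reviewer contrasts, shown side by side in a small self-contained example:

```python
# "Number of frames" would be a plain prefix slice, while the subsampling
# interpretation keeps every n-th frame instead.
frames = list(range(12))
n = 3

first_n = frames[:n]     # first n frames
every_nth = frames[::n]  # every n-th frame (subsampling)

print(first_n)    # [0, 1, 2]
print(every_nth)  # [0, 3, 6, 9]
```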
```python
if videos:
    kwargs["videos"] = videos
prefix = '[warm-up]' if num == 0 else '[{}]'.format(num)
log.info(f'{prefix}[P{prompt_index}] Input image nums:{len(images)}')
```
Move it under `if images` and `if videos`, and specify the specific type (image/video) in the message.
```python
if input_data.get("video", None):
    entry = Path(input_data["video"])
    video_tensor = pu.make_video_tensor(entry, required_frames)
    videos.append(video_tensor)
```
`np.array` is not supported for GenAI.

```diff
- videos.append(video_tensor)
+ videos.append(ov.Tensor(video_tensor))
```
```python
if videos:
    input_data["videos"] = videos
```
remove it
```python
for bs_index, in_text in enumerate(prompts):
    llm_bench_utils.output_file.output_input_text(in_text, args, model_precision, prompt_index, bs_index, proc_id)
tok_encode_start = time.perf_counter()
input_data = model.preprocess_inputs(text=prompts[0], image=images[0] if images else None, **processor)
```
```diff
- input_data = model.preprocess_inputs(text=prompts[0], image=images[0] if images else None, **processor)
+ input_data = model.preprocess_inputs(text=prompts[0], image=images[0] if images else None, video=videos[0] if videos else None, **processor)
```
…2985) Port openvinotoolkit#2514

1. Enable video preprocessing for the Qwen VL model.
   Add `ov::Property<std::vector<ov::Tensor>> videos{"videos"};`
2. Support mixing image and video inputs.
3. The main updates for the Qwen-VL series models:
   - For video: with 2-in-1 merging, if 9 images are input, only 5 images are actually processed.
   - For image: with 2-in-1 merging, each image is only doubled, so if we input 9 images, we still actually process 9 images.
   - Introduce an "`If`" node to merge the video and image preprocessing into one OV subgraph.

**tickets**: CVS-173219
…nvinotoolkit#2979)

## Description
This PR updates the preprocessor condition for the deprecated `ALTERNATE` enum value in `KVCrushAnchorPointMode` to avoid conflicts with Windows headers and excludes it on all Windows platforms.

CVS-175618

## Checklist:
- [ ] Tests have been updated or added to cover the new code - N/A
- [x] This patch fully addresses the ticket
- [ ] I have made corresponding changes to the documentation - N/A
This reverts commit 96d778b.
…rs (openvinotoolkit#2997)

## Description
This is a port of openvinotoolkit#2979 to the 2025.4 release branch. This PR updates the preprocessor condition for the deprecated `ALTERNATE` enum value in `KVCrushAnchorPointMode` to avoid conflicts with Windows headers and excludes it on all Windows platforms.

CVS-175618

## Checklist:
- [ ] Tests have been updated or added to cover the new code - N/A
- [x] This patch fully addresses the ticket
- [ ] I have made corresponding changes to the documentation - N/A
Description
CVS-173846