- 2025-10-24T16:19:24.784314+0000 | run_oneshot_for_e2e_testing | INFO - ONESHOT KWARGS
- 2025-10-24T16:19:30.518930+0000 | reset | INFO - Compression lifecycle reset
- 2025-10-24T16:19:30.526042+0000 | _create_default_logger | INFO - Logging all LLM Compressor modifier-level logs to sparse_logs/24-10-2025_16.19.30.log
- 2025-10-24T16:19:30.526446+0000 | from_modifiers | INFO - Creating recipe from modifiers
- 2025-10-24T16:19:30.558764+0000 | initialize | INFO - Compression lifecycle initialized for 1 modifiers
- 2025-10-24T16:19:30.559026+0000 | IndependentPipeline | INFO - Inferred `DataFreePipeline` for `QuantizationModifier`
- Updating global scales: 100%|██████████| 356/356 [00:00<00:00, 700362.21it/s]
- Fusing global scales: 1333it [00:00, 606992.43it/s]
- 2025-10-24T16:21:26.199644+0000 | finalize | INFO - Compression lifecycle finalized for 1 modifiers
- 2025-10-24T16:21:40.851405+0000 | post_process | WARNING - Optimized model is not saved. To save, please provide `output_dir` as input arg. Ex. `oneshot(..., output_dir=...)`
- 2025-10-24T16:21:40.876059+0000 | test_vllm | INFO - ================= SAVING TO DISK ======================
- 2025-10-24T16:21:40.876633+0000 | get_model_compressor | INFO - skip_sparsity_compression_stats set to True. Skipping sparsity compression statistic calculations. No sparsity compressor will be applied.
+ 2025-10-28T04:30:26.615508+0000 | finalize | INFO - Compression lifecycle finalized for 1 modifiers
+ 2025-10-28T04:30:41.297993+0000 | post_process | WARNING - Optimized model is not saved. To save, please provide `output_dir` as input arg. Ex. `oneshot(..., output_dir=...)`
+ 2025-10-28T04:30:41.320280+0000 | test_vllm | INFO - ================= SAVING TO DISK ======================
+ 2025-10-28T04:30:41.320754+0000 | get_model_compressor | INFO - skip_sparsity_compression_stats set to True. Skipping sparsity compression statistic calculations. No sparsity compressor will be applied.
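
The post_process warning above is actionable: passing `output_dir` to `oneshot` is all the save step needs. A minimal sketch of that call, assuming llm-compressor's stock `oneshot` API; the model id, scheme details, and output path below are illustrative, not taken from this run:

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Data-free dynamic FP8 quantization, matching the DataFreePipeline the log infers.
recipe = QuantizationModifier(
    targets="Linear",        # quantize the Linear layers
    scheme="FP8_DYNAMIC",    # dynamic FP8, no calibration data needed
    ignore=["lm_head"],      # assumed exclusion, common in FP8 recipes
)

oneshot(
    model="Qwen/Qwen3-VL-30B-A3B-Instruct",                # assumed source checkpoint
    recipe=recipe,
    output_dir="Qwen3-VL-30B-A3B-Instruct-FP8_DYNAMIC",    # without this, nothing is saved
)
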
  [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
  [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
  [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
  [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
  [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
  [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
- (EngineCore_DP0 pid=949834) INFO 10-24 16:27:28 [parallel_state.py:1208] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
- (EngineCore_DP0 pid=949834) INFO 10-24 16:27:28 [topk_topp_sampler.py:55] Using FlashInfer for top-p & top-k sampling.
- (EngineCore_DP0 pid=949834) INFO 10-24 16:27:34 [gpu_model_runner.py:2602] Starting to load model Qwen3-VL-30B-A3B-Instruct-FP8_DYNAMIC...
- (EngineCore_DP0 pid=949834) INFO 10-24 16:27:34 [gpu_model_runner.py:2634] Loading model from scratch...
- (EngineCore_DP0 pid=949834) INFO 10-24 16:27:34 [cuda.py:366] Using Flash Attention backend on V1 engine.
- (EngineCore_DP0 pid=949834) INFO 10-24 16:28:08 [default_loader.py:267] Loading weights took 33.06 seconds
- (EngineCore_DP0 pid=949834) INFO 10-24 16:28:08 [gpu_model_runner.py:2653] Model loading took 30.0579 GiB and 33.710405 seconds
- (EngineCore_DP0 pid=949834) INFO 10-24 16:28:08 [gpu_model_runner.py:3344] Encoder cache will be initialized with a budget of 153600 tokens, and profiled with 1 video items of the maximum feature size.
- (EngineCore_DP0 pid=949834) INFO 10-24 16:28:28 [backends.py:548] Using cache directory: /home/HDCharles/.cache/vllm/torch_compile_cache/0dbf177978/rank_0_0/backbone for vLLM's torch.compile
- (EngineCore_DP0 pid=949834) INFO 10-24 16:28:28 [backends.py:559] Dynamo bytecode transform time: 9.55 s
- (EngineCore_DP0 pid=949834) INFO 10-24 16:28:31 [backends.py:197] Cache the graph for dynamic shape for later use
- (EngineCore_DP0 pid=949834) INFO 10-24 16:29:35 [backends.py:218] Compiling a graph for dynamic shape takes 66.96 s
- (EngineCore_DP0 pid=949834) INFO 10-24 16:29:37 [monitor.py:34] torch.compile takes 76.51 s in total
- (EngineCore_DP0 pid=949834) INFO 10-24 16:29:39 [gpu_worker.py:298] Available KV cache memory: 35.26 GiB
- (EngineCore_DP0 pid=949834) INFO 10-24 16:29:39 [kv_cache_utils.py:1091] Maximum concurrency for 262,144 tokens per request: 1.47x
- (EngineCore_DP0 pid=949834) INFO 10-24 16:29:55 [gpu_model_runner.py:3480] Graph capturing finished in 16 secs, took 1.04 GiB
- (EngineCore_DP0 pid=949834) INFO 10-24 16:29:55 [core.py:210] init engine (profile, create kv cache, warmup model) took 107.13 seconds
- INFO 10-24 16:29:59 [llm.py:306] Supported_tasks: ['generate']
+ (EngineCore_DP0 pid=3807648) INFO 10-28 04:31:28 [parallel_state.py:1208] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
+ (EngineCore_DP0 pid=3807648) INFO 10-28 04:31:29 [topk_topp_sampler.py:55] Using FlashInfer for top-p & top-k sampling.
+ (EngineCore_DP0 pid=3807648) INFO 10-28 04:31:33 [gpu_model_runner.py:2602] Starting to load model Qwen3-VL-30B-A3B-Instruct-FP8_DYNAMIC...
+ (EngineCore_DP0 pid=3807648) INFO 10-28 04:31:34 [gpu_model_runner.py:2634] Loading model from scratch...
+ (EngineCore_DP0 pid=3807648) INFO 10-28 04:31:34 [cuda.py:366] Using Flash Attention backend on V1 engine.
+ (EngineCore_DP0 pid=3807648) INFO 10-28 04:32:06 [default_loader.py:267] Loading weights took 32.38 seconds
+ (EngineCore_DP0 pid=3807648) INFO 10-28 04:32:07 [gpu_model_runner.py:2653] Model loading took 30.0579 GiB and 32.616893 seconds
+ (EngineCore_DP0 pid=3807648) INFO 10-28 04:32:07 [gpu_model_runner.py:3344] Encoder cache will be initialized with a budget of 153600 tokens, and profiled with 1 video items of the maximum feature size.
+ (EngineCore_DP0 pid=3807648) INFO 10-28 04:32:26 [backends.py:548] Using cache directory: /home/HDCharles/.cache/vllm/torch_compile_cache/0dbf177978/rank_0_0/backbone for vLLM's torch.compile
+ (EngineCore_DP0 pid=3807648) INFO 10-28 04:32:26 [backends.py:559] Dynamo bytecode transform time: 9.62 s
+ (EngineCore_DP0 pid=3807648) INFO 10-28 04:32:29 [backends.py:164] Directly load the compiled graph(s) for dynamic shape from the cache, took 3.029 s
+ (EngineCore_DP0 pid=3807648) INFO 10-28 04:32:30 [monitor.py:34] torch.compile takes 9.62 s in total
+ (EngineCore_DP0 pid=3807648) INFO 10-28 04:32:31 [gpu_worker.py:298] Available KV cache memory: 35.25 GiB
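
The substantive difference between the two runs is the torch.compile path: the 10-24 run compiled the dynamic-shape graph from scratch (66.96 s, 76.51 s of torch.compile in total), while the 10-28 run loads the compiled graph directly from the cache directory (3.029 s, 9.62 s in total). The vLLM side of the test amounts to bringing the saved checkpoint up and generating; a minimal sketch, assuming a local path matching the log's model name (prompt and sampling settings are illustrative):

from vllm import LLM, SamplingParams

# Load the FP8_DYNAMIC checkpoint saved by the oneshot run above.
llm = LLM(model="Qwen3-VL-30B-A3B-Instruct-FP8_DYNAMIC")  # assumed local output_dir

# Greedy decoding keeps the smoke test deterministic.
params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)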