Conversation

@vanbasten23 (Collaborator) commented Nov 7, 2025

Purpose

This PR moves the LoRA utils at tests/lora/utils.py into vllm/lora/ so that LoRA tests in plugins can use the util functions. Without this change, plugin LoRA tests have to copy and paste the util functions, because plugins cannot import them directly (from vllm.tests.lora.utils import DummyLoRAManager fails with ModuleNotFoundError: No module named 'vllm.tests').
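For illustration, a minimal sketch of the import contrast, assuming the utilities end up in vllm/lora/utils.py as in this PR (the commented-out line shows the failing status quo):

# Status quo: fails from an installed wheel because tests/ is not packaged
# from vllm.tests.lora.utils import DummyLoRAManager  # ModuleNotFoundError

# After this PR: the utilities live inside the installed vllm package
from vllm.lora.utils import DummyLoRAManager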

Test Plan

CI

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing the test command.
  • The test results, such as pasting the before/after results comparison or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com>
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request refactors the LoRA test utilities by moving them from tests/lora/utils.py into vllm/lora/utils.py. This change makes these utilities accessible to plugins, which cannot import from the vllm.tests module. The changes primarily consist of moving the file content and updating the corresponding import statements in the test files. While this achieves the goal, I have a suggestion regarding code organization to improve long-term maintainability.

Comment on lines +301 to +705
# The utils below are used in the tests.


class DummyLoRAManager:
    def __init__(self, device: torch.device = "cuda:0"):
        super().__init__()
        self._loras: dict[str, LoRALayerWeights] = {}
        self._device = device

    def set_module_lora(self, module_name: str, lora: LoRALayerWeights):
        self._loras[module_name] = lora

    def get_module_lora(self, module_name: str) -> LoRALayerWeights:
        return self._loras[module_name]

    def init_random_lora(
        self,
        module_name: str,
        weight: torch.Tensor,
        rank: int = 8,
        generate_embeddings_tensor: int = 0,
    ):
        lora = LoRALayerWeights(
            module_name,
            rank=rank,
            lora_alpha=1,
            lora_a=torch.rand(
                [rank, weight.shape[1]], dtype=weight.dtype, device=self._device
            ),
            lora_b=torch.rand(
                [weight.shape[0], rank], dtype=weight.dtype, device=self._device
            ),
        )
        if generate_embeddings_tensor:
            lora.embeddings_tensor = torch.rand(
                5,
                generate_embeddings_tensor,
                dtype=weight.dtype,
                device=self._device,
            )
        self.set_module_lora(module_name, lora)

        return lora

    def init_lora(
        self,
        module_name: str,
        input_dim: int,
        output_dim: int,
        rank=8,
        noop=False,
        embeddings_tensor=None,
    ):
        lora = LoRALayerWeights(
            module_name,
            rank=rank,
            lora_alpha=1,
            lora_a=torch.rand([rank, input_dim], device="cuda"),
            lora_b=torch.rand([output_dim, input_dim], device="cuda"),
            embeddings_tensor=embeddings_tensor,
        )
        self.set_module_lora(module_name, lora)
        return lora

    def reset_lora(self):
        self._loras = {}

    def init_packed_lora(
        self,
        module_name: str,
        input_dim: int,
        output_dims: list[int],
        noop_lora_index: list[int] | None = None,
        rank: int = 8,
    ):
        base_loras: list[LoRALayerWeights] = []
        noop_lora_index_set = set(noop_lora_index or [])

        for i, out_dim in enumerate(output_dims):
            base_lora = self.init_lora(
                module_name + "_000_" + str(i),
                input_dim,
                out_dim,
                rank=rank,
                noop=i in noop_lora_index_set,
            )
            base_loras.append(base_lora)
        packed_lora = PackedLoRALayerWeights.pack(base_loras)
        self.set_module_lora(module_name, packed_lora)
        return packed_lora


def assert_close(a, b):
    rtol, atol = {
        torch.float16: (6e-2, 6e-2),
        torch.bfloat16: (6e-2, 6e-2),
        torch.float32: (1e-2, 1e-2),
    }[a.dtype]
    torch.testing.assert_close(a, b, rtol=rtol, atol=atol)


@dataclass
class PunicaTensors:
    inputs_tensor: torch.Tensor
    lora_weights: torch.Tensor | list[torch.Tensor]
    our_out_tensor: torch.Tensor
    ref_out_tensor: torch.Tensor
    b_seq_start_loc: torch.Tensor
    prompt_lora_mapping: torch.Tensor
    seq_len_tensor: torch.Tensor
    token_lora_mapping: torch.Tensor

    def meta(self) -> tuple[int, int]:
        """
        Infer max_seq_length and token_nums from the tensors
        and return them.
        """
        max_seq_length = self.seq_len_tensor.max()
        token_nums = self.seq_len_tensor.sum().item()
        if isinstance(max_seq_length, tuple):
            max_seq_length = max_seq_length[0].item()
        else:
            max_seq_length = max_seq_length.item()
        return max_seq_length, token_nums


def generate_data(
    batches,
    hidden_size,
    lora_nums,
    max_rank,
    seq_length,
    dtype,
    op_type,
    device,
) -> PunicaTensors:
    seq_len_tensor = torch.randint(seq_length, seq_length + 1, (batches,)).to(device)
    b_seq_start_loc = torch.cumsum(
        torch.tensor([0] + seq_len_tensor[:-1].tolist(), dtype=torch.long),
        dim=0,
    ).to(device)
    total_tokens = seq_len_tensor.sum()
    if op_type == "shrink":
        inputs_tensor = torch.rand((total_tokens, hidden_size), dtype=dtype).to(device)
        lora_weights = torch.rand(
            (lora_nums, max_rank, hidden_size),  # col-major
            dtype=dtype,
        ).to(device)
        # shrink op needs atomic_add, so the output is initialized to 0
        ref_out_tensor = torch.zeros(
            (total_tokens, max_rank), dtype=dtype, device=inputs_tensor.device
        )
        # NOTE: the shrink kernel uses torch.float32 as its output type
        our_out_tensor = torch.zeros((total_tokens, max_rank), dtype=torch.float32).to(
            device
        )
    else:
        inputs_tensor = torch.rand(
            (total_tokens, max_rank),
            dtype=dtype,
        ).to(device)
        lora_weights = torch.rand(
            (lora_nums, hidden_size, max_rank),  # col-major
            dtype=dtype,
        ).to(device)
        # expand op needs to complete y += a @ lora_b, so the output is
        # initialized randomly
        ref_out_tensor = torch.rand(
            (total_tokens, hidden_size),
            dtype=dtype,
        ).to(device)
        # Ensure the same input.
        our_out_tensor = ref_out_tensor.clone()
    lora_indices_tensor = torch.randint(
        0, lora_nums - 1 if lora_nums > 1 else 1, (batches,)
    ).to(device)
    indices = torch.zeros((total_tokens), dtype=torch.long).to(device)
    current_offset = 0
    for b_id in range(batches):
        lora_index = lora_indices_tensor[b_id]
        indices[current_offset : current_offset + seq_len_tensor[b_id]].copy_(
            lora_index
        )
        current_offset += seq_len_tensor[b_id].item()

    return PunicaTensors(
        inputs_tensor,
        lora_weights,
        our_out_tensor,
        ref_out_tensor,
        b_seq_start_loc,
        lora_indices_tensor,
        seq_len_tensor,
        indices,
    )


def generate_data_for_expand_nslices(
    batches,
    hidden_size,
    lora_nums,
    max_rank,
    seq_length,
    dtype,
    nslices,
    device,
) -> PunicaTensors:
    seq_len_tensor = torch.randint(seq_length, seq_length + 1, (batches,)).to(device)
    b_seq_start_loc = torch.cumsum(
        torch.tensor([0] + seq_len_tensor[:-1].tolist(), dtype=torch.long),
        dim=0,
    ).to(device)
    total_tokens = seq_len_tensor.sum()
    inputs_tensor = torch.rand(
        (total_tokens, max_rank),
        dtype=dtype,
    ).to(device)
    lora_weights_lst = []
    for _ in range(nslices):
        lora_weights_lst.append(
            torch.rand(
                (lora_nums, hidden_size, max_rank),  # col-major
                dtype=dtype,
            ).to(device)
        )
    # expand op needs to complete y += a @ lora_b, so the output is
    # initialized randomly
    ref_out_tensor = torch.rand((total_tokens, hidden_size * nslices), dtype=dtype).to(
        device
    )
    # Ensure the same input.
    our_out_tensor = ref_out_tensor.clone()
    lora_indices_tensor = torch.randint(
        0, lora_nums - 1 if lora_nums > 1 else 1, (batches,)
    )
    indices = torch.zeros((total_tokens), dtype=torch.long).to(device)
    current_offset = 0
    for b_id in range(batches):
        lora_index = lora_indices_tensor[b_id]
        indices[current_offset : current_offset + seq_len_tensor[b_id]] = (
            lora_index.item()
        )
        current_offset += seq_len_tensor[b_id].item()

    lora_indices_tensor = lora_indices_tensor.to(device)
    return PunicaTensors(
        inputs_tensor,
        lora_weights_lst,
        our_out_tensor,
        ref_out_tensor,
        b_seq_start_loc,
        lora_indices_tensor,
        seq_len_tensor,
        indices,
    )


def generate_data_for_nslices(
    batches,
    hidden_size,
    lora_nums,
    max_rank,
    seq_length,
    nslices,
    dtype,
    op_type,
    device,
) -> PunicaTensors:
    seq_len_tensor = torch.randint(seq_length, seq_length + 1, (batches,)).to(device)
    b_seq_start_loc = torch.cumsum(
        torch.tensor([0] + seq_len_tensor[:-1].tolist(), dtype=torch.long),
        dim=0,
    ).to(device)
    total_tokens = seq_len_tensor.sum()

    lora_weights_lst = []
    if op_type == "shrink":
        inputs_tensor = torch.rand((total_tokens, hidden_size), dtype=dtype).to(device)

        for _ in range(nslices):
            if op_type == "shrink":
                lora_weights_lst.append(
                    torch.rand(
                        (lora_nums, max_rank, hidden_size),  # col-major
                        dtype=dtype,
                    ).to(device)
                )
        # NOTE: the shrink kernel uses torch.float32 as its output type
        # shrink op needs atomic_add, so the output is initialized to 0
        our_out_tensor = torch.zeros(
            (nslices, total_tokens, max_rank),
            dtype=torch.float32,
        ).to(device)
    else:
        inputs_tensor = torch.rand(
            (nslices, total_tokens, max_rank),
            dtype=dtype,
        ).to(device)
        for _ in range(nslices):
            lora_weights_lst.append(
                torch.rand(
                    (lora_nums, hidden_size, max_rank),  # col-major
                    dtype=dtype,
                ).to(device)
            )
        # expand op needs to complete y += a @ lora_b, so the output is
        # initialized randomly
        our_out_tensor = torch.rand(
            (total_tokens, hidden_size * nslices), dtype=dtype
        ).to(device)

    # Ensure the same input.
    ref_out_tensor = our_out_tensor.clone()
    lora_indices_tensor = torch.randint(
        0, lora_nums - 1 if lora_nums > 1 else 1, (batches,)
    )
    indices = torch.zeros((total_tokens), dtype=torch.long).to(device)
    current_offset = 0
    for b_id in range(batches):
        lora_index = lora_indices_tensor[b_id]
        indices[current_offset : current_offset + seq_len_tensor[b_id]] = (
            lora_index.item()
        )
        current_offset += seq_len_tensor[b_id].item()

    lora_indices_tensor = lora_indices_tensor.to(device)
    return PunicaTensors(
        inputs_tensor,
        lora_weights_lst,
        our_out_tensor,
        ref_out_tensor,
        b_seq_start_loc,
        lora_indices_tensor,
        seq_len_tensor,
        indices,
    )


def create_peft_lora(
    model: torch.nn.Module,
    save_dir: str,
    target_modules: list[str],
    rank: int = 8,
    alpha: int = 16,
    dropout: float = 0.1,
    lora_dtype: torch.dtype = torch.float16,
) -> dict[str, torch.Tensor]:
    lora_weights = {}
    adapter_config = {
        "peft_type": "LORA",
        "auto_mapping": None,
        "base_model_name_or_path": "dummy_model",
        "revision": None,
        "task_type": "CAUSAL_LM",
        "inference_mode": False,
        "r": rank,
        "lora_alpha": alpha,
        "lora_dropout": dropout,
        "fan_in_fan_out": False,
        "bias": "none",
        "modules_to_save": None,
        "init_lora_weights": True,
        "layers_to_transform": None,
        "layers_pattern": None,
        "target_modules": target_modules,
        "exclude_modules": None,
        "use_rslora": False,
        "use_dora": False,
        "loftq_config": None,
    }

    for module_name in target_modules:
        module = model
        for attr in module_name.split("."):
            module = getattr(module, attr)

        if hasattr(module, "input_size") and hasattr(module, "output_size"):
            in_features = module.input_size
            out_features = module.output_size

        elif hasattr(module, "embedding_dim") and hasattr(module, "num_embeddings"):
            # ParallelLMHead
            in_features = module.embedding_dim
            out_features = module.num_embeddings
        else:
            raise ValueError(f"Unable to determine dimensions for module {module_name}")

        lora_A = torch.randn(rank, in_features, dtype=lora_dtype)

        torch.nn.init.kaiming_uniform_(lora_A, a=5**0.5)

        lora_B = torch.zeros(out_features, rank, dtype=lora_dtype)

        # PEFT style
        lora_weights[f"base_model.model.{module_name}.lora_A.weight"] = lora_A
        lora_weights[f"base_model.model.{module_name}.lora_B.weight"] = lora_B

    config_path = os.path.join(save_dir, "adapter_config.json")
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump(adapter_config, f, indent=2, ensure_ascii=False)

    weights_path = os.path.join(save_dir, "adapter_model.safetensors")
    save_file(lora_weights, weights_path)

    return lora_weights
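
For orientation (not part of the diff), a minimal usage sketch of these helpers as a plugin test might call them, assuming they are importable from vllm.lora.utils as proposed in this PR; the module/layer names, shapes, and the CPU device are illustrative only:

import torch

from vllm.lora.utils import DummyLoRAManager, assert_close, generate_data

# Register a random LoRA for a fake 32x16 linear weight and read it back.
manager = DummyLoRAManager(device="cpu")
weight = torch.rand(32, 16)
lora = manager.init_random_lora("layer0.linear", weight, rank=4)
assert manager.get_module_lora("layer0.linear") is lora

# Generate Punica-style tensors for a "shrink" op and sanity-check the outputs.
tensors = generate_data(
    batches=2,
    hidden_size=16,
    lora_nums=2,
    max_rank=4,
    seq_length=8,
    dtype=torch.float32,
    op_type="shrink",
    device="cpu",
)
max_seq_len, token_nums = tensors.meta()
assert token_nums == 2 * 8  # two batches of eight tokens each
assert_close(tensors.our_out_tensor, tensors.ref_out_tensor)  # both start as zeros for "shrink"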
@gemini-code-assist bot (Contributor) left a comment

Severity: high

While moving these utilities makes them accessible to plugins, it introduces test-specific code into the main vllm library. This is generally not ideal for code organization and maintainability as it can pollute the library's public API with internal testing tools and potentially confuse users and future developers.

A better long-term solution would be to create a dedicated testing utility package, for example, vllm.testing, which can be explicitly imported by plugins for their tests. This would keep a clear separation between production code and test code.

For example, you could create a new file vllm/testing/lora_utils.py with the content of these utilities. Plugins could then import from it:

from vllm.testing.lora_utils import DummyLoRAManager

This would require adjusting the packaging configuration to include vllm.testing.
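
For example (a hypothetical sketch, not something this PR changes), with a setuptools-style build a vllm.testing subpackage would just need to be picked up by package discovery; the exact file and keys depend on vLLM's actual build configuration:

# Hypothetical setup.py excerpt; vLLM's real packaging may differ (e.g. pyproject.toml).
from setuptools import find_packages, setup

setup(
    name="vllm",
    # "vllm.*" also matches a new vllm.testing subpackage, provided it has an __init__.py.
    packages=find_packages(include=["vllm", "vllm.*"]),
)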

Given the goal of this PR, the current change is a functional workaround, but I recommend considering a cleaner architectural separation for better maintainability.

A Collaborator left a comment

I personally think the bot's comments are very reasonable.

A Member left a comment

Agree. When we expose a new library API, we should think about the API compatibility commitment we are making. It is not ok to expose an API and arbitrarily break users of that API, nor is it ok to be trapped in a world where we cannot evolve that API.

A separate vllm.testing API or library package would allow us to make different API compatibility guarantees for it. Maybe it's OK to make a breaking change to vllm.testing with only a very short transition period, just enough to avoid instantly breaking lots of third-party CI testing.

@vanbasten23 (Collaborator, Author) commented Nov 10, 2025

Yeah, it makes sense to me. I think I have 2 options:

  • Option 1: Put tests/ into the vllm wheel and map tests to vllm.tests so that from vllm.tests.lora.utils import DummyLoRAManager can find the package. But I'm not sure it is desirable to ship all the tests in the vllm wheel.

  • Option 2: Create a new file vllm/lora/testing/utils.py and put DummyLoRAManager there (see the import sketch below). This is less disruptive and also signals to developers that the utils there are for testing and shouldn't be used in production.

I feel option 2 is preferable, but please also let me know what you think or if you have better ideas.
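
For reference, a minimal sketch of what Option 2 would look like from a plugin test's point of view; the module path vllm/lora/testing/utils.py is the one proposed above and does not exist yet:

# Hypothetical import path if Option 2 is adopted.
from vllm.lora.testing.utils import DummyLoRAManager

manager = DummyLoRAManager(device="cpu")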
