
Commit ee385e5

jingxu10 and ZhaoqiongZ authored
bug fix in llm env activate scripts (#5396)
* bug fix in llm env activate scripts
* remove llm training and revert back example training
* update env activate script for bitsandbytes example
* remove llama7b/13b in run_accuracy scripts
* add README for bitsandbytes
* specific the client gpu validated

Co-authored-by: Zheng, Zhaoqiong <zhaoqiong.zheng@intel.com>
1 parent b5f20d5 · commit ee385e5

21 files changed: +92 −1118 lines

dependency_version.json

Lines changed: 2 additions & 2 deletions
@@ -19,8 +19,8 @@
     "commit": "v0.21.0"
   },
   "torch-ccl": {
-    "version": "2.5.0+xpu",
-    "commit": "v2.5.0+xpu"
+    "version": "2.6.0+xpu",
+    "commit": "v2.6.0+xpu"
   },
   "basekit": {
     "dpcpp-cpp-rt": {

examples/gpu/llm/README.md

Lines changed: 7 additions & 5 deletions
@@ -2,7 +2,7 @@

 Here you can find benchmarking scripts for large language models (LLM) text generation. These scripts:

-- Support Llama, GPT-J, Qwen, OPT, Bloom model families and some other models such as ChatGLMv3-6B, Baichuan2-13B and Phi3-mini.
+- Support Llama, GPT-J, Qwen, OPT, Bloom model families and some other models such as Baichuan2-13B and Phi3-mini.
 - Include both single instance and distributed (DeepSpeed) use cases for FP16 optimization.
 - Cover model generation inference with low precision cases for different models with best performance and accuracy (fp16 AMP and weight only quantization)

@@ -28,7 +28,7 @@ docker run -it --rm --privileged -v /dev/dri/by-path:/dev/dri/by-path ipex-llm:2
 cd llm

 # Activate environment variables
-source ./tools/env_activate.sh [inference|fine-tuning]
+source ./tools/env_activate.sh [inference|fine-tuning|bitsandbytes]
 ```

 ### Conda-based environment setup with prebuilt wheel files
@@ -54,7 +54,7 @@ cd examples/gpu/llm
 bash ./tools/env_setup.sh 0x07
 conda deactivate
 conda activate llm
-source ./tools/env_activate.sh [inference|fine-tuning]
+source ./tools/env_activate.sh [inference|fine-tuning|bitsandbytes]
 ```

 ### Docker-based environment setup with compilation from source
@@ -77,7 +77,7 @@ docker run -it --rm --privileged -v /dev/dri/by-path:/dev/dri/by-path ipex-llm:2
 cd llm

 # Activate environment variables
-source ./tools/env_activate.sh [inference|fine-tuning]
+source ./tools/env_activate.sh [inference|fine-tuning|bitsandbytes]
 ```

 ### Conda-based environment setup with compilation from source
@@ -106,7 +106,7 @@ bash ./tools/env_setup.sh 3 <ONEAPI_ROOT_DIR> <AOT>

 conda deactivate
 conda activate llm
-source ./tools/env_activate.sh [inference|fine-tuning]
+source ./tools/env_activate.sh [inference|fine-tuning|bitsandbytes]
 ```

 where <br />
@@ -122,3 +122,5 @@ Inference and fine-tuning are supported in individual directories.
 For inference example scripts, visit the [inference](./inference/) directory.

 For fine-tuning example scripts, visit the [fine-tuning](./fine-tuning/) directory.
+
+For fine-tuning with quantized model, visit the [bitsandbytes](./bitsandbytes/) directory.
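
For orientation, the bracketed argument in the lines above selects which environment the activate script prepares. A minimal usage sketch of the newly added option, assuming one of the setup flows above has already been completed:

```bash
# Sketch: activate the environment for the new bitsandbytes example,
# then change into the directory the README now links to.
cd examples/gpu/llm          # or `cd llm` inside the docker container
source ./tools/env_activate.sh bitsandbytes
cd bitsandbytes
```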
Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
+# LLM Quantized Model Lora-Finetuning Overview
+
+Here you can find the quantized model lora-finetuning scripts for Llama3.
+
+
+
+## Supported Platforms
+
+\* Intel® Data Center GPU Max Series (1550/1100) : support Llama3.1-8B.<br />
+\* Intel® Core™ Ultra Processors with Intel® Arc™ B Series Graphics : support Llama3.2-3B.<br />
+
+## Run Models
+
+**Note**: During the execution, you may need to log in your Hugging Face account to access model files. Refer to [HuggingFace Login](https://huggingface.co/docs/huggingface_hub/quick-start#login)
+
+```
+huggingface-cli login --token <your_token_here>
+```
+
+### Environment Set Up
+Set up environment by following [LLM Environment Set Up](../README.md).
+
+
+### Run Qlora finetuning with quantized model using Bash Script
+
+The related code and run script are prepared in the folder. Run all with the one-click bash script `run_qlora_pvc.sh` or `run_qlora_client.sh`:
+
+
+If you are running on a Data Center Max Series GPU:
+
+```
+bash run_qlora_pvc.sh
+```
+
+If you are running on a Intel Client GPU:
+
+```
+bash run_qlora_client.sh
+```
+
+
+### Run inference with quantized model
+
+```
+# set quant_type and max_new_tokens according to your needs
+python bnb_inf_xpu.py --model_name ${model} --quant_type nf4 --max_new_tokens 64 --device xpu
+```
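
The new README leaves `${model}` to the reader. A hedged example invocation follows; the Hugging Face model ids are assumptions inferred from the "Supported Platforms" list above, not part of the commit:

```bash
# Assumed model ids, matching the platform list in the README above:
#   Data Center Max Series GPUs -> meta-llama/Llama-3.1-8B
#   Arc B Series client GPU     -> meta-llama/Llama-3.2-3B
export model="meta-llama/Llama-3.2-3B"
python bnb_inf_xpu.py --model_name ${model} --quant_type nf4 --max_new_tokens 64 --device xpu
```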
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+transformers==v4.49.0
+tf-keras
+accelerate==1.1.1
+peft==0.14.0
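
These pins can be installed into the activated environment before running the example. A minimal sketch; the file's path is not shown in this diff view, so the working directory below is an assumption:

```bash
# Run from the bitsandbytes example directory (assumed location of the new requirements file).
pip install -r requirements.txt
```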

examples/gpu/llm/fine-tuning/Llama3/README.md

Lines changed: 0 additions & 28 deletions
@@ -18,34 +18,6 @@ huggingface-cli login --token <your_token_here>
 wandb login
 ```

-### Fine-tuning on single card
-
-**Note**:
-Full-finetuning on single card will cause OOM.
-
-Example: Llama 3 8B LoRA fine-tuning on single card. The default dataset `financial_phrasebank` is loaded in `llama3_ft.py`.
-
-```bash
-export TORCH_LLM_ALLREDUCE=1
-
-export model="meta-llama/Meta-Llama-3-8B"
-
-python llama3_ft.py \
-    --model_name_or_path ${model} \
-    --use_flashattn True \
-    --custom_mp True \
-    --use_peft True \
-    --max_seq_length 128 \
-    --output_dir="output" \
-    --evaluation_strategy="epoch" \
-    --learning_rate=1e-3 \
-    --auto_find_batch_size=True \
-    --num_train_epochs=1 \
-    --save_steps=500 \
-    --logging_steps=1 \
-    --save_total_limit=8
-```
-
 ### Fine-tuning on multi-GPU

 **Note**:

examples/gpu/llm/fine-tuning/Phi3/README.md

Lines changed: 0 additions & 78 deletions
@@ -43,29 +43,6 @@ python phi3_ft.py \

 #### Fine-tuning on single card

-Example: Phi-3 Mini 4k full fine-tuning on single card. The default dataset `financial_phrasebank` is loaded in `phi3_ft.py`.
-
-```bash
-export TORCH_LLM_ALLREDUCE=1
-
-export model="microsoft/Phi-3-mini-4k-instruct"
-
-python phi3_ft.py \
-    --model_name_or_path ${model} \
-    --use_flashattn False \
-    --custom_mp True \
-    --max_seq_length 128 \
-    --output_dir="output" \
-    --evaluation_strategy="epoch" \
-    --learning_rate=1e-3 \
-    --auto_find_batch_size=True \
-    --num_train_epochs=1 \
-    --save_steps=500 \
-    --logging_steps=1 \
-    --save_total_limit=8
-```
-
-
 Example: Phi-3 Mini 4k LoRA fine-tuning on single card. The default dataset `financial_phrasebank` is loaded in `phi3_ft.py`.

 ```bash
@@ -95,61 +72,6 @@ python phi3_ft.py \
 The default `fsdp_config.yml` is set with 1 machine with 4 cards 8 tiles, If you are using different setting, please change the `num_processes: 8` accordingly. For example, to use 8 cards 16 tiles, the line in `fsdp_config.yml` should be changed to `num_processes: 16`.


-Example: Phi-3 Mini 4k full fine-tuning.
-
-
-```bash
-export CCL_PROCESS_LAUNCHER=none
-export TORCH_LLM_ALLREDUCE=1
-
-export model="microsoft/Phi-3-mini-4k-instruct"
-
-accelerate launch --config_file "fsdp_config.yaml" phi3_ft.py \
-    --model_name_or_path ${model} \
-    --use_flashattn False \
-    --bf16 True \
-    --max_seq_length 128 \
-    --output_dir="output" \
-    --evaluation_strategy="epoch" \
-    --learning_rate=1e-3 \
-    --gradient_accumulation_steps=1 \
-    --per_device_train_batch_size=8 \
-    --per_device_eval_batch_size=8 \
-    --num_train_epochs=1 \
-    --save_steps=500 \
-    --logging_steps=1 \
-    --save_total_limit=8 2>&1 | tee phi3-mini_ft_fsdp_converge.log
-```
-
-
-Example: Phi-3 Mini 4k LoRA fine-tuning.
-
-
-```bash
-export CCL_PROCESS_LAUNCHER=none
-export TORCH_LLM_ALLREDUCE=1
-
-export model="microsoft/Phi-3-mini-4k-instruct"
-
-accelerate launch --config_file "fsdp_config.yaml" phi3_ft.py \
-    --model_name_or_path ${model} \
-    --use_flashattn False \
-    --bf16 True \
-    --use_peft True \
-    --max_seq_length 128 \
-    --output_dir="output" \
-    --evaluation_strategy="epoch" \
-    --learning_rate=1e-3 \
-    --gradient_accumulation_steps=1 \
-    --per_device_train_batch_size=8 \
-    --per_device_eval_batch_size=8 \
-    --num_train_epochs=1 \
-    --save_steps=500 \
-    --logging_steps=1 \
-    --save_total_limit=8 2>&1 | tee phi3-mini_ft_fsdp_converge.log
-```
-
-
 Example: Phi3-Mini 4k LoRA fine-tuning.


examples/gpu/llm/fine-tuning/Qwen/run_qwen2_fsdp.sh

Lines changed: 0 additions & 26 deletions
@@ -55,31 +55,6 @@ Run_fsdp_dummy_dataset_lora_sequence_length_256() {
     #--optim "adamw_torch_fused"
 }

-Run_fsdp_dummy_dataset_sequence_length_2048() {
-    accelerate launch --config_file "fsdp_config.yaml" qwen2_ft.py \
-        --model_name_or_path $model \
-        --data_path $data \
-        --bf16 True \
-        --output_dir output_qwen \
-        --num_train_epochs 1 \
-        --per_device_train_batch_size 1 \
-        --per_device_eval_batch_size 1 \
-        --gradient_accumulation_steps 1 \
-        --evaluation_strategy "no" \
-        --save_strategy "steps" \
-        --save_steps 2000 \
-        --save_total_limit 10 \
-        --learning_rate 3e-4 \
-        --weight_decay 0.01 \
-        --adam_beta2 0.95 \
-        --warmup_ratio 0.01 \
-        --lr_scheduler_type "cosine" \
-        --logging_steps 1 \
-        --report_to "none" \
-        --model_max_length 2048
-    #--optim "adamw_torch_fused"
-}
-
 Run_fsdp_dummy_dataset_lora_sequence_length_2048() {
     accelerate launch --config_file "fsdp_config.yaml" qwen2_ft.py \
         --model_name_or_path $model \
@@ -108,5 +83,4 @@ Run_fsdp_dummy_dataset_lora_sequence_length_2048() {

 Run_fsdp_dummy_dataset_sequence_length_256
 #Run_fsdp_dummy_dataset_lora_sequence_length_256
-#Run_fsdp_dummy_dataset_sequence_length_2048
 #Run_fsdp_dummy_dataset_lora_sequence_length_2048

examples/gpu/llm/fine-tuning/README.md

Lines changed: 1 addition & 1 deletion
@@ -54,7 +54,7 @@ Here we mainly focus on the memory-constrained fine-tuning on single GPU, and pr

 ### Profile the finetuning

-For profiling the process of finetuning, Apply the `patches/transformers.patch` to transformers v4.41.2 and set the following VARIABLE before finetuning.
+For profiling the process of finetuning, Apply the `patches/transformers.patch` to transformers v4.44.2 and set the following VARIABLE before finetuning.

 ```bash
 export PROFILE=1
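
One plausible way to apply the patch against the newly required transformers v4.44.2 is sketched below; this workflow and the relative patch path are assumptions, not steps documented in this commit:

```bash
# Sketch only: check out transformers v4.44.2 from source and apply the repo's patch.
git clone https://github.com/huggingface/transformers.git
cd transformers
git checkout v4.44.2
git apply ../patches/transformers.patch   # patch path assumed relative to the fine-tuning directory
pip install -e .
export PROFILE=1                          # enable profiling as described above
```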

examples/gpu/llm/inference/run_accuracy.sh

Lines changed: 0 additions & 25 deletions
@@ -12,29 +12,6 @@ Accuracy_lmeval_gpt-j-6b() {
     mv log_acc ${dir}
 }

-
-## Llama-7b
-Accuracy_lmeval_llama-7b() {
-    model=decapoda-research/llama-7b-hf
-    sub_model_name=llama-7b
-    dir=accuracy/${model}/task${task}
-    mkdir -p ${dir}
-    LLM_ACC_TEST=1 python -u run_generation.py -m ${model} --sub-model-name ${sub_model_name} --ipex --dtype float16 --accuracy-only --acc-tasks ${task} 2>&1 | tee log_acc
-    mv log_acc ${dir}
-}
-
-
-## Llama-13b
-Accuracy_lmeval_llama-13b() {
-    model=decapoda-research/llama-13b-hf
-    sub_model_name=llama-13b
-    dir=accuracy/${model}/task${task}
-    mkdir -p ${dir}
-    LLM_ACC_TEST=1 python -u run_generation.py -m ${model} --sub-model-name ${sub_model_name} --ipex --dtype float16 --accuracy-only --acc-tasks ${task} 2>&1 | tee log_acc
-    mv log_acc ${dir}
-}
-
-
 ## Llama2-7b
 Accuracy_lmeval_llama2-7b() {
     model=meta-llama/Llama-2-7b-hf
@@ -84,8 +61,6 @@ main() {
     export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2

     Accuracy_lmeval_gpt-j-6b
-    Accuracy_lmeval_llama-7b
-    Accuracy_lmeval_llama-13b
     Accuracy_lmeval_llama2-7b
     Accuracy_lmeval_llama2-13b
     Accuracy_lmeval_opt-6.7b
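
Each per-model helper in this script follows the same shape as the removed Llama-7b/13b ones. For reference, the surviving Llama2-7b entry expands roughly as below; the `sub_model_name` value and the `${task}` variable are assumptions inferred from the pattern, since the diff shows only part of that function:

```bash
# Hedged illustration of the remaining per-model pattern (Llama2-7b).
model=meta-llama/Llama-2-7b-hf
sub_model_name=llama2-7b          # assumed value; only the model id appears in the diff context
dir=accuracy/${model}/task${task} # ${task} is set by the surrounding script, not shown here
mkdir -p ${dir}
LLM_ACC_TEST=1 python -u run_generation.py -m ${model} --sub-model-name ${sub_model_name} \
    --ipex --dtype float16 --accuracy-only --acc-tasks ${task} 2>&1 | tee log_acc
mv log_acc ${dir}
```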

examples/gpu/llm/inference/run_accuracy_ds.sh

Lines changed: 0 additions & 24 deletions
@@ -14,28 +14,6 @@ Accuracy_lmeval_gpt-j-6b() {
 }


-## Llama-7b
-Accuracy_lmeval_llama-7b() {
-    model=decapoda-research/llama-7b-hf
-    sub_model_name=llama-7b
-    dir=accuracy/${model}/task${task}_ranknum2
-    mkdir -p ${dir}
-    LLM_ACC_TEST=1 mpirun -np 2 --prepend-rank python -u run_generation_with_deepspeed.py -m ${model} --sub-model-name ${sub_model_name} --ipex --dtype float16 --accuracy-only --acc-tasks ${task} 2>&1 | tee log_acc_ds
-    mv log_acc_ds ${dir}
-}
-
-
-## Llama-13b
-Accuracy_lmeval_llama-13b() {
-    model=decapoda-research/llama-13b-hf
-    sub_model_name=llama-13b
-    dir=accuracy/${model}/task${task}_ranknum2
-    mkdir -p ${dir}
-    LLM_ACC_TEST=1 mpirun -np 2 --prepend-rank python -u run_generation_with_deepspeed.py -m ${model} --sub-model-name ${sub_model_name} --ipex --dtype float16 --accuracy-only --acc-tasks ${task} 2>&1 | tee log_acc_ds
-    mv log_acc_ds ${dir}
-}
-
-
 ## Llama2-7b
 Accuracy_lmeval_llama2-7b() {
     model=meta-llama/Llama-2-7b-hf
@@ -140,8 +118,6 @@ main() {
     export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2

     Accuracy_lmeval_gpt-j-6b
-    Accuracy_lmeval_llama-7b
-    Accuracy_lmeval_llama-13b
     Accuracy_lmeval_llama2-7b
     Accuracy_lmeval_llama2-13b
     Accuracy_lmeval_llama2-34b
