This repository was archived by the owner on Dec 14, 2023. It is now read-only.

Commit 9c85d2d

Merge pull request #26 from ExponentialML/version/version2
Text To Video Finetuning Version 2

## Changes and Updates

- [x] High quality VRAM config.
- [x] Add text encoder training.
- [x] Allow training on low VRAM systems.
- [x] Allow single image training.
- [x] Train with image captions.
- [x] Train with video captions in folder.
- [x] Gradient checkpointing support.
- [x] Time agnostic training.
- [x] Add aspect ratio bucketing.
- [x] Verify installation.
- [x] Add hybrid LoRA for training.
- [x] Add latent caching.
- [x] Add optimizer agnostic settings in config.
- [x] Soup up unet finetuner for readability and efficiency.
- [x] Update README to reflect training.
2 parents 25697f9 + 4b0be8a commit 9c85d2d

15 files changed (+1442, -599 lines)

README.md

Lines changed: 64 additions & 142 deletions
@@ -1,199 +1,121 @@
 # Text-To-Video-Finetuning
 ## Finetune ModelScope's Text To Video model using Diffusers 🧨
-***(This is a WIP)***
+
+[output.webm](https://user-images.githubusercontent.com/59846140/230748413-fe91e90b-94b9-49ea-97ec-250469ee9472.webm)

 ### Updates
+- **2023-4-8**: Version 2 is released!
 - **2023-3-29**: Added gradient checkpointing support.
 - **2023-3-27**: Support for using Scaled Dot Product Attention for Torch 2.0 users.

 ## Getting Started
-### Requirements

-#### Installation
+### Requirements & Installation

-#### Repository Requirements
 ```bash
 git clone https://github.com/ExponentialML/Text-To-Video-Finetuning.git
 cd Text-To-Video-Finetuning
 git lfs install
 git clone https://huggingface.co/damo-vilab/text-to-video-ms-1.7b ./models/model_scope_diffusers/
 ```
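
If `git lfs` was not set up before cloning, the weights folder can end up holding tiny LFS pointer files instead of real checkpoints. The quick check below is not part of the repository; the expected subfolders are simply the standard diffusers layout of the ModelScope weights cloned above, and `model_dir` assumes the path used in the clone command.

```python
from pathlib import Path

model_dir = Path("./models/model_scope_diffusers")
expected = ["model_index.json", "unet", "vae", "text_encoder", "tokenizer", "scheduler"]

# Confirm the standard diffusers layout is present.
for name in expected:
    print(f"{name}: {'ok' if (model_dir / name).exists() else 'MISSING'}")

# Real checkpoints are hundreds of MB; a few hundred bytes usually means an LFS pointer.
for weight in model_dir.rglob("*.bin"):
    print(f"{weight.relative_to(model_dir)}: {weight.stat().st_size / 1e6:.1f} MB")
```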

-#### Create Conda Environment
+### Create Conda Environment (Optional)
+It is recommended to install Anaconda.
+
+**Windows Installation:** https://docs.anaconda.com/anaconda/install/windows/
+
+**Linux Installation:** https://docs.anaconda.com/anaconda/install/linux/
+
 ```bash
 conda create -n text2video-finetune python=3.10
 conda activate text2video-finetune
 ```

-#### Python Requirements
+### Python Requirements
 ```bash
 pip install -r requirements.txt
 ```
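
After installing the requirements, a quick sanity check of the Torch, CUDA, and attention setup described in the Hardware section below can save debugging time. This snippet is not part of the repository; it only prints what is available in your environment.

```python
import torch

print("Torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())

# Torch >= 2.0 ships scaled dot product attention, so Xformers becomes optional.
print("SDPA available:", hasattr(torch.nn.functional, "scaled_dot_product_attention"))

# Xformers is mainly useful on Torch 1.13.x for memory-efficient attention.
try:
    import xformers  # noqa: F401
    print("Xformers: installed")
except ImportError:
    print("Xformers: not installed (fine on Torch >= 2.0)")
```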

+## Hardware
+
 All code was tested on Python 3.10.9 with Torch versions 1.13.1 and 2.0.

-You could potentially save memory by installing xformers and enabling it in your config. Please follow the instructions at the following repository for details on how to install.
+It is **highly recommended** to install Torch >= 2.0. This way, you don't have to install Xformers *or* worry about memory performance.
+
+If you don't have Xformers installed, you can follow the instructions here: https://github.com/facebookresearch/xformers

-https://github.com/facebookresearch/xformers

-## Hardware
 It is recommended to use an RTX 3090, but you should be able to train on GPUs with <= 16GB of VRAM with the following (see the sketch after this list):
-- Validation turned off
+- Validation turned off.
 - Xformers or Torch 2.0 Scaled Dot-Product Attention
-- gradient checkpointing enabled.
+- Gradient checkpointing enabled.
 - Resolution of 256.
+- Enable all LoRA options.
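
The options above correspond to standard diffusers switches. The repository drives them through its YAML config and `train.py`; the sketch below is only meant to show what those switches typically map to, assuming the locally cloned ModelScope weights.

```python
import torch
from diffusers import DiffusionPipeline

# Load the weights cloned during installation in fp16 to reduce VRAM usage.
pipe = DiffusionPipeline.from_pretrained(
    "./models/model_scope_diffusers/", torch_dtype=torch.float16
)

# Gradient checkpointing trades extra compute for a much smaller activation footprint.
pipe.unet.enable_gradient_checkpointing()

# Torch 2.0 typically uses scaled dot product attention on its own; on older Torch,
# Xformers provides the memory-efficient attention path instead.
if not hasattr(torch.nn.functional, "scaled_dot_product_attention"):
    pipe.enable_xformers_memory_efficient_attention()
```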

-## Usage
+## Preprocessing your data

-### Preprocessing your data
-All videos were preprocessed using the script [here](https://github.com/ExponentialML/Video-BLIP2-Preprocessor) with automatic BLIP2 captions. Please follow the instructions there.
+### Using Captions

-If you wish to use a custom dataloader (for instance, a folder of mp4's and captions), you're free to update the dataloader [here](https://github.com/ExponentialML/Text-To-Video-Finetuning/blob/d72e34cfbd91d2a62c07172f9ef079ca5cd651b2/utils/dataset.py#L83).
+You can use caption files when training on images or videos. Simply place them into a folder like so:

-Feel free to share your dataloaders for others to use! It would be much appreciated.
+**Images**: `/images/img.png` | `/images/img.txt`
+**Videos**: `/videos/vid.mp4` | `/videos/vid.txt`

-### Finetuning using a training JSON
-```python
-python train.py --config ./configs/my_config.yaml
-```
+Then in your config, enable `-folder` and set the root directory to the folder containing these files (a small layout check follows below).
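
As a minimal illustration of that layout (not part of the repository), every image or video is expected to sit next to a `.txt` file with the same name containing its caption. The `root` path below is a placeholder for your own folder.

```python
from pathlib import Path

root = Path("./videos")  # placeholder: the folder your config points at
media_exts = {".mp4", ".png", ".jpg"}

for media in sorted(p for p in root.iterdir() if p.suffix.lower() in media_exts):
    caption = media.with_suffix(".txt")
    if caption.exists():
        print(f"{media.name} -> {caption.read_text().strip()[:60]}")
    else:
        print(f"{media.name} -> MISSING caption file {caption.name}")
```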

-### Finetuning using a training JSON and HQ settings
-```python
-python train.py --config ./configs/my_config_hq.yaml
-```
+### Process Automatically
+
+You can automatically caption the videos using the [Video-BLIP2-Preprocessor Script](https://github.com/ExponentialML/Video-BLIP2-Preprocessor).
+
+## Configuration

-### Finetuning on a single video
+The configuration uses a YAML config borrowed from the [Tune-A-Video](https://github.com/showlab/Tune-A-Video) repository.
+
+All configuration details are placed in `configs/v2/train_config.yaml`. Each parameter has a definition explaining what it does.
+
+### How would you recommend I proceed with making a config with my data?
+
+I highly recommend (I did this myself) going to `configs/v2/train_config.yaml`, making a copy of it, and naming it whatever you wish, e.g. `my_train.yaml`.
+
+Then, follow each line and configure it for your specific use case.
+
+The instructions should be clear enough to get you up and running with your dataset, but feel free to ask any questions on the discussion board.
+
+## Finetune.
 ```python
-python train.py --config ./configs/single_video_config.yaml
+python train.py --config ./configs/v2/train_config.yaml
 ```
 ---

-### Training Results
+## Training Results
+
 With a lot of data, you can expect training results to show at roughly 2500 steps at a constant learning rate of 5e-6.
-Play around with learning rates to see what works best for you (5e-6, 3e-5, 1e-4).

 When finetuning on a single video, you should see results in half as many steps.

-After training, you should see your results in your output directory. By default, it should be placed at the script root under `./outputs/train_<date>`
+After training, you should see your results in your output directory.

-## Configuration
-The configuration uses a YAML config borrowed from the [Tune-A-Video](https://github.com/showlab/Tune-A-Video) repository. Here's the gist of how it works.
-
-<details>
-
-```yaml
-
-# The path to your diffusers folder. The structure should look exactly like the huggingface one, with folders and json configs.
-pretrained_model_path: "diffusers_path"
-
-# The directory where your training runs (and samples) will be saved.
-output_dir: "./outputs"
-
-# Enable training the text encoder or not.
-train_text_encoder: False
-
-# The basis of where your training data is stored.
-train_data:
-
-# The path to your JSON file using the steps above.
-json_path: "json/train.json"
-
-# Leave this as true for now. Custom configurations are currently not supported.
-preprocessed: True
-
-# Number of frames to sample from the videos. The higher this number, the more VRAM is required (usage is similar to batch size).
-n_sample_frames: 4
-
-# Choose whether or not to ignore the frame data from the preprocessing step and shuffle the frames.
-shuffle_frames: False
-
-# The height and width of training data.
-width: 256
-height: 256
-
-# At what frame to start the video sampling. Ignores preprocessing frames.
-sample_start_idx: 0
-
-# The rate of sampling frames. This effectively "skips" frames, making the video appear faster or slower.
-sample_frame_rate: 1
-
-# The key of the video data name. This is to align with any preprocess script changes.
-vid_data_key: "video_path"
-
-# The video path and prompt for that video for single video training.
-# If enabled, the JSON path is ignored.
-single_video_path: ""
-single_video_prompt: ""
-
-# This is the data for validation during training. Prompt will override training data prompts.
-sample_preview: True
-prompt: ""
-num_frames: 16
-width: 256
-height: 256
-num_inference_steps: 50
-guidance_scale: 9
-
-# Training parameters
-learning_rate: 5e-6
-adam_weight_decay: 0
-train_batch_size: 1
-max_train_steps: 50000
-
-# Allow checkpointing during training (save once every X steps).
-checkpointing_steps: 10000
-
-# How many steps during training before we create a sample.
-validation_steps: 100
-
-# The parameters to unfreeze. As it is now, all attention layers are unfrozen.
-# Unfreezing resnet layers would lead to better quality, but consumes a very large amount of VRAM.
-trainable_modules:
-- "attn1"
-- "attn2"
-
-# Seed for sampling validation.
-seed: 64
-
-# Use mixed precision for better memory allocation.
-mixed_precision: "fp16"
-
-# This seems to be incompatible at the moment in my testing.
-use_8bit_adam: False
-
-# Currently has no effect.
-enable_xformers_memory_efficient_attention: True
-
-```
-</details>
+By default, it should be placed at the script root under `./outputs/train_<date>`.

-## Trainable modules (Advanced Usage)
-The `trainable_modules` parameter is a list, set by the user, that tells the model which layers to unfreeze.
+From my testing, I recommend the following (see the sketch after this list):

-Typically you want to train the cross attention layers. The more layers you unfreeze, the higher the VRAM usage. In my testing, here is what I see.
+- Keep the number of sample frames between 4 and 16. Use long frame generation for inference, *not* training.
+- If you have a low VRAM system, you can try single frame training or just use `n_sample_frames: 2`.
+- Using a learning rate of about `5e-6` seems to work well in all cases.
+- The best quality will always come from training the text encoder. If you're limited on VRAM, disabling it can help.
+- Leave some VRAM headroom to avoid OOM errors when saving models during training.
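
As a rough illustration of the recommendations above (not an official config): the field names below are taken from the V1 example config removed earlier in this diff, so double-check every key against `configs/v2/train_config.yaml` before using any of them.

```python
# Hypothetical overrides reflecting the recommendations above; V2 may name or
# nest these fields differently, so treat this purely as a reading aid.
recommended = {
    "n_sample_frames": 8,        # keep between 4 and 16; use 1-2 on low VRAM systems
    "learning_rate": 5e-6,       # works well in most cases
    "train_text_encoder": True,  # best quality; set False if VRAM is tight
    "mixed_precision": "fp16",   # helps leave headroom so checkpoint saves don't OOM
}

for key, value in recommended.items():
    print(f"{key}: {value}")
```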

-`"attentions"`: Uses a lot of VRAM, but with a high probability of quality results.
+## Developing

-`"attn1", "attn2"`: Uses a good amount of VRAM, but allows for processing more frames. Good quality finetunes can happen with these settings.
+Please feel free to open a pull request if you have a feature implementation or suggestion! I welcome all contributions.

-`"attn1.to_out", "attn2.to_out"`: This only trains the output linear layers on the cross attention layers. This seems to be a good tradeoff for VRAM, with great results at a learning rate of 1e-4.
+I've tried to make the code fairly modular so you can hack away, see how the code works, and what the implementations do.

-## Running
-After training, you can easily run your model by doing the following.
+## Deprecation
+If you want to use the V1 repository, you can use the branch [here](https://github.com/ExponentialML/Text-To-Video-Finetuning/tree/version/first-release).

-```python
-import torch
-from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
-from diffusers.utils import export_to_video
+## Shoutouts

-my_trained_model_path = "./trained_model_path/"
-pipe = DiffusionPipeline.from_pretrained(my_trained_model_path, torch_dtype=torch.float16, variant="fp16")
-pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
-pipe.enable_model_cpu_offload()
-
-prompt = "Your prompt based on train data"
-video_frames = pipe(prompt, num_inference_steps=25).frames
-
-out_file = "./my_video.mp4"
-video_path = export_to_video(video_frames, out_file)
-```
+- [Showlab](https://github.com/showlab/Tune-A-Video) and [bryandlee](https://github.com/bryandlee/Tune-A-Video) for their Tune-A-Video contribution that made this much easier.
+- [lucidrains](https://github.com/lucidrains) for their implementations around video diffusion.
+- [cloneofsimo](https://github.com/cloneofsimo) for their diffusers implementation of LoRA.

configs/my_config.yaml

Lines changed: 0 additions & 44 deletions
This file was deleted.

configs/my_config_hq.yaml

Lines changed: 0 additions & 42 deletions
This file was deleted.
