# Text-To-Video-Finetuning
## Finetune ModelScope's Text To Video model using Diffusers 🧨

[output.webm](https://user-images.githubusercontent.com/59846140/230748413-fe91e90b-94b9-49ea-97ec-250469ee9472.webm)

### Updates
- **2023-4-8**: Version 2 is released!
- **2023-3-29**: Added gradient checkpointing support.
- **2023-3-27**: Support for using Scaled Dot Product Attention for Torch 2.0 users.

## Getting Started

### Requirements & Installation

```bash
git clone https://github.com/ExponentialML/Text-To-Video-Finetuning.git
cd Text-To-Video-Finetuning
git lfs install
git clone https://huggingface.co/damo-vilab/text-to-video-ms-1.7b ./models/model_scope_diffusers/
```

### Create Conda Environment (Optional)
It is recommended to install Anaconda.

**Windows Installation:** https://docs.anaconda.com/anaconda/install/windows/

**Linux Installation:** https://docs.anaconda.com/anaconda/install/linux/

```bash
conda create -n text2video-finetune python=3.10
conda activate text2video-finetune
```

### Python Requirements
```bash
pip install -r requirements.txt
```

## Hardware

All code was tested on Python 3.10.9 & Torch version 1.13.1 & 2.0.

It is **highly recommended** to install Torch >= 2.0. That way, you don't have to install Xformers *or* worry about memory performance.

If you aren't on Torch 2.0 and want to use Xformers instead, you can follow the installation instructions here: https://github.com/facebookresearch/xformers
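
To quickly check your Torch version and whether scaled dot product attention is available (a minimal sanity check, nothing repo-specific):

```python
import torch
import torch.nn.functional as F

# Torch >= 2.0 provides scaled_dot_product_attention natively,
# which makes Xformers optional.
print("Torch version:", torch.__version__)
print("SDPA available:", hasattr(F, "scaled_dot_product_attention"))
```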

An RTX 3090 is recommended, but you should be able to train on GPUs with <= 16GB of VRAM by using:
- Validation turned off.
- Xformers or Torch 2.0 Scaled Dot-Product Attention.
- Gradient checkpointing enabled (see the sketch below).
- A resolution of 256.
- All LoRA options enabled.
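
The checkpointing and attention options above map onto standard 🧨 Diffusers model methods. A rough, illustrative sketch (not taken from this repo's training code; the training config is where you actually toggle these):

```python
# Illustrative only: the Diffusers calls behind the memory-saving options above.
from diffusers import UNet3DConditionModel

# Path assumes the model was cloned as shown in the installation step.
unet = UNet3DConditionModel.from_pretrained(
    "./models/model_scope_diffusers", subfolder="unet"
)

unet.enable_gradient_checkpointing()  # trades compute for a large VRAM saving

# On Torch < 2.0 with xformers installed:
# unet.enable_xformers_memory_efficient_attention()
```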

## Preprocessing your data

### Using Captions

You can use caption files when training on images or videos. Place each caption in a `.txt` file alongside the media file it describes, like so:

**Images**: `/images/img.png`, `/images/img.txt`

**Videos**: `/videos/vid.mp4`, `/videos/vid.txt`

Then, in your config, make sure `-folder` is enabled, along with the root directory containing the files.
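
As a concrete illustration of that layout, a loader could pair media with captions roughly like this (a hypothetical helper, not the repo's actual dataloader):

```python
from pathlib import Path

def pair_media_with_captions(root: str, exts=(".mp4", ".png", ".jpg")):
    """Match each media file with a same-named .txt caption, if one exists."""
    pairs = []
    for media in sorted(Path(root).rglob("*")):
        if media.suffix.lower() in exts:
            caption_file = media.with_suffix(".txt")
            caption = caption_file.read_text().strip() if caption_file.exists() else ""
            pairs.append((str(media), caption))
    return pairs

# Example: pair_media_with_captions("./videos") -> [("videos/vid.mp4", "a caption"), ...]
```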

### Process Automatically

You can automatically caption the videos using the [Video-BLIP2-Preprocessor Script](https://github.com/ExponentialML/Video-BLIP2-Preprocessor).

## Configuration

The configuration uses a YAML config borrowed from the [Tune-A-Video](https://github.com/showlab/Tune-A-Video) repositories.

All configuration details are placed in `configs/v2/train_config.yaml`. Each parameter has a definition for what it does.

### How would you recommend I proceed with making a config with my data?

I highly recommend (I did this myself) going to `configs/v2/train_config.yaml`, making a copy of it, and naming it whatever you wish (for example, `my_train.yaml`).

Then, follow each line and configure it for your specific use case.

The instructions should be clear enough to get you up and running with your dataset, but feel free to ask any questions in the discussion board.

## Finetune
```bash
python train.py --config ./configs/v2/train_config.yaml
```
---

## Training Results

With a lot of data, you can expect training results to show at roughly 2500 steps at a constant learning rate of 5e-6.

When finetuning on a single video, you should see results in half as many steps.

After training, you should see your results in your output directory.

By default, it should be placed at the script root under `./outputs/train_<date>`.

From my testing, I recommend:

- Keep the number of sample frames between 4-16. Use long frame generation for inference, *not* training.
- If you have a low VRAM system, you can try single frame training or just use `n_sample_frames: 2`.
- A learning rate of about `5e-6` seems to work well in all cases.
- The best quality will always come from training the text encoder. If you're limited on VRAM, disabling it can help.
- Leave some VRAM headroom to avoid OOM errors when saving models during training.
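
Once you have a finetuned checkpoint in your output directory, a minimal inference sketch with 🧨 Diffusers looks roughly like this (the model path and prompt are placeholders):

```python
import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import export_to_video

# Placeholder: point this at your trained diffusers folder (e.g. the train_<date> output).
my_trained_model_path = "./trained_model_path/"

pipe = DiffusionPipeline.from_pretrained(my_trained_model_path, torch_dtype=torch.float16)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()  # helps on low-VRAM GPUs

prompt = "Your prompt based on train data"
video_frames = pipe(prompt, num_inference_steps=25).frames

video_path = export_to_video(video_frames, "./my_video.mp4")
```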

## Developing

Please feel free to open a pull request if you have a feature implementation or suggestion! I welcome all contributions.

I've tried to make the code fairly modular, so you can hack away, see how the code works, and what the implementations do.

## Deprecation

If you want to use the V1 repository, you can use the branch [here](https://github.com/ExponentialML/Text-To-Video-Finetuning/tree/version/first-release).

## Shoutouts

- [Showlab](https://github.com/showlab/Tune-A-Video) and [bryandlee](https://github.com/bryandlee/Tune-A-Video) for their Tune-A-Video contribution that made this much easier.
- [lucidrains](https://github.com/lucidrains) for their implementations around video diffusion.
- [cloneofsimo](https://github.com/cloneofsimo) for their diffusers implementation of LoRA.