Commit 166460f

Add support for Ministral 3 (#1474)

* Add support for Mistral3ForConditionalGeneration
* Add both ministral and ministral3 model types
* Bump jinja.js
* Formatting
* Update list of supported models
* Update list of supported models
1 parent eb7dd02 commit 166460f

10 files changed (+124, -6 lines)

README.md

Lines changed: 3 additions & 0 deletions
@@ -376,7 +376,10 @@ You can refine your search by selecting the task you're interested in (e.g., [te
 1. **[MusicGen](https://huggingface.co/docs/transformers/model_doc/musicgen)** (from Meta) released with the paper [Simple and Controllable Music Generation](https://huggingface.co/papers/2306.05284) by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez.
 1. **[MGP-STR](https://huggingface.co/docs/transformers/model_doc/mgp-str)** (from Alibaba Research) released with the paper [Multi-Granularity Prediction for Scene Text Recognition](https://huggingface.co/papers/2209.03592) by Peng Wang, Cheng Da, and Cong Yao.
 1. **[Mimi](https://huggingface.co/docs/transformers/model_doc/mimi)** (from Kyutai) released with the paper [Moshi: a speech-text foundation model for real-time dialogue](https://huggingface.co/papers/2410.00037) by Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave and Neil Zeghidour.
+1. **[Ministral](https://huggingface.co/docs/transformers/model_doc/ministral)** (from Mistral AI) by The [Mistral AI](https://mistral.ai) team.
+1. **[Ministral3](https://huggingface.co/docs/transformers/model_doc/ministral3)** (from Mistral AI) by The [Mistral AI](https://mistral.ai) team.
 1. **[Mistral](https://huggingface.co/docs/transformers/model_doc/mistral)** (from Mistral AI) by The [Mistral AI](https://mistral.ai) team: Albert Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed.
+1. **[Mistral3](https://huggingface.co/docs/transformers/model_doc/mistral3)** (from Mistral AI) by The [Mistral AI](https://mistral.ai) team.
 1. **[MMS](https://huggingface.co/docs/transformers/model_doc/mms)** (from Facebook) released with the paper [Scaling Speech Technology to 1,000+ Languages](https://huggingface.co/papers/2305.13516) by Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli.
 1. **[MobileBERT](https://huggingface.co/docs/transformers/model_doc/mobilebert)** (from CMU/Google Brain) released with the paper [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://huggingface.co/papers/2004.02984) by Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou.
 1. **MobileCLIP** (from Apple) released with the paper [MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training](https://huggingface.co/papers/2311.17049) by Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri, Raviteja Vemulapalli, Oncel Tuzel.

docs/snippets/6_supported-models.snippet

Lines changed: 3 additions & 0 deletions
@@ -90,7 +90,10 @@
 1. **[MusicGen](https://huggingface.co/docs/transformers/model_doc/musicgen)** (from Meta) released with the paper [Simple and Controllable Music Generation](https://huggingface.co/papers/2306.05284) by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez.
 1. **[MGP-STR](https://huggingface.co/docs/transformers/model_doc/mgp-str)** (from Alibaba Research) released with the paper [Multi-Granularity Prediction for Scene Text Recognition](https://huggingface.co/papers/2209.03592) by Peng Wang, Cheng Da, and Cong Yao.
 1. **[Mimi](https://huggingface.co/docs/transformers/model_doc/mimi)** (from Kyutai) released with the paper [Moshi: a speech-text foundation model for real-time dialogue](https://huggingface.co/papers/2410.00037) by Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave and Neil Zeghidour.
+1. **[Ministral](https://huggingface.co/docs/transformers/model_doc/ministral)** (from Mistral AI) by The [Mistral AI](https://mistral.ai) team.
+1. **[Ministral3](https://huggingface.co/docs/transformers/model_doc/ministral3)** (from Mistral AI) by The [Mistral AI](https://mistral.ai) team.
 1. **[Mistral](https://huggingface.co/docs/transformers/model_doc/mistral)** (from Mistral AI) by The [Mistral AI](https://mistral.ai) team: Albert Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed.
+1. **[Mistral3](https://huggingface.co/docs/transformers/model_doc/mistral3)** (from Mistral AI) by The [Mistral AI](https://mistral.ai) team.
 1. **[MMS](https://huggingface.co/docs/transformers/model_doc/mms)** (from Facebook) released with the paper [Scaling Speech Technology to 1,000+ Languages](https://huggingface.co/papers/2305.13516) by Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli.
 1. **[MobileBERT](https://huggingface.co/docs/transformers/model_doc/mobilebert)** (from CMU/Google Brain) released with the paper [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://huggingface.co/papers/2004.02984) by Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou.
 1. **MobileCLIP** (from Apple) released with the paper [MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training](https://huggingface.co/papers/2311.17049) by Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri, Raviteja Vemulapalli, Oncel Tuzel.

package-lock.json

Lines changed: 4 additions & 4 deletions
Some generated files are not rendered by default.

package.json

Lines changed: 1 addition & 1 deletion
@@ -55,7 +55,7 @@
   },
   "homepage": "https://github.com/huggingface/transformers.js#readme",
   "dependencies": {
-    "@huggingface/jinja": "^0.5.1",
+    "@huggingface/jinja": "^0.5.3",
     "onnxruntime-node": "1.21.0",
     "onnxruntime-web": "1.22.0-dev.20250409-89f8206ba4",
     "sharp": "^0.34.1"

src/configs.js

Lines changed: 3 additions & 0 deletions
@@ -75,6 +75,7 @@ function getNormalizedConfig(config) {
         case 'voxtral':
         case 'smolvlm':
         case 'gemma3n':
+        case 'mistral3':
             // @ts-expect-error TS2339
             init_normalized_config = getNormalizedConfig(config.text_config);
             break;
@@ -145,6 +146,8 @@ function getNormalizedConfig(config) {
         case 'glm':
         case 'helium':
         case 'ernie4_5':
+        case 'ministral':
+        case 'ministral3':
             mapping['num_heads'] = 'num_key_value_heads';
             mapping['num_layers'] = 'num_hidden_layers';
             mapping['dim_kv'] = 'head_dim';
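
The two hunks above can be read together: `mistral3` configs are normalized through their nested `text_config`, while `ministral` and `ministral3` take their normalized values from the three keys in the mapping. The following is a minimal sketch of that behavior, not the library's implementation. The function name `normalizeConfigSketch`, the example values, and the assumption that a `mistral3` config wraps a `ministral3` text config are all hypothetical.

```javascript
// Simplified, hypothetical stand-in for the two cases touched in getNormalizedConfig.
// Only the three mapping entries and the text_config delegation come from the diff.
function normalizeConfigSketch(config) {
    // 'mistral3' is treated as a wrapper: decoder settings are read from text_config.
    if (config.model_type === 'mistral3') {
        return normalizeConfigSketch(config.text_config);
    }

    // Normalized key -> source key in the model config (as shown in the hunk).
    const mapping = {
        num_heads: 'num_key_value_heads',
        num_layers: 'num_hidden_layers',
        dim_kv: 'head_dim',
    };

    const normalized = {};
    for (const [target, source] of Object.entries(mapping)) {
        normalized[target] = config[source];
    }
    return normalized;
}

// Example values are made up for illustration.
const textConfig = {
    model_type: 'ministral3',
    num_key_value_heads: 8,
    num_hidden_layers: 26,
    head_dim: 128,
};
const direct = normalizeConfigSketch(textConfig);
const nested = normalizeConfigSketch({ model_type: 'mistral3', text_config: textConfig });
console.log(direct, nested);
```

Both calls yield the same normalized object, which is the point of delegating to `text_config` for the multimodal wrapper.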

src/models.js

Lines changed: 21 additions & 0 deletions
@@ -3866,6 +3866,8 @@ export class LlavaQwen2ForCausalLM extends LlavaPreTrainedModel {
     }
 }
 
+export class Mistral3ForConditionalGeneration extends LlavaQwen2ForCausalLM { }
+
 export class Gemma3nPreTrainedModel extends PreTrainedModel {
     forward_params = [
         'input_ids',
@@ -6948,6 +6950,20 @@ export class MistralModel extends MistralPreTrainedModel { }
 export class MistralForCausalLM extends MistralPreTrainedModel { }
 //////////////////////////////////////////////////
 
+//////////////////////////////////////////////////
+// Ministral models
+export class MinistralPreTrainedModel extends PreTrainedModel { }
+export class MinistralModel extends MinistralPreTrainedModel { }
+export class MinistralForCausalLM extends MinistralPreTrainedModel { }
+//////////////////////////////////////////////////
+
+//////////////////////////////////////////////////
+// Ministral3 models
+export class Ministral3PreTrainedModel extends PreTrainedModel { }
+export class Ministral3Model extends Ministral3PreTrainedModel { }
+export class Ministral3ForCausalLM extends Ministral3PreTrainedModel { }
+//////////////////////////////////////////////////
+
 //////////////////////////////////////////////////
 // ERNIE-4.5 models
 export class Ernie4_5PreTrainedModel extends PreTrainedModel { }
@@ -8041,6 +8057,8 @@ const MODEL_MAPPING_NAMES_DECODER_ONLY = new Map([
     ['mpt', ['MptModel', MptModel]],
     ['opt', ['OPTModel', OPTModel]],
     ['mistral', ['MistralModel', MistralModel]],
+    ['ministral', ['MinistralModel', MinistralModel]],
+    ['ministral3', ['Ministral3Model', Ministral3Model]],
     ['ernie4_5', ['Ernie4_5Model', Ernie4_5Model]],
     ['starcoder2', ['Starcoder2Model', Starcoder2Model]],
     ['falcon', ['FalconModel', FalconModel]],
@@ -8155,6 +8173,8 @@ const MODEL_FOR_CAUSAL_LM_MAPPING_NAMES = new Map([
     ['opt', ['OPTForCausalLM', OPTForCausalLM]],
     ['mbart', ['MBartForCausalLM', MBartForCausalLM]],
     ['mistral', ['MistralForCausalLM', MistralForCausalLM]],
+    ['ministral', ['MinistralForCausalLM', MinistralForCausalLM]],
+    ['ministral3', ['Ministral3ForCausalLM', Ministral3ForCausalLM]],
     ['ernie4_5', ['Ernie4_5ForCausalLM', Ernie4_5ForCausalLM]],
     ['starcoder2', ['Starcoder2ForCausalLM', Starcoder2ForCausalLM]],
     ['falcon', ['FalconForCausalLM', FalconForCausalLM]],
@@ -8228,6 +8248,7 @@ const MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES = new Map([
     ['paligemma', ['PaliGemmaForConditionalGeneration', PaliGemmaForConditionalGeneration]],
     ['llava_qwen2', ['LlavaQwen2ForCausalLM', LlavaQwen2ForCausalLM]],
     ['gemma3n', ['Gemma3nForConditionalGeneration', Gemma3nForConditionalGeneration]],
+    ['mistral3', ['Mistral3ForConditionalGeneration', Mistral3ForConditionalGeneration]],
 ]);
 
 const MODEL_FOR_AUDIO_TEXT_TO_TEXT_MAPPING_NAMES = new Map([
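
The mapping entries added above pair a `model_type` string with a `[className, class]` tuple. The code that consumes these maps is not part of this diff, so the following is only a sketch of how such a map could be consulted. The placeholder classes and the helper `resolveModelClass` are hypothetical and stand in for the real ones.

```javascript
// Placeholder classes standing in for the real model classes (assumption).
class MinistralForCausalLM { }
class Ministral3ForCausalLM { }

// Same shape as the entries added in the diff: model_type -> [name, class].
const CAUSAL_LM_MAPPING_SKETCH = new Map([
    ['ministral', ['MinistralForCausalLM', MinistralForCausalLM]],
    ['ministral3', ['Ministral3ForCausalLM', Ministral3ForCausalLM]],
]);

// Hypothetical helper: look up the class registered for a config's model_type.
function resolveModelClass(model_type) {
    const entry = CAUSAL_LM_MAPPING_SKETCH.get(model_type);
    if (!entry) {
        throw new Error(`Unsupported model type: ${model_type}`);
    }
    const [, modelClass] = entry;
    return modelClass;
}

const resolved = resolveModelClass('ministral3');
console.log(resolved.name);
```

Registering both `ministral` and `ministral3` means either `model_type` value resolves to a class rather than falling through to the unsupported-type error.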

src/models/image_processors.js

Lines changed: 1 addition & 0 deletions
@@ -27,6 +27,7 @@ export * from './nougat/image_processing_nougat.js'
 export * from './owlv2/image_processing_owlv2.js'
 export * from './owlvit/image_processing_owlvit.js'
 export * from './phi3_v/image_processing_phi3_v.js'
+export * from './pixtral/image_processing_pixtral.js'
 export * from './pvt/image_processing_pvt.js'
 export * from './qwen2_vl/image_processing_qwen2_vl.js'
 export * from './rt_detr/image_processing_rt_detr.js'

src/models/pixtral/image_processing_pixtral.js

Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
+import {
+    ImageProcessor,
+} from "../../base/image_processors_utils.js";
+
+export class PixtralImageProcessor extends ImageProcessor {
+
+    /** @type {ImageProcessor['get_resize_output_image_size']} */
+    get_resize_output_image_size(image, size) {
+        const { longest_edge } = size;
+        if (longest_edge === undefined) {
+            throw new Error("size must contain 'longest_edge'");
+        }
+
+        const [srcWidth, srcHeight] = image.size;
+
+        const ratio = Math.max(srcWidth, srcHeight) / longest_edge;
+
+        let newWidth = srcWidth;
+        let newHeight = srcHeight;
+        if (ratio > 1) {
+            newWidth = Math.floor(srcWidth / ratio);
+            newHeight = Math.floor(srcHeight / ratio);
+        }
+
+        // @ts-expect-error TS2339
+        const { patch_size, spatial_merge_size } = this.config;
+        if (!spatial_merge_size) {
+            throw new Error("config must contain 'spatial_merge_size'");
+        }
+        const real_patch_size = patch_size * spatial_merge_size;
+
+        // Calculate number of tokens
+        const num_width_tokens = Math.floor((newWidth - 1) / real_patch_size) + 1;
+        const num_height_tokens = Math.floor((newHeight - 1) / real_patch_size) + 1;
+
+        return [num_width_tokens * real_patch_size, num_height_tokens * real_patch_size];
+    }
+}
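
To make the resize arithmetic concrete, here is a standalone sketch that repeats the same computation outside the class. The function name `pixtralResizeSketch` and the input values (`longest_edge: 1000`, `patch_size: 14`, `spatial_merge_size: 2`) are assumptions chosen for illustration, not values taken from any model config.

```javascript
// Standalone copy of the arithmetic in get_resize_output_image_size (sketch only).
function pixtralResizeSketch([srcWidth, srcHeight], longest_edge, patch_size, spatial_merge_size) {
    // Downscale only when the longer side exceeds longest_edge.
    const ratio = Math.max(srcWidth, srcHeight) / longest_edge;
    let newWidth = srcWidth;
    let newHeight = srcHeight;
    if (ratio > 1) {
        newWidth = Math.floor(srcWidth / ratio);
        newHeight = Math.floor(srcHeight / ratio);
    }

    // Each output token covers patch_size * spatial_merge_size pixels per side.
    const real_patch_size = patch_size * spatial_merge_size;

    // Ceiling division: round each side up to a whole number of tokens.
    const num_width_tokens = Math.floor((newWidth - 1) / real_patch_size) + 1;
    const num_height_tokens = Math.floor((newHeight - 1) / real_patch_size) + 1;

    return [num_width_tokens * real_patch_size, num_height_tokens * real_patch_size];
}

// 2000x1000 is halved to 1000x500, then rounded up to multiples of 28.
const large = pixtralResizeSketch([2000, 1000], 1000, 14, 2);
// 100x50 is not downscaled, only rounded up to multiples of 28.
const small = pixtralResizeSketch([100, 50], 1000, 14, 2);
console.log(large, small); // [ 1008, 504 ] [ 112, 56 ]
```

Note that the rounded width in the first case (1008) is slightly larger than `longest_edge`, because the rounding to a whole number of tokens happens after the downscale.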

src/models/pixtral/processing_pixtral.js

Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
+
+import { Processor } from "../../base/processing_utils.js";
+import { AutoImageProcessor } from "../auto/image_processing_auto.js";
+import { AutoTokenizer } from "../../tokenizers.js";
+
+export class PixtralProcessor extends Processor {
+    static tokenizer_class = AutoTokenizer
+    static image_processor_class = AutoImageProcessor
+    static uses_processor_config = true;
+
+    /**
+     * @typedef {import('../../utils/image.js').RawImage} RawImage
+     */
+
+    // `images` is required, `text` is optional
+    async _call(/** @type {RawImage|RawImage[]} */ images, text = null, kwargs = {}) {
+
+        const image_inputs = await this.image_processor(images, kwargs);
+
+        if (text) {
+            const [height, width] = image_inputs.pixel_values.dims.slice(-2);
+
+            const { image_token, image_break_token, image_end_token, patch_size, spatial_merge_size } = this.config;
+            const real_patch_size = patch_size * spatial_merge_size;
+            const num_height_tokens = Math.floor(height / real_patch_size);
+            const num_width_tokens = Math.floor(width / real_patch_size);
+
+            text = structuredClone(text); // Avoid modifying the original text input
+            if (!Array.isArray(text)) {
+                text = [text];
+            }
+            for (let i = 0; i < text.length; ++i) {
+                const width_tokens = image_token.repeat(num_width_tokens);
+                const row = width_tokens + image_break_token;
+                const finalRow = width_tokens + image_end_token;
+                const full = row.repeat(num_height_tokens - 1) + finalRow;
+                text[i] = text[i].replace(image_token, full);
+            }
+        }
+
+        const text_inputs = text ? this.tokenizer(text, kwargs) : {};
+
+        return {
+            ...image_inputs,
+            ...text_inputs,
+        }
+    }
+}
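
The placeholder expansion in `_call` can be followed with a small standalone example. This sketch reproduces only the string logic. The helper name `expandImageTokenSketch`, the token strings (`[IMG]`, `[IMG_BREAK]`, `[IMG_END]`), and the image dimensions are assumptions for illustration; the real values come from the processor config and the image processor output.

```javascript
// Standalone copy of the token-expansion logic in PixtralProcessor._call (sketch only).
function expandImageTokenSketch(text, height, width, config) {
    const { image_token, image_break_token, image_end_token, patch_size, spatial_merge_size } = config;
    const real_patch_size = patch_size * spatial_merge_size;
    const num_height_tokens = Math.floor(height / real_patch_size);
    const num_width_tokens = Math.floor(width / real_patch_size);

    // One row of image tokens, terminated by a break token,
    // except the last row, which is terminated by the end token.
    const width_tokens = image_token.repeat(num_width_tokens);
    const row = width_tokens + image_break_token;
    const finalRow = width_tokens + image_end_token;
    const full = row.repeat(num_height_tokens - 1) + finalRow;

    // String.prototype.replace with a string pattern replaces the first occurrence only.
    return text.replace(image_token, full);
}

// Hypothetical config values; a 56x84 (height x width) image gives a 2x3 token grid.
const sketchConfig = {
    image_token: '[IMG]',
    image_break_token: '[IMG_BREAK]',
    image_end_token: '[IMG_END]',
    patch_size: 14,
    spatial_merge_size: 2,
};
const expanded = expandImageTokenSketch('Describe: [IMG]', 56, 84, sketchConfig);
console.log(expanded);
// Describe: [IMG][IMG][IMG][IMG_BREAK][IMG][IMG][IMG][IMG_END]
```

Because a string pattern is used, a prompt containing several image placeholders would have only its first one expanded by this logic.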

src/models/processors.js

Lines changed: 2 additions & 1 deletion
@@ -8,8 +8,9 @@ export * from './llava/processing_llava.js';
 export * from './mgp_str/processing_mgp_str.js';
 export * from './moonshine/processing_moonshine.js';
 export * from './owlvit/processing_owlvit.js';
-export * from './phi3_v/processing_phi3_v.js';
 export * from './paligemma/processing_paligemma.js';
+export * from './phi3_v/processing_phi3_v.js';
+export * from './pixtral/processing_pixtral.js';
 export * from './pyannote/processing_pyannote.js';
 export * from './qwen2_vl/processing_qwen2_vl.js';
 export * from './sam/processing_sam.js';
