---
slug: mom-family
title: "MoM: Specialized Models for Intelligent Routing"
authors: [Xunzhuo]
tags: [mom, models, routing, announcement]
---

**One fabric. Many minds.** We're introducing **MoM** (Mixture of Models)—a family of specialized routing models that power vLLM-SR's intelligent decision-making.

<!-- truncate -->

## Why MoM?

vLLM-SR solves a critical problem: **how to route LLM requests to the right model at the right time**. Not every query needs the same resources—"What's the weather?" shouldn't cost as much as "Analyze this legal contract."

## MoM System Card

A quick overview of all MoM models:

| Category | Model | Size | Base Model | Latency | Purpose |
|----------|-------|------|------------|---------|---------|
| **🧠 Intelligent Routing** | mom-brain-flash | Flash | ModernBERT | <10ms | Ultra-fast intent classification |
| | mom-brain-pro | Pro | Qwen 0.6B | ~30-50ms | Balanced routing with reasoning |
| | mom-brain-max | Max | Qwen 1.7B | ~50-100ms | Maximum accuracy for complex decisions |
| **🔍 Similarity Search** | mom-similarity-flash | Flash | ModernBERT | <10ms | Semantic similarity matching |
| **🔒 Prompt Guardian** | mom-jailbreak-flash | Flash | ModernBERT | <10ms | Jailbreak/attack detection |
| | mom-pii-flash | Flash | ModernBERT | <10ms | PII detection & privacy protection |
| **🎯 SLM Experts** | mom-expert-math-flash | Flash | Qwen 0.6B | ~30-50ms | Mathematics routing |
| | mom-expert-math-pro | Pro | Qwen 1.7B | ~50-100ms | Advanced math with reasoning |

**Key Insights:**

- **4 categories** × up to **3 size variants** = flexible routing architecture
- **ModernBERT** (encoder-only) → Sub-10ms latency for high-throughput scenarios
- **Qwen** (decoder-only) → Explainable decisions with reasoning capabilities
- **Flash** models achieve 10,000+ QPS on commodity hardware
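
One way to act on the latency column above: pick the most capable size variant whose worst-case latency still fits your budget. The helper below is a minimal sketch for illustration (it is not part of the vLLM-SR API); the worst-case figures are taken from the system card.

```python
# Worst-case latencies (ms) per brain variant, from the system card above.
# Ordered most capable -> least capable; hypothetical helper, not vLLM-SR API.
BRAIN_VARIANTS = [
    ("mom-brain-max", 100),    # Qwen 1.7B, maximum accuracy
    ("mom-brain-pro", 50),     # Qwen 0.6B, balanced routing with reasoning
    ("mom-brain-flash", 10),   # ModernBERT, fastest
]

def pick_brain(latency_budget_ms: float) -> str:
    """Return the largest variant whose worst-case latency fits the budget."""
    for name, worst_case_ms in BRAIN_VARIANTS:
        if worst_case_ms <= latency_budget_ms:
            return name
    return "mom-brain-flash"  # nothing fits strictly: fall back to the fastest
```

With a 120ms budget the selector can afford `mom-brain-max`; at 60ms it falls back to `mom-brain-pro`; tighter budgets land on `mom-brain-flash`.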

## The Evolution: From Encoder-Only to Mixture-of-Models

### Where We Started: ModernBERT Foundation

vLLM-SR initially built its routing intelligence entirely on **ModernBERT** (encoder-only models):

**Advantages**:

- ⚡ **Blazing fast**: Sub-10ms inference latency
- 📊 **High throughput**: 10,000+ QPS on commodity hardware
- 💰 **Cost-effective**: Minimal compute requirements
- 🎯 **Proven accuracy**: Strong performance on classification tasks

**Limitations**:

- ❌ **Black-box decisions**: No explanation for routing choices
- ❌ **Limited reasoning**: Cannot handle complex, multi-step logic
- ❌ **Fixed capabilities**: Hard to extend with new behaviors
- ❌ **No tool integration**: Cannot leverage external tools or APIs

### Why We're Evolving: Decoder-Only Models

As vLLM-SR adoption grew, we encountered more diverse scenarios and requirements:

- **Explainability**: Users need to understand *why* a query was routed to a specific model
- **Complex reasoning**: Some routing decisions require multi-step analysis
- **Agentic workflows**: Integration with tool calling, function execution, and external APIs
- **Advanced techniques**: Reinforcement learning (RL) and sophisticated post-training methods
- **Domain expertise**: Specialized routing for legal, medical, and scientific domains

**The solution**: Expand to decoder-only models while keeping encoder speed where it matters.

### The MoM Architecture: Best of Both Worlds

Our **Mixture-of-Models** approach combines encoder and decoder strengths:

- ⚡ **Encoders** — Fast classification (sub-10ms latency) for high-throughput scenarios
- 🧠 **Decoders** — Explainable decisions with reasoning for transparency
- 🎯 **Domain Agents** — Expert routing with specialized knowledge

This hybrid architecture lets you choose the right tool for each job: speed when you need it, reasoning when it matters.
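
That trade-off can be sketched as a tiny dispatcher: decoder variants when the caller needs a rationale or faces a complex query, the encoder otherwise. This is an illustrative stand-in, not the actual vLLM-SR dispatch logic.

```python
# Illustrative encoder/decoder dispatch -- hypothetical helper, not vLLM-SR API.
# Model names come from the MoM family described above.
def choose_router(needs_explanation: bool, complex_query: bool) -> str:
    """Speed when you need it, reasoning when it matters."""
    if complex_query:
        return "mom-brain-max"    # decoder: maximum accuracy for hard decisions
    if needs_explanation:
        return "mom-brain-pro"    # decoder: routing with a stated rationale
    return "mom-brain-flash"      # encoder: sub-10ms classification
```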

## The MoM Model Family

We organize MoM models into **four categories** with up to **three size variants** (Flash, Pro, Max):

### 🧠 Intelligent Routing

Smart routing models with three size variants:

| Model | Size | Base Model | Purpose |
|-------|------|------------|---------|
| **mom-brain-flash** | Flash | ModernBERT | Ultra-fast intent classification (sub-10ms latency) |
| **mom-brain-pro** | Pro | Qwen 0.6B | Balanced performance with reasoning capabilities |
| **mom-brain-max** | Max | Qwen 1.7B | Maximum accuracy for complex routing decisions |

**Architecture**: Flash is based on ModernBERT (encoder-only), while Pro and Max are based on Qwen 0.6B and 1.7B (decoder-only) models.

### 🔍 Similarity Search

Semantic similarity and vector search:

| Model | Size | Base Model | Purpose |
|-------|------|------------|---------|
| **mom-similarity-flash** | Flash | ModernBERT | Fast semantic similarity matching for route selection |

**Architecture**: Based on ModernBERT (encoder-only) for high-speed embedding generation.

### 🔒 Prompt Guardian

Security and safety checks before routing:

| Model | Size | Base Model | Purpose |
|-------|------|------------|---------|
| **mom-jailbreak-flash** | Flash | ModernBERT | Jailbreak/attack detection (security) |
| **mom-pii-flash** | Flash | ModernBERT | PII detection (privacy protection) |

**Architecture**: Both based on ModernBERT (encoder-only) for ultra-fast security checks.

### 🎯 SLM Experts

Specialized small language models for domain-specific routing:

| Model | Size | Base Model | Domain |
|-------|------|------------|--------|
| **mom-expert-math-flash** | Flash | Qwen 0.6B | Mathematics (algebra, calculus, statistics) |
| **mom-expert-math-pro** | Pro | Qwen 1.7B | Advanced mathematics with reasoning |

**Architecture**: Based on Qwen models (decoder-only) for domain-specific reasoning capabilities.

## Design Principles

**Safety-First**: Prompt Guardian models (PII, jailbreak detection) run before routing—security at the edge.

**Speed ↔ Capability**: Choose Flash for sub-10ms latency, Pro for balanced performance, or Max for maximum accuracy. Different sizes, different SLAs.

**Domain Expertise**: SLM Expert models achieve 15-25% better accuracy on domain-specific tasks than generalist routing. Math queries go to math experts.

## How vLLM-SR Uses MoM

vLLM-SR's routing pipeline leverages MoM models at multiple stages:

1. **Security Check** → `mom-jailbreak-flash` and `mom-pii-flash` filter malicious/sensitive requests
2. **Intent Classification** → `mom-brain-*` models (flash/pro/max) determine query type and routing decisions
3. **Similarity Search** → `mom-similarity-flash` finds semantically similar routes
4. **Domain Routing** → `mom-expert-*` models route specialized queries to optimal downstream models
5. **Cost Optimization** → Simple queries → lightweight models; complex queries → premium models

This achieves a **2x+ cost reduction** while maintaining quality, similar to [RouteLLM](https://arxiv.org/abs/2406.18665).
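
The stages above can be sketched end to end. Every stage function below is a hypothetical toy stand-in for the corresponding MoM model call (keyword heuristics in place of real classifiers); the control flow, not the heuristics, is the point.

```python
# Minimal sketch of the multi-stage routing pipeline described above.
# Stage functions are toy stand-ins for MoM model inference, not real detectors.

def jailbreak_check(query: str) -> bool:
    # Stand-in for mom-jailbreak-flash: flag an obvious attack pattern.
    return "ignore previous instructions" in query.lower()

def pii_check(query: str) -> bool:
    # Stand-in for mom-pii-flash: flag obvious PII (toy heuristic: an email).
    return "@" in query

def classify_intent(query: str) -> str:
    # Stand-in for mom-brain-*: toy keyword-based intent classification.
    math_words = ("integral", "solve", "equation")
    return "math" if any(w in query.lower() for w in math_words) else "general"

def route(query: str) -> str:
    # 1. Security check -> reject malicious/sensitive requests early.
    if jailbreak_check(query) or pii_check(query):
        return "rejected"
    # 2. Intent classification -> determine query type.
    intent = classify_intent(query)
    # 3-4. Similarity/domain routing -> specialized queries go to experts.
    if intent == "math":
        return "mom-expert-math-pro"
    # 5. Cost optimization -> simple queries go to a lightweight model.
    return "lightweight-model"
```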

## What's Next: Exploring Frontier Techniques

The move to decoder-only models opens exciting possibilities for vLLM-SR:

### 🤖 Agentic Routing

Decoder models can act as intelligent agents that:

- Dynamically select and orchestrate multiple models
- Make multi-step routing decisions with tool calling
- Adapt routing strategies based on feedback

### 🎯 Reinforcement Learning (RL)

Apply RL techniques to optimize routing decisions:

- Learn from user feedback and model performance
- Discover optimal routing policies through trial and error
- Continuously improve cost-quality trade-offs

### 🔧 Advanced Post-Training

Leverage cutting-edge post-training methods:

- **Distillation**: Transfer knowledge from large models to efficient routers
- **Preference learning**: Train on human feedback (RLHF, DPO)
- **Domain adaptation**: Fine-tune for specific industries or use cases

### 🛠️ Tool Integration

Enable routers to:

- Call external APIs for context-aware routing
- Query databases for historical routing patterns
- Integrate with monitoring systems for real-time optimization

**The vision**: vLLM-SR routers that not only classify but *reason*, *learn*, and *adapt*.

## Model Naming Convention

```text
mom-{category}-{size}
mom-expert-{domain}-{size}
```
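
The convention is regular enough to parse mechanically. The helper below is an illustrative sketch (not shipped with vLLM-SR) that splits a name into its category, optional domain, and size:

```python
# Parse a MoM model name following the convention above.
# Illustrative helper only -- not part of any released library.
def parse_mom_name(name: str) -> dict:
    parts = name.split("-")
    if parts[0] != "mom":
        raise ValueError(f"not a MoM model name: {name}")
    if parts[1] == "expert":
        # mom-expert-{domain}-{size}
        return {"category": "expert", "domain": parts[2], "size": parts[3]}
    # mom-{category}-{size}
    return {"category": parts[1], "size": parts[2]}
```

For example, `mom-brain-flash` parses to category `brain`, size `flash`, while `mom-expert-math-pro` adds the domain `math`.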

### Four Categories

1. **Intelligent Routing**: `mom-brain-{flash|pro|max}`
2. **Similarity Search**: `mom-similarity-{flash}`
3. **Prompt Guardian**: `mom-{jailbreak|pii}-{flash}`
4. **SLM Experts**: `mom-expert-{domain}-{flash|pro}`

### Three Size Variants

- **flash**: ModernBERT-based (for brain/similarity/guardian) or Qwen 0.6B (for experts) — fastest tier; sub-10ms on ModernBERT, ~30-50ms for Qwen-based experts
- **pro**: Qwen 0.6B (for brain) or Qwen 1.7B (for experts) — balanced performance with reasoning
- **max**: Qwen 1.7B (for brain) — maximum accuracy and capabilities

### Architecture Summary

- **Intelligent Routing**: Flash (ModernBERT) + Pro/Max (Qwen 0.6B/1.7B)
- **Similarity Search**: Flash (ModernBERT)
- **Prompt Guardian**: Flash (ModernBERT)
- **SLM Experts**: Flash/Pro (Qwen 0.6B/1.7B)

## Get Started

All MoM models are available on [Hugging Face](https://huggingface.co/LLM-Semantic-Router).

**Resources**:

- [GitHub](https://github.com/vllm-project/semantic-router)
- [Documentation](https://vllm-semantic-router.com)
- [Quick Start Guide](https://vllm-semantic-router.com/docs/installation)

---

**vLLM-SR · Route with intent. Think with reason.**