|
| 1 | +--- |
| 2 | +slug: mom-family |
| 3 | +title: "MoM: Specialized Models for Intelligent Routing" |
| 4 | +authors: [Xunzhuo] |
| 5 | +tags: [mom, models, routing, announcement] |
| 6 | +--- |
| 7 | + |
| 8 | + |
| 9 | + |
| 10 | +**One fabric. Many minds.** We're introducing **MoM** (Mixture of Models)—a family of specialized routing models that power vLLM-SR's intelligent decision-making. |
| 11 | + |
| 12 | +<!-- truncate --> |
| 13 | + |
| 14 | +## Why MoM? |
| 15 | + |
| 16 | +vLLM-SR solves a critical problem: **how to route LLM requests to the right model at the right time**. Not every query needs the same resources—"What's the weather?" shouldn't cost as much as "Analyze this legal contract." |
| 17 | + |
| 18 | +## The Evolution: From Encoder-Only to Mixture-of-Models |
| 19 | + |
| 20 | +### Where We Started: ModernBERT Foundation |
| 21 | + |
| 22 | +vLLM-SR initially built its routing intelligence entirely on **ModernBERT** (encoder-only models): |
| 23 | + |
| 24 | +**Advantages**: |
| 25 | + |
| 26 | +- ⚡ **Blazing fast**: Sub-10ms inference latency |
| 27 | +- 📊 **High throughput**: 10,000+ QPS on commodity hardware |
| 28 | +- 💰 **Cost-effective**: Minimal compute requirements |
| 29 | +- 🎯 **Proven accuracy**: Strong performance on classification tasks |
| 30 | + |
| 31 | +**Limitations**: |
| 32 | + |
| 33 | +- ❌ **Black box decisions**: No explanation for routing choices |
| 34 | +- ❌ **Limited reasoning**: Cannot handle complex, multi-step logic |
| 35 | +- ❌ **Fixed capabilities**: Hard to extend with new behaviors |
| 36 | +- ❌ **No tool integration**: Cannot leverage external tools or APIs |
| 37 | + |
| 38 | +### Why We're Evolving: Decoder-Only Models |
| 39 | + |
| 40 | +As vLLM-SR adoption grew, we encountered more diverse scenarios and requirements: |
| 41 | + |
| 42 | +- **Explainability**: Users need to understand *why* a query was routed to a specific model |
| 43 | +- **Complex reasoning**: Some routing decisions require multi-step analysis |
| 44 | +- **Agentic workflows**: Integration with tool calling, function execution, and external APIs |
| 45 | +- **Advanced techniques**: Reinforcement learning (RL), sophisticated post-training methods |
| 46 | +- **Domain expertise**: Specialized routing for legal, medical, scientific domains |
| 47 | + |
| 48 | +**The Solution**: Expand to decoder-only models while keeping encoder speed where it matters. |
| 49 | + |
| 50 | +### The MoM Architecture: Best of Both Worlds |
| 51 | + |
| 52 | +Our **Mixture-of-Models** approach combines encoder and decoder strengths: |
| 53 | + |
| 54 | +- ⚡ **Encoders** — Fast classification (sub-10ms latency) for high-throughput scenarios |
| 55 | +- 🧠 **Decoders** — Explainable decisions with reasoning for transparency |
| 56 | +- 🎯 **Domain Agents** — Expert routing with specialized knowledge |
| 57 | + |
| 58 | +This hybrid architecture lets you choose the right tool for each job: speed when you need it, reasoning when it matters. |
| 59 | + |
| 60 | +## The MoM Model Family |
| 61 | + |
| 62 | +### 🔒 Encoders — Speed & Safety |
| 63 | + |
| 64 | +Fast, high-throughput models for classification and security checks: |
| 65 | + |
| 66 | +| Model | Purpose | |
| 67 | +|-------|---------| |
| 68 | +| **mom-enc-class-intent-v1** | Intent/topic classification (sub-10ms latency) | |
| 69 | +| **mom-enc-guard-pii-v1** | PII detection (privacy protection) | |
| 70 | +| **mom-enc-guard-jailbreak-v1** | Jailbreak/attack detection (security) | |
| 71 | + |
| 72 | +### 🧠 Decoders — Explainability |
| 73 | + |
| 74 | +When you need to understand *why* a routing decision was made: |
| 75 | + |
| 76 | +| Model | Purpose | |
| 77 | +|-------|---------| |
| 78 | +| **mom-dec-class-intent-v1** | Intent classification with reasoning | |
| 79 | +| **mom-dec-class-intent-r1** | Higher-capacity variant for complex cases | |
| 80 | + |
| 81 | +### 🎯 Domain Agents — Specialized Expertise |
| 82 | + |
| 83 | +Expert models for domain-specific routing: |
| 84 | + |
| 85 | +| Model | Domain | |
| 86 | +|-------|--------| |
| 87 | +| **mom-dec-agent-sci-v1** | Science (physics, chemistry, biology) | |
| 88 | +| **mom-dec-agent-math-v1** | Mathematics (algebra, calculus, statistics) | |
| 89 | +| **mom-dec-agent-hum-v1** | Humanities (literature, philosophy, history) | |
| 90 | +| **mom-dec-agent-soc-v1** | Social sciences (psychology, economics) | |
| 91 | +| **mom-dec-agent-law-v1** | Legal (contracts, compliance) | |
| 92 | +| **mom-dec-agent-gen-v1** | Generalist fallback | |
| 93 | + |
| 94 | +## Design Principles |
| 95 | + |
| 96 | +**Safety-First**: Guardrail models (PII, jailbreak detection) run before routing—security at the edge. |
| 97 | + |
| 98 | +**Speed ↔ Explainability**: Choose encoders for sub-10ms latency or decoders for transparent reasoning. Different endpoints, different SLAs. |
| 99 | + |
| 100 | +**Domain Expertise**: Specialized agents achieve 15-25% better accuracy on domain-specific tasks vs. generalist routing. Math queries go to math experts, legal queries to legal experts. |
| 101 | + |
| 102 | +## How vLLM-SR Uses MoM |
| 103 | + |
| 104 | +vLLM-SR's routing pipeline leverages MoM models at multiple stages: |
| 105 | + |
| 106 | +1. **Security Check** → `mom-enc-guard-*` models filter malicious/sensitive requests |
| 107 | +2. **Intent Classification** → `mom-enc-class-intent-v1` or `mom-dec-class-intent-v1` determines query type |
| 108 | +3. **Domain Routing** → `mom-dec-agent-*` models route specialized queries to optimal downstream models |
| 109 | +4. **Cost Optimization** → Simple queries → lightweight models; complex queries → premium models |
| 110 | + |
| 111 | +This achieves **2x+ cost reduction** while maintaining quality, similar to [RouteLLM](https://arxiv.org/abs/2406.18665). |
| 112 | + |
| 113 | +## Performance |
| 114 | + |
| 115 | +Early benchmarks: |
| 116 | + |
| 117 | +- **Encoders**: sub-10ms p99 latency, 10,000+ QPS |
| 118 | +- **Decoders**: ~50-100ms latency with explainable outputs |
| 119 | +- **Domain Agents**: 15-25% accuracy improvement over generalist routing |
| 120 | + |
| 121 | +## What's Next: Exploring Frontier Techniques |
| 122 | + |
| 123 | +The move to decoder-only models opens exciting possibilities for vLLM-SR: |
| 124 | + |
| 125 | +### 🤖 Agentic Routing |
| 126 | + |
| 127 | +Decoder models can act as intelligent agents that: |
| 128 | + |
| 129 | +- Dynamically select and orchestrate multiple models |
| 130 | +- Make multi-step routing decisions with tool calling |
| 131 | +- Adapt routing strategies based on feedback |
| 132 | + |
| 133 | +### 🎯 Reinforcement Learning (RL) |
| 134 | + |
| 135 | +Apply RL techniques to optimize routing decisions: |
| 136 | + |
| 137 | +- Learn from user feedback and model performance |
| 138 | +- Discover optimal routing policies through trial and error |
| 139 | +- Continuously improve cost-quality trade-offs |
| 140 | + |
| 141 | +### 🔧 Advanced Post-Training |
| 142 | + |
| 143 | +Leverage cutting-edge post-training methods: |
| 144 | + |
| 145 | +- **Distillation**: Transfer knowledge from large models to efficient routers |
| 146 | +- **Preference learning**: Train on human feedback (RLHF, DPO) |
| 147 | +- **Domain adaptation**: Fine-tune for specific industries or use cases |
| 148 | + |
| 149 | +### 🛠️ Tool Integration |
| 150 | + |
| 151 | +Enable routers to: |
| 152 | + |
| 153 | +- Call external APIs for context-aware routing |
| 154 | +- Query databases for historical routing patterns |
| 155 | +- Integrate with monitoring systems for real-time optimization |
| 156 | + |
| 157 | +**The vision**: vLLM-SR routers that not only classify but *reason*, *learn*, and *adapt*. |
| 158 | + |
| 159 | +## Model Naming |
| 160 | + |
| 161 | +```text |
| 162 | +mom-{type}-{function}-{domain}-{version} |
| 163 | +``` |
| 164 | + |
| 165 | +- **type**: `enc` (encoder) / `dec` (decoder) |
| 166 | +- **function**: `class` (classification) / `guard` (safety) / `agent` (domain expert) |
| 167 | +- **domain**: `intent`, `pii`, `jailbreak`, `sci`, `math`, etc. |
| 168 | +- **version**: `v1` (baseline) / `r1` (higher-capacity) |
| 169 | + |
| 170 | +## Get Started |
| 171 | + |
| 172 | +All MoM models are available on [Hugging Face](https://huggingface.co/LLM-Semantic-Router). |
| 173 | + |
| 174 | +**Resources**: |
| 175 | + |
| 176 | +- [GitHub](https://github.com/vllm-project/semantic-router) |
| 177 | +- [Documentation](https://vllm-semantic-router.com) |
| 178 | +- [Quick Start Guide](https://vllm-semantic-router.com/docs/installation) |
| 179 | + |
| 180 | +--- |
| 181 | + |
| 182 | +**vLLM-SR · Route with intent. Think with reason.** |
0 commit comments