---
slug: mom-family
title: "MoM: Specialized Models for Intelligent Routing"
authors: [Xunzhuo]
tags: [mom, models, routing, announcement]
---

**One fabric. Many minds.** We're introducing **MoM** (Mixture of Models)—a family of specialized routing models that power vLLM-SR's intelligent decision-making.

<!-- truncate -->

## Why MoM?

vLLM-SR solves a critical problem: **how to route LLM requests to the right model at the right time**. Not every query needs the same resources—"What's the weather?" shouldn't cost as much as "Analyze this legal contract."

## MoM System Card

A quick overview of all MoM models:

| Category | Model | Size | Base Model | Latency | Purpose |
|----------|-------|------|------------|---------|---------|
| **🧠 Intelligent Routing** | mom-brain-flash | Flash | ModernBERT | <10ms | Ultra-fast intent classification |
| | mom-brain-pro | Pro | Qwen 0.6B | ~30-50ms | Balanced routing with reasoning |
| | mom-brain-max | Max | Qwen 1.7B | ~50-100ms | Maximum accuracy for complex decisions |
| **🔍 Similarity Search** | mom-similarity-flash | Flash | ModernBERT | <10ms | Semantic similarity matching |
| **🔒 Prompt Guardian** | mom-jailbreak-flash | Flash | ModernBERT | <10ms | Jailbreak/attack detection |
| | mom-pii-flash | Flash | ModernBERT | <10ms | PII detection & privacy protection |
| **🎯 SLM Experts** | mom-expert-math-flash | Flash | Qwen 0.6B | ~30-50ms | Mathematics routing |
| | mom-expert-math-pro | Pro | Qwen 1.7B | ~50-100ms | Advanced math with reasoning |

**Key Insights:**

- **4 categories** × up to **3 size variants** = flexible routing architecture
- **ModernBERT** (encoder-only) → Sub-10ms latency for high-throughput scenarios
- **Qwen** (decoder-only) → Explainable decisions with reasoning capabilities
- **Flash** models achieve 10,000+ QPS on commodity hardware
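
One way to act on the latency column above: pick the most capable size variant whose worst-case latency still fits your budget. The helper below is a minimal sketch for illustration (it is not part of the vLLM-SR API); the worst-case figures are taken from the system card.

```python
# Worst-case latencies (ms) per brain variant, from the system card above.
# Ordered most capable -> least capable; hypothetical helper, not vLLM-SR API.
BRAIN_VARIANTS = [
    ("mom-brain-max", 100),    # Qwen 1.7B, maximum accuracy
    ("mom-brain-pro", 50),     # Qwen 0.6B, balanced routing with reasoning
    ("mom-brain-flash", 10),   # ModernBERT, fastest
]

def pick_brain(latency_budget_ms: float) -> str:
    """Return the largest variant whose worst-case latency fits the budget."""
    for name, worst_case_ms in BRAIN_VARIANTS:
        if worst_case_ms <= latency_budget_ms:
            return name
    return "mom-brain-flash"  # nothing fits strictly: fall back to the fastest
```

With a 120ms budget the selector can afford `mom-brain-max`; at 60ms it falls back to `mom-brain-pro`; tighter budgets land on `mom-brain-flash`.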

## The Evolution: From Encoder-Only to Mixture-of-Models

### Where We Started: ModernBERT Foundation

vLLM-SR initially built its routing intelligence entirely on **ModernBERT** (encoder-only models):

**Advantages**:

- ⚡ **Blazing fast**: Sub-10ms inference latency
- 📊 **High throughput**: 10,000+ QPS on commodity hardware
- 💰 **Cost-effective**: Minimal compute requirements
- 🎯 **Proven accuracy**: Strong performance on classification tasks

**Limitations**:

- ❌ **Black-box decisions**: No explanation for routing choices
- ❌ **Limited reasoning**: Cannot handle complex, multi-step logic
- ❌ **Fixed capabilities**: Hard to extend with new behaviors
- ❌ **No tool integration**: Cannot leverage external tools or APIs

### Why We're Evolving: Decoder-Only Models

As vLLM-SR adoption grew, we encountered more diverse scenarios and requirements:

- **Explainability**: Users need to understand *why* a query was routed to a specific model
- **Complex reasoning**: Some routing decisions require multi-step analysis
- **Agentic workflows**: Integration with tool calling, function execution, and external APIs
- **Advanced techniques**: Reinforcement learning (RL) and sophisticated post-training methods
- **Domain expertise**: Specialized routing for legal, medical, and scientific domains

**The solution**: Expand to decoder-only models while keeping encoder speed where it matters.

### The MoM Architecture: Best of Both Worlds

Our **Mixture-of-Models** approach combines encoder and decoder strengths:

- ⚡ **Encoders** — Fast classification (sub-10ms latency) for high-throughput scenarios
- 🧠 **Decoders** — Explainable decisions with reasoning for transparency
- 🎯 **Domain Agents** — Expert routing with specialized knowledge

This hybrid architecture lets you choose the right tool for each job: speed when you need it, reasoning when it matters.
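
That trade-off can be sketched as a tiny dispatcher: decoder variants when the caller needs a rationale or faces a complex query, the encoder otherwise. This is an illustrative stand-in, not the actual vLLM-SR dispatch logic.

```python
# Illustrative encoder/decoder dispatch -- hypothetical helper, not vLLM-SR API.
# Model names come from the MoM family described above.
def choose_router(needs_explanation: bool, complex_query: bool) -> str:
    """Speed when you need it, reasoning when it matters."""
    if complex_query:
        return "mom-brain-max"    # decoder: maximum accuracy for hard decisions
    if needs_explanation:
        return "mom-brain-pro"    # decoder: routing with a stated rationale
    return "mom-brain-flash"      # encoder: sub-10ms classification
```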

## The MoM Model Family

We organize MoM models into **four categories** with up to **three size variants** (Flash, Pro, Max):

### 🧠 Intelligent Routing

Smart routing models with three size variants:

| Model | Size | Base Model | Purpose |
|-------|------|------------|---------|
| **mom-brain-flash** | Flash | ModernBERT | Ultra-fast intent classification (sub-10ms latency) |
| **mom-brain-pro** | Pro | Qwen 0.6B | Balanced performance with reasoning capabilities |
| **mom-brain-max** | Max | Qwen 1.7B | Maximum accuracy for complex routing decisions |

**Architecture**: Flash is based on ModernBERT (encoder-only), while Pro and Max are based on Qwen 0.6B and 1.7B (decoder-only) models.

### 🔍 Similarity Search

Semantic similarity and vector search:

| Model | Size | Base Model | Purpose |
|-------|------|------------|---------|
| **mom-similarity-flash** | Flash | ModernBERT | Fast semantic similarity matching for route selection |

**Architecture**: Based on ModernBERT (encoder-only) for high-speed embedding generation.

### 🔒 Prompt Guardian

Security and safety checks before routing:

| Model | Size | Base Model | Purpose |
|-------|------|------------|---------|
| **mom-jailbreak-flash** | Flash | ModernBERT | Jailbreak/attack detection (security) |
| **mom-pii-flash** | Flash | ModernBERT | PII detection (privacy protection) |

**Architecture**: Both based on ModernBERT (encoder-only) for ultra-fast security checks.

### 🎯 SLM Experts

Specialized small language models for domain-specific routing:

| Model | Size | Base Model | Domain |
|-------|------|------------|--------|
| **mom-expert-math-flash** | Flash | Qwen 0.6B | Mathematics (algebra, calculus, statistics) |
| **mom-expert-math-pro** | Pro | Qwen 1.7B | Advanced mathematics with reasoning |

**Architecture**: Based on Qwen models (decoder-only) for domain-specific reasoning capabilities.

## Design Principles

**Safety-First**: Prompt Guardian models (PII, jailbreak detection) run before routing—security at the edge.

**Speed ↔ Capability**: Choose Flash for sub-10ms latency, Pro for balanced performance, or Max for maximum accuracy. Different sizes, different SLAs.

**Domain Expertise**: SLM Expert models achieve 15-25% better accuracy on domain-specific tasks than generalist routing. Math queries go to math experts.

## How vLLM-SR Uses MoM

vLLM-SR's routing pipeline leverages MoM models at multiple stages:

1. **Security Check** → `mom-jailbreak-flash` and `mom-pii-flash` filter malicious/sensitive requests
2. **Intent Classification** → `mom-brain-*` models (flash/pro/max) determine query type and routing decisions
3. **Similarity Search** → `mom-similarity-flash` finds semantically similar routes
4. **Domain Routing** → `mom-expert-*` models route specialized queries to optimal downstream models
5. **Cost Optimization** → Simple queries → lightweight models; complex queries → premium models

This achieves a **2x+ cost reduction** while maintaining quality, similar to [RouteLLM](https://arxiv.org/abs/2406.18665).
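
The stages above can be sketched end to end. Every stage function below is a hypothetical toy stand-in for the corresponding MoM model call (keyword heuristics in place of real classifiers); the control flow, not the heuristics, is the point.

```python
# Minimal sketch of the multi-stage routing pipeline described above.
# Stage functions are toy stand-ins for MoM model inference, not real detectors.

def jailbreak_check(query: str) -> bool:
    # Stand-in for mom-jailbreak-flash: flag an obvious attack pattern.
    return "ignore previous instructions" in query.lower()

def pii_check(query: str) -> bool:
    # Stand-in for mom-pii-flash: flag obvious PII (toy heuristic: an email).
    return "@" in query

def classify_intent(query: str) -> str:
    # Stand-in for mom-brain-*: toy keyword-based intent classification.
    math_words = ("integral", "solve", "equation")
    return "math" if any(w in query.lower() for w in math_words) else "general"

def route(query: str) -> str:
    # 1. Security check -> reject malicious/sensitive requests early.
    if jailbreak_check(query) or pii_check(query):
        return "rejected"
    # 2. Intent classification -> determine query type.
    intent = classify_intent(query)
    # 3-4. Similarity/domain routing -> specialized queries go to experts.
    if intent == "math":
        return "mom-expert-math-pro"
    # 5. Cost optimization -> simple queries go to a lightweight model.
    return "lightweight-model"
```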

## What's Next: Exploring Frontier Techniques

The move to decoder-only models opens exciting possibilities for vLLM-SR:

### 🤖 Agentic Routing

Decoder models can act as intelligent agents that:

- Dynamically select and orchestrate multiple models
- Make multi-step routing decisions with tool calling
- Adapt routing strategies based on feedback

### 🎯 Reinforcement Learning (RL)

Apply RL techniques to optimize routing decisions:

- Learn from user feedback and model performance
- Discover optimal routing policies through trial and error
- Continuously improve cost-quality trade-offs

### 🔧 Advanced Post-Training

Leverage cutting-edge post-training methods:

- **Distillation**: Transfer knowledge from large models to efficient routers
- **Preference learning**: Train on human feedback (RLHF, DPO)
- **Domain adaptation**: Fine-tune for specific industries or use cases

### 🛠️ Tool Integration

Enable routers to:

- Call external APIs for context-aware routing
- Query databases for historical routing patterns
- Integrate with monitoring systems for real-time optimization

**The vision**: vLLM-SR routers that not only classify but *reason*, *learn*, and *adapt*.

## Model Naming Convention

```text
mom-{category}-{size}
mom-expert-{domain}-{size}
```
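
The convention is regular enough to parse mechanically. The helper below is an illustrative sketch (not shipped with vLLM-SR) that splits a name into its category, optional domain, and size:

```python
# Parse a MoM model name following the convention above.
# Illustrative helper only -- not part of any released library.
def parse_mom_name(name: str) -> dict:
    parts = name.split("-")
    if parts[0] != "mom":
        raise ValueError(f"not a MoM model name: {name}")
    if parts[1] == "expert":
        # mom-expert-{domain}-{size}
        return {"category": "expert", "domain": parts[2], "size": parts[3]}
    # mom-{category}-{size}
    return {"category": parts[1], "size": parts[2]}
```

For example, `mom-brain-flash` parses to category `brain`, size `flash`, while `mom-expert-math-pro` adds the domain `math`.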

### Four Categories

1. **Intelligent Routing**: `mom-brain-{flash|pro|max}`
2. **Similarity Search**: `mom-similarity-{flash}`
3. **Prompt Guardian**: `mom-{jailbreak|pii}-{flash}`
4. **SLM Experts**: `mom-expert-{domain}-{flash|pro}`

### Three Size Variants

- **flash**: ModernBERT-based (for brain/similarity/guardian) or Qwen 0.6B (for experts) — fastest tier; sub-10ms on ModernBERT, ~30-50ms for Qwen-based experts
- **pro**: Qwen 0.6B (for brain) or Qwen 1.7B (for experts) — balanced performance with reasoning
- **max**: Qwen 1.7B (for brain) — maximum accuracy and capabilities

### Architecture Summary

- **Intelligent Routing**: Flash (ModernBERT) + Pro/Max (Qwen 0.6B/1.7B)
- **Similarity Search**: Flash (ModernBERT)
- **Prompt Guardian**: Flash (ModernBERT)
- **SLM Experts**: Flash/Pro (Qwen 0.6B/1.7B)

## Get Started

All MoM models are available on [Hugging Face](https://huggingface.co/LLM-Semantic-Router).

**Resources**:

- [GitHub](https://github.com/vllm-project/semantic-router)
- [Documentation](https://vllm-semantic-router.com)
- [Quick Start Guide](https://vllm-semantic-router.com/docs/installation)

---

**vLLM-SR · Route with intent. Think with reason.**