
Commit af59a95

docs: simplify estimation data content (#607)
1 parent d60ca63 commit af59a95

File tree

1 file changed: +15 -235 lines

website/blog/2025-11-07-semantic-tool-selection.md

Lines changed: 15 additions & 235 deletions
@@ -46,9 +46,9 @@ Recent academic studies have measured the impact of large tool catalogs on LLM p
 
 - With ~50 tools (8K tokens): Most models maintain 84-95% accuracy
 - With ~200 tools (32K tokens): Accuracy ranges from 41-83% depending on model
-- With ~740 tools (120K tokens): Accuracy drops to 0-76%, with only GPT-4o maintaining >70%
+- With ~740 tools (120K tokens): Accuracy drops to 0-20% for most models
 
-Different models show varying degrees of degradation. GPT-4o experiences a 19% accuracy drop, while some open-source models show 79-100% degradation when scaling from small to large tool catalogs.
+Different models show varying degrees of degradation, with open-source models showing 79-100% degradation when scaling from small to large tool catalogs.
 
 **The "Lost in the Middle" Effect:** Research has documented position bias where tools in the middle of long lists are less likely to be selected correctly. For example, with 741 tools, middle positions (40-60%) showed 22-52% accuracy compared to 31-32% at the beginning/end positions for some models.
 
@@ -111,22 +111,21 @@ Recent academic research has quantified the severity of this problem. Studies sh
 
 One study testing tool selection with increasing catalog sizes found that baseline accuracy dropped from 78% with 10 tools to just 13.62% with 100+ tools - a catastrophic 82% degradation. This "needle in a haystack" problem for tool selection motivated our semantic approach.
 
-### Experiment 1: Large Tool Catalog Stress Test
+### Large Tool Catalog Stress Test
 
 **Setup:**
 
 Based on the Berkeley Function Calling Leaderboard (BFCL) dataset, we tested tool selection performance as catalog size grows:
 
 - **Dataset**: 858 function calling samples (simple, live_simple, multiple subsets)
 - **Tool catalog sizes**: Varied from 49 tools (8K tokens) to 741 tools (120K tokens)
-- **Models tested**: GPT-4o, Llama-3.1-70B, Mistral-Large, Granite-3.1-8B, ToolACE-8B, BitAgent-8B
+- **Models tested**: Llama-3.1-70B, Mistral-Large, Granite-3.1-8B, ToolACE-8B, BitAgent-8B
 - **Position testing**: Tools placed at 10%, 30%, 50%, 70%, 90% positions in the catalog
 
 **Results - Accuracy Degradation by Catalog Size:**
 
 | Model | Max Accuracy (Small Catalog) | Min Accuracy (Large Catalog) | Degradation |
 |-------|----------------------------|----------------------------|-------------|
-| GPT-4o | 94% | 76% | **-19%** |
 | Llama-3.1-70B | 95% | 20% | **-79%** |
 | Mistral-Large | 94% | 0% | **-100%** |
 | Granite-3.1-8B | 84% | 7% | **-92%** |
@@ -135,187 +134,19 @@ Based on the Berkeley Function Calling Leaderboard (BFCL) dataset, we tested too
 **Key Findings:**
 
 1. **Catastrophic degradation**: Most models experience 79-100% accuracy drops as tool count increases from ~50 to ~740 tools
-2. **GPT-4o resilience**: Only GPT-4o maintains reasonable performance (76% accuracy) with 741 tools, but still shows 19% degradation
-3. **Prompt bloat**: At 120K tokens, average prompt size was 127,315 tokens with all tools loaded
+2. **Prompt bloat**: At 120K tokens, average prompt size was 127,315 tokens with all tools loaded
 
 **Position Bias Analysis:**
 
 Testing with 741 tools at different positions revealed severe "lost in the middle" effects:
 
-| Tool Position | Granite-3.1-8B | Llama-3.1-70B | BitAgent-8B | GPT-4o |
-|--------------|---------------|--------------|------------|--------|
-| Beginning (10%) | 18% | 32% | 57% | 83% |
-| Early (30%) | 12% | 28% | 45% | 81% |
-| Middle (50%) | 8% | 22% | 24% | 74% |
-| Late (70%) | 14% | 29% | 41% | 78% |
-| End (90%) | 17% | 31% | 53% | 82% |
-
-**Impact of Semantic Selection:**
-
-With semantic retrieval selecting top-3 tools (reducing context from 120K to ~1K tokens):
-
-- **Accuracy**: Improved significantly (e.g., RAG-MCP achieved 43.13% vs 13.62% baseline)
-- **Token reduction**: 99.1% (127,315 → 1,084 tokens)
-- **Latency improvement**: Reduced processing time due to smaller context
-- **Position bias**: Mitigated through relevance-based selection rather than position-dependent retrieval
-
-### Experiment 2: RAG-MCP Benchmark Comparison
-
-**Setup:**
-
-Based on the MCPBench web search evaluation:
-
-- **Dataset**: 20 web search tasks requiring tool selection
-- **Baseline approaches**:
-  - **Blank Conditioning**: Load all N MCP schemas into prompt
-  - **Actual Match**: Keyword-based pre-filtering
-  - **RAG-MCP**: Semantic retrieval (our approach)
-- **Models tested**: Qwen-max-0125 as base LLM
-- **Evaluation**: Automated with Deepseek-v3 as judge
-
-**Results:**
-
-| Method | Accuracy | Avg Prompt Tokens | Avg Completion Tokens |
-|--------|----------|------------------|---------------------|
-| Blank Conditioning | 13.62% | 2,134 | 162 |
-| Actual Match (Keyword) | 18.20% | 1,646 | 24 |
-| **RAG-MCP (Semantic)** | **43.13%** | **1,084** | **78** |
-
-**Key Findings:**
-
-1. **3.2x accuracy improvement**: Semantic retrieval (43.13%) vs naive loading (13.62%)
-2. **49% token reduction**: 2,134 → 1,084 prompt tokens
-3. **Keyword matching insufficient**: Simple keyword matching (18.20%) performs poorly compared to semantic similarity
-4. **Scalability**: Performance maintained even as MCP registry grew to 4,400+ servers
-
-**Stress Test Results:**
-
-Testing with varying numbers of distractor MCPs (1 to 100+):
-
-| MCP Count | Context Tokens | Success Rate | Notes |
-|-----------|---------------|-------------|-------|
-| 1-10 | 8,192 | 90%+ | High accuracy with small catalogs |
-| 11-30 | 16,384 | 75-85% | Beginning of degradation |
-| 31-70 | 32,768 | 50-70% | Significant accuracy drop |
-| 71-100 | 65,536 | 20-45% | Severe degradation |
-| 100+ | 120,000 | 13-15% | Near-random selection |
-
-The stress test confirms non-linear degradation: accuracy doesn't decline gradually but falls off a cliff beyond ~50 tools.
-
-### Experiment 3: Model Comparison Across Catalog Sizes
-
-**Setup:**
-
-Comprehensive evaluation across multiple state-of-the-art models using BFCL dataset:
-
-- **Models**: 9 models including GPT-4o, Llama-3.1 (8B/70B), Mistral-Large, DeepSeek-R1-Distill-Qwen-32B, QwQ-32B, Granite-3.1-8B, ToolACE-8B, BitAgent-8B
-- **Context lengths**: 8K, 16K, 32K, 65K, 120K tokens
-- **Tool counts**: 49, 102, 207, 417, 741 tools respectively
-- **Metric**: AST (Abstract Syntax Tree) accuracy for function call correctness
-
-**Results - Performance by Model Family:**
-
-| Model | 8K Context | 32K Context | 120K Context | Total Degradation |
-|-------|-----------|------------|-------------|------------------|
-| GPT-4o | 94% | 83% | 76% | **-19%** |
-| QwQ-32B | 92% | 78% | 47% | **-49%** |
-| DeepSeek-R1-32B | 89% | 45% | 1% | **-99%** |
-| Llama-3.1-70B | 95% | 64% | 20% | **-79%** |
-| Llama-3.1-8B | 86% | 24% | 7% | **-92%** |
-| Mistral-Large | 94% | 53% | 0% | **-100%** |
-| Granite-3.1-8B | 84% | 41% | 7% | **-92%** |
-| ToolACE-8B | 94% | 71% | 25% | **-73%** |
-| BitAgent-8B | 95% | 57% | 10% | **-89%** |
-
-**Key Findings:**
-
-1. **Only GPT-4o maintains usability**: At 120K context (741 tools), only GPT-4o achieves >70% accuracy
-2. **Open-source models collapse**: Most open-source models drop below 25% accuracy with large catalogs
-3. **Size matters but isn't everything**: 32B models (QwQ, DeepSeek-R1) show mixed results - architecture and training matter more than parameter count
-4. **Specialized models help**: ToolACE-8B and BitAgent-8B (fine-tuned for tool calling) outperform general-purpose models of similar size
-
-**Implications:**
-
-Without semantic selection, deploying tool-calling systems with 500+ tools is:
-
-- **Impractical for open-source models**: 85-100% accuracy loss
-- **Expensive for GPT-4o**: Requires 120K+ context windows at $10-30 per 1M tokens
-- **Unreliable even with best models**: 19% degradation is unacceptable for production
-
-### Experiment 4: Long Tool Responses & Multi-Turn Conversations
-
-While semantic selection solves the tool catalog problem, research has identified two additional long-context challenges in tool calling:
-
-**Challenge A: Long Tool Responses**
-
-**Setup:**
-
-- **Dataset**: 566 QA samples from ComplexFuncBench using real booking.com REST APIs
-- **APIs tested**: Flight search, hotel booking, car rental, attractions, seat maps
-- **Question types**: Extraction (finding values), Filtering (matching criteria), Aggregation (computing sums/averages)
-- **Response sizes**: 10K, 20K, 40K, 80K tokens
-- **Models**: GPT-4o, Llama-3.1-70B, Llama-3.1-8B, Mistral-Large, Granite-3.1-8B, ToolACE-8B, BitAgent-8B
-
-**Results - Tool Response QA Accuracy:**
-
-| Response Size | GPT-4o | Llama-3.1-70B | Llama-3.1-8B | Mistral-Large | Granite-3.1-8B |
-|--------------|--------|--------------|-------------|--------------|---------------|
-| 10K tokens | 74% | 68% | 21% | 72% | 57% |
-| 20K tokens | 71% | 61% | 20% | 58% | 54% |
-| 40K tokens | 68% | 52% | 23% | 34% | 47% |
-| 80K tokens | 67% | 47% | 26% | 18% | 39% |
-| **Degradation** | **-9.5%** | **-30.9%** | **+23.8%*** | **-75.0%** | **-31.6%** |
-
-*Llama-3.1-8B showed anomalous behavior at short contexts (generating unnecessary JSON)
-
-**Performance by Question Type (80K tokens):**
-
-| Question Type | GPT-4o | Llama-3.1-70B | Mistral-Large |
-|--------------|--------|--------------|--------------|
-| Extraction | 78% | 62% | 28% |
-| Filtering | 64% | 41% | 15% |
-| Aggregation | 58% | 38% | 11% |
-
-**Position Bias in Responses:**
-
-Testing answer position within 80K token responses (position 1 = beginning, position 8 = end):
-
-| Model | Position 1 | Position 4 | Position 8 | Recency Bias |
-|-------|-----------|-----------|-----------|-------------|
-| GPT-4o | 61% | 64% | 65% | +6.6% |
-| Llama-3.1-70B | 35% | 41% | 43% | +22.9% |
-| Mistral-Large | 6% | 18% | 26% | +333% |
-
-**Challenge B: Long Multi-Turn Conversations**
-
-**Setup:**
-
-- **Dataset**: 200 samples per token limit from ComplexFuncBench
-- **Conversation lengths**: 10K, 20K, 40K, 80K tokens
-- **Two scenarios**:
-  - **Structured**: Information from earlier tool call (JSON) needed for later call
-  - **Unstructured**: Information from earlier user message (text) needed for later call
-- **Models**: Llama-3.1-70B, Llama-3.1-8B, DeepSeek-R1-Distill-Qwen-32B, Granite-3.1-8B, ToolACE-8B, BitAgent-8B
-
-**Results - Multi-Turn AST Accuracy:**
-
-| Model | Structured 10K | Structured 80K | Degradation | Unstructured 10K | Unstructured 80K | Degradation |
-|-------|---------------|---------------|-------------|-----------------|-----------------|-------------|
-| Llama-3.1-70B | 38% | 19% | **-50%** | 96% | 99% | **+3%** |
-| DeepSeek-R1-32B | 23% | 2% | **-91%** | 40% | 10% | **-75%** |
-| Llama-3.1-8B | 22% | 10% | **-55%** | 10% | 24% | **+140%*** |
-| Granite-3.1-8B | 82% | 4% | **-95%** | 15% | 5% | **-67%** |
-| ToolACE-8B | 91% | 29% | **-68%** | 24% | 10% | **-58%** |
-| BitAgent-8B | 87% | 32% | **-63%** | 49% | 28% | **-43%** |
-
-*Anomalous improvement likely due to model-specific behavior changes
-
-**Key Insights:**
-
-1. **Structured context is harder**: 50-95% degradation for JSON tool responses vs 43-75% for text
-2. **Model size helps for text**: Llama-3.1-70B maintains 99% accuracy on unstructured context at 80K tokens
-3. **Specialized models excel at structured**: ToolACE-8B and BitAgent-8B start at 87-91% for structured context (vs 22-38% for general models)
-4. **Catastrophic failures common**: Multiple models drop below 10% accuracy at 80K tokens
+| Tool Position | Granite-3.1-8B | Llama-3.1-70B | BitAgent-8B |
+|--------------|---------------|--------------|------------|
+| Beginning (10%) | 18% | 32% | 57% |
+| Early (30%) | 12% | 28% | 45% |
+| Middle (50%) | 8% | 22% | 24% |
+| Late (70%) | 14% | 29% | 41% |
+| End (90%) | 17% | 31% | 53% |
 
 **Implications for vLLM Semantic Router:**
 
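The "Impact of Semantic Selection" block removed above boils down to embedding-based top-k retrieval over tool descriptions. A minimal sketch of that idea in Python, assuming a sentence-transformers embedding model and a hypothetical three-tool catalog (an illustration, not the router's actual implementation):

```python
# Illustrative sketch of embedding-based top-k tool selection.
# Assumes the sentence-transformers package; the model name and tool catalog
# below are placeholders, not what vLLM Semantic Router actually ships with.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical tool catalog: name -> natural-language description.
tools = {
    "search_flights": "Search for airline flights between two cities on given dates.",
    "book_hotel": "Reserve a hotel room given a city, check-in and check-out dates.",
    "get_weather": "Return the current weather forecast for a location.",
    # ... hundreds more in a real catalog
}

names = list(tools)
# Embed tool descriptions once, offline; normalized vectors make dot product = cosine similarity.
tool_vecs = model.encode([tools[n] for n in names], normalize_embeddings=True)

def select_tools(query: str, k: int = 3) -> list[str]:
    """Return the k tool names whose descriptions are most similar to the query."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = tool_vecs @ q            # cosine similarities against every tool
    top = np.argsort(-scores)[:k]     # indices of the k best matches
    return [names[i] for i in top]

# Only the selected tools' schemas (~1K tokens) go into the LLM prompt,
# instead of the full 741-tool catalog (~127K tokens).
print(select_tools("Find me a flight from Boston to Tokyo next Friday"))
```

In practice the catalog embeddings are computed once and cached; only the query is embedded per request, which is where the latency and token savings described above come from.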
@@ -326,36 +157,6 @@ These findings reinforce why semantic selection is critical:
 3. **Complementary to other optimizations**: Semantic selection works alongside response parsing, context compression, and conversation management
 4. **Enables longer conversations**: Saving 99.1% of context on tool definitions (127,315 → 1,084 tokens) allows significantly more room for conversation history or tool responses
 
-### Cross-Cutting Analysis
-
-**Accuracy vs. Tool Count (Research Data):**
-
-```mermaid
-graph LR
-    A[49 tools<br/>8K tokens] -->|94% accuracy| B[102 tools<br/>16K tokens]
-    B -->|83% accuracy| C[207 tools<br/>32K tokens]
-    C -->|64% accuracy| D[417 tools<br/>65K tokens]
-    D -->|20% accuracy| E[741 tools<br/>120K tokens]
-
-    style A fill:#4CAF50
-    style B fill:#8BC34A
-    style C fill:#FF9800
-    style D fill:#FF5722
-    style E fill:#F44336
-```
-
-**Insight:** Accuracy collapses non-linearly. The drop from 207 to 417 tools (64% → 20%) is catastrophic.
-
-**Token Reduction by Approach:**
-
-| Approach | Prompt Tokens | Reduction vs. Baseline |
-|----------|--------------|----------------------|
-| Load All Tools (741 tools) | 127,315 | Baseline |
-| Keyword Matching | 1,646 | 98.7% |
-| **Semantic Selection (top-3)** | **1,084** | **99.1%** |
-
-**Insight:** Semantic selection achieves the best token reduction while maintaining highest accuracy (43.13% vs 18.20% for keyword matching).
-
 ## Benefits of Semantic Tool Selection
 
 ### 1. Restores Usability at Scale
@@ -381,7 +182,7 @@ Research shows that without semantic selection, tool-calling systems become **un
 - **Semantic Selection**: 1,084 tokens per request
 - **Reduction**: 99.1% (117x fewer tokens)
 
-**Cost Impact (GPT-4o pricing at $2.50/$10 per 1M input/output tokens):**
+**Cost Impact (based on typical LLM pricing at $2.50/$10 per 1M input/output tokens):**
 
 | Volume | Without Selection | With Selection | Annual Savings |
 |--------|------------------|---------------|----------------|
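The savings rows of the cost table are cut off by this hunk, but the arithmetic behind such a comparison is simple. A rough sketch using the token counts quoted above and the $2.50 per 1M input-token price, with hypothetical request volumes (illustrative only, not the blog's actual figures):

```python
# Back-of-the-envelope prompt-cost comparison (illustration only).
PRICE_PER_INPUT_TOKEN = 2.50 / 1_000_000   # $2.50 per 1M input tokens, as quoted above

tokens_all_tools = 127_315   # every tool schema loaded into the prompt
tokens_selected  = 1_084     # top-3 tools chosen by semantic selection

def annual_prompt_cost(tokens_per_request: int, requests_per_day: int) -> float:
    """Annual input-token spend for a given per-request prompt size."""
    return tokens_per_request * PRICE_PER_INPUT_TOKEN * requests_per_day * 365

for rpd in (10_000, 100_000, 1_000_000):   # hypothetical traffic levels
    without = annual_prompt_cost(tokens_all_tools, rpd)
    with_sel = annual_prompt_cost(tokens_selected, rpd)
    print(f"{rpd:>9,} req/day: ${without:,.0f} -> ${with_sel:,.0f} "
          f"(saves ${without - with_sel:,.0f}/yr)")
```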
@@ -400,19 +201,6 @@ Research documents severe "lost in the middle" effects. Semantic selection elimi
 
 **With Semantic Selection**: 94% accuracy regardless of original position
 
-### 4. Enables Open-Source Models
-
-Without semantic selection, only GPT-4o remains usable with large tool catalogs:
-
-| Model | 741 Tools (No Selection) | With Selection | Viability |
-|-------|------------------------|---------------|-----------|
-| GPT-4o | 76% | 94% | ✅ Usable both ways |
-| Llama-3.1-70B | 20% | 94% | ❌ → ✅ Enabled |
-| Mistral-Large | 0% | 94% | ❌ → ✅ Enabled |
-| Granite-3.1-8B | 7% | 94% | ❌ → ✅ Enabled |
-
-**Impact**: Organizations can use cost-effective open-source models instead of expensive API calls.
-
 ### 5. Scalability Beyond Current Limits
 
 The MCP ecosystem already has 4,400+ servers. Research shows:
@@ -560,14 +348,6 @@ For long conversations with many tool calls:
 - **Selective history**: Include only relevant past tool calls in context
 - **State management**: Track conversation state separately from full history
 
-### Validation and Testing
-
-As tool catalogs grow, validation becomes critical:
-
-- **Automated testing**: Generate synthetic queries to test tool coverage
-- **Retrieval quality metrics**: Monitor precision/recall of tool selection
-- **A/B testing**: Compare different similarity thresholds and top-K values
-
 ## Conclusion
 
 Anthropic's blog on code execution with MCP highlighted a fundamental challenge: **agents need efficient ways to discover and use tools at scale**. Their solution—progressive disclosure through code execution—is elegant and powerful.
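The "Retrieval quality metrics" bullet being removed in the hunk above refers to standard retrieval evaluation. A minimal, self-contained sketch of recall@k for tool selection over toy data (hypothetical, not project code):

```python
# Hypothetical recall@k evaluation for tool retrieval (not project code).
def recall_at_k(retrieved: list[str], expected: str, k: int = 3) -> bool:
    """True if the expected tool appears in the top-k retrieved list."""
    return expected in retrieved[:k]

# Each entry: (tools returned by the retriever, tool the query actually needs).
eval_results = [
    (["search_flights", "book_hotel", "get_weather"], "search_flights"),   # hit
    (["book_hotel", "search_flights", "rent_car"],    "get_weather"),      # miss
]

hits = sum(recall_at_k(retrieved, expected) for retrieved, expected in eval_results)
print(f"recall@3 = {hits / len(eval_results):.2f}")   # 0.50 on this toy data
```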
@@ -588,9 +368,9 @@ The two approaches are not mutually exclusive—in fact, they work beautifully t
 
 As AI agents become more capable and connect to more tools, intelligent tool management becomes critical. Whether through semantic selection, code execution, or a combination of both, the future of AI agents lies in **smart, context-aware tool discovery** that scales efficiently.
 
-## Try It Yourself
+## Give it a Try
 
-The vLLM Semantic Router is open source and production-ready:
+The vLLM Semantic Router is open source:
 
 - **GitHub:** [github.com/vllm-project/semantic-router](https://github.com/vllm-project/semantic-router)
 - **Documentation:** [vllm-semantic-router.com](https://vllm-semantic-router.com)
