Recent academic studies have measured the impact of large tool catalogs on LLM performance:
- With ~50 tools (8K tokens): Most models maintain 84-95% accuracy
- With ~200 tools (32K tokens): Accuracy ranges from 41-83% depending on model
- With ~740 tools (120K tokens): Accuracy drops to 0-20% for most models
Different models show varying degrees of degradation, with open-source models showing 79-100% degradation when scaling from small to large tool catalogs.
**The "Lost in the Middle" Effect:** Research has documented position bias where tools in the middle of long lists are less likely to be selected correctly. For example, with 741 tools, middle positions (40-60%) showed 22-52% accuracy compared to 31-32% at the beginning/end positions for some models.
Recent academic research has quantified the severity of this problem.
One study testing tool selection with increasing catalog sizes found that baseline accuracy dropped from 78% with 10 tools to just 13.62% with 100+ tools, a catastrophic 82% relative degradation. This "needle in a haystack" problem for tool selection motivated our semantic approach.
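The core idea behind semantic selection can be sketched as an embedding-similarity ranking over tool descriptions: embed the query, embed each tool description, and expose only the top-K matches to the model. The sketch below is a minimal illustration with a toy bag-of-words "embedding" (a real system would use a dense embedding model); the tool names and descriptions are made up for the example.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; real systems use dense model embeddings.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_tools(query: str, tools: dict[str, str], top_k: int = 3) -> list[str]:
    # Rank tools by similarity between the query and each tool description,
    # then expose only the top-K to the model instead of the full catalog.
    q = embed(query)
    ranked = sorted(tools, key=lambda name: cosine(q, embed(tools[name])), reverse=True)
    return ranked[:top_k]

tools = {
    "get_weather": "get the current weather forecast for a city",
    "send_email": "send an email message to a recipient",
    "query_database": "run a sql query against the database",
    "book_flight": "book a flight ticket between two airports",
}
print(select_tools("what is the weather in Paris today", tools, top_k=2))
```

Because the model only ever sees the K selected definitions, prompt size stays roughly constant no matter how large the underlying catalog grows.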
### Large Tool Catalog Stress Test
**Setup:**
Based on the Berkeley Function Calling Leaderboard (BFCL) dataset, we tested tool selection performance as catalog size grows:
- **Dataset**: 858 function-calling samples (simple, live_simple, multiple subsets)
- **Tool catalog sizes**: Varied from 49 tools (8K tokens) to 741 tools (120K tokens)
These findings reinforce why semantic selection is critical:
3. **Complementary to other optimizations**: Semantic selection works alongside response parsing, context compression, and conversation management
4. **Enables longer conversations**: Saving 99.1% of context on tool definitions (127,315 → 1,084 tokens) allows significantly more room for conversation history or tool responses
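The quoted savings figure follows directly from the two token counts; as a quick arithmetic check:

```python
full_catalog_tokens = 127_315  # all tool definitions kept in context
selected_tokens = 1_084        # only the semantically selected tools

# Fraction of tool-definition context eliminated by semantic selection.
savings = 1 - selected_tokens / full_catalog_tokens
print(f"{savings:.1%}")  # 99.1%
```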
**Impact**: Organizations can use cost-effective open-source models instead of expensive API calls.
### 5. Scalability Beyond Current Limits
The MCP ecosystem already has 4,400+ servers. Research shows:
For long conversations with many tool calls:
- **Selective history**: Include only relevant past tool calls in context
- **State management**: Track conversation state separately from the full history
### Validation and Testing
As tool catalogs grow, validation becomes critical:
- **Automated testing**: Generate synthetic queries to test tool coverage
- **Retrieval quality metrics**: Monitor precision/recall of tool selection
- **A/B testing**: Compare different similarity thresholds and top-K values
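Retrieval quality for tool selection can be tracked with standard precision/recall over the selected set. A minimal sketch (the helper below is hypothetical, not part of the router's API):

```python
def precision_recall(selected: set[str], relevant: set[str]) -> tuple[float, float]:
    # Precision: fraction of selected tools that were actually needed.
    # Recall: fraction of needed tools that the selector surfaced.
    hits = len(selected & relevant)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

p, r = precision_recall(
    selected={"get_weather", "send_email", "book_flight"},
    relevant={"get_weather", "book_flight"},
)
print(p, r)  # one extra tool selected: precision 2/3, recall 1.0
```

Raising the similarity threshold or lowering top-K typically trades recall for precision, which is exactly what the A/B tests above should measure.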
## Conclusion
Anthropic's blog on code execution with MCP highlighted a fundamental challenge: **agents need efficient ways to discover and use tools at scale**. Their solution—progressive disclosure through code execution—is elegant and powerful.
The two approaches are not mutually exclusive—in fact, they work beautifully together.
As AI agents become more capable and connect to more tools, intelligent tool management becomes critical. Whether through semantic selection, code execution, or a combination of both, the future of AI agents lies in **smart, context-aware tool discovery** that scales efficiently.
## Give it a Try
The vLLM Semantic Router is open source and production-ready: