Add blog post: Solving MCP Tool Overload #9695

Draft

digitarald wants to merge 1 commit into main from users/digitarald/blog-mcp-tool-overload-v2

Conversation

@digitarald
Contributor

Engineering blog post on how VS Code handles MCP tool overload with progressive tool discovery.

  • Explains the core/deferred tool split (~30 always-available, 170+ deferred with defer_loading)
  • Client-side embedding search replacing Anthropic server-side regex/BM25
  • Two-layer embedding cache (pre-computed + local persistent LRU)
  • Prompt caching optimization via stable tool prefix
  • Covers Anthropic tool_reference and OpenAI tool_search integration
  • Co-authored by Bhavya U, Connor Peet, Harald Kirschner

TODOs before publish:

  • Replace ASCII art diagrams with designed visuals
  • Create tool-discovery-hero.png social image
  • Verify MCP spec URL anchor

Collaborator

@ntrogh left a comment


@digitarald Nice blog post! Left a few suggestions.


April 24, 2026 by [Bhavya U](https://github.com/bhavyaus), [Connor Peet](https://github.com/connor4312), [Harald Kirschner](https://github.com/digitarald)

We just got back from [MCP Dev Summit](https://events.linuxfoundation.org/mcp-dev-summit-north-america/) in New York City, where we presented on [MCP Apps support in VS Code](https://code.visualstudio.com/blogs/2026/01/26/mcp-apps-support). One theme came up in nearly every session: too many tools. As the MCP ecosystem grows, servers and extensions each bring their own capabilities, and it adds up fast. Install a few [MCP servers](https://code.visualstudio.com/docs/copilot/customization/mcp-servers), enable some extensions, and your agent session suddenly has access to over 200 tools. Including all of their schemas in every request adds over 12,000 tokens on top of the base prompt, pushing a single turn past 18,000 tokens before you've even said "hello."

Suggested change
We just got back from [MCP Dev Summit](https://events.linuxfoundation.org/mcp-dev-summit-north-america/) in New York City, where we presented on [MCP Apps support in VS Code](https://code.visualstudio.com/blogs/2026/01/26/mcp-apps-support). One theme came up in nearly every session: too many tools. As the MCP ecosystem grows, servers and extensions each bring their own capabilities, and it adds up fast. Install a few [MCP servers](https://code.visualstudio.com/docs/copilot/customization/mcp-servers), enable some extensions, and your agent session suddenly has access to over 200 tools. Including all of their schemas in every request adds over 12,000 tokens on top of the base prompt, pushing a single turn past 18,000 tokens before you've even said "hello."
We just got back from [MCP Dev Summit](https://events.linuxfoundation.org/mcp-dev-summit-north-america/) in New York City, where we presented on [MCP Apps support in VS Code](https://code.visualstudio.com/blogs/2026/01/26/mcp-apps-support). One theme that consistently came up was that **there are too many tools**. As the MCP ecosystem grows, servers and extensions each bring their own capabilities, which adds up quickly. Install a few [MCP servers](https://code.visualstudio.com/docs/copilot/customization/mcp-servers), enable some extensions, and your agent session suddenly has access to over 200 tools. Including all of their schemas in every request adds over 12,000 tokens on top of the base prompt, pushing a single turn past 18,000 tokens before you've even said "hello."



That creates real problems. The model gets slower. It picks the wrong tool more often. And because the tool list changes between sessions (different MCP servers, different extensions), [prompt caching](https://learn.microsoft.com/en-us/azure/foundry/openai/how-to/prompt-caching), one of the biggest performance wins in production LLM systems, becomes unreliable.

Suggested change
That creates real problems. The model gets slower. It picks the wrong tool more often. And because the tool list changes between sessions (different MCP servers, different extensions), [prompt caching](https://learn.microsoft.com/en-us/azure/foundry/openai/how-to/prompt-caching), one of the biggest performance wins in production LLM systems, becomes unreliable.
That creates real problems: the **model gets slower and it picks the wrong tool more often**. And because the tool list changes between sessions (different MCP servers, different extensions), [prompt caching](https://learn.microsoft.com/en-us/azure/foundry/openai/how-to/prompt-caching), one of the biggest performance wins in production LLM systems, becomes unreliable.



We believe strongly that **the client should handle this**, not the user. Developers shouldn't have to think about which tools to enable, worry about token budgets, or pay penalties in performance and cost just because their ecosystem is rich. The hard parts of agentic optimization, including how tools are provided to the model, should be invisible. This post walks through how we built that in VS Code: progressive tool discovery.

Suggested change
We believe strongly that **the client should handle this**, not the user. Developers shouldn't have to think about which tools to enable, worry about token budgets, or pay penalties in performance and cost just because their ecosystem is rich. The hard parts of agentic optimization, including how tools are provided to the model, should be invisible. This post walks through how we built that in VS Code: progressive tool discovery.
We strongly believe that **the client should handle tool management**, not the user. Developers shouldn't have to think about which tools to enable, worry about token budgets, or pay penalties in performance and cost just because their ecosystem is rich. The hard parts of agentic optimization, including how tools are provided to the model, should be invisible. This post walks through how we built that into VS Code: **progressive tool discovery**.


## A core toolkit and a search index

When we looked at telemetry, a clear pattern emerged. On any given turn, the agent uses two to five tools. The same ~30 tools cover nearly 88% of all invocations: reading files, editing code, searching the codebase, running terminal commands. The remaining 170+ tools are specialized: open a browser page, run a notebook cell, create a GitHub pull request, drag an element on a page. Important, but not on every turn.

Suggested change
When we looked at telemetry, a clear pattern emerged. On any given turn, the agent uses two to five tools. The same ~30 tools cover nearly 88% of all invocations: reading files, editing code, searching the codebase, running terminal commands. The remaining 170+ tools are specialized: open a browser page, run a notebook cell, create a GitHub pull request, drag an element on a page. Important, but not on every turn.
When we looked at telemetry, a clear pattern emerged. On any given turn, the agent typically uses two to five tools. And the same ~30 tools cover nearly 88% of all invocations: reading files, editing code, searching the codebase, running terminal commands. The remaining 170+ tools are only used for specialized scenarios: open a browser page, run a notebook cell, create a GitHub pull request, drag an element on a page. These tools are important, but not on every turn.



This suggested a natural split. Keep a **core set** of tools always available, and make everything else **discoverable on demand**.

Suggested change
This suggested a natural split. Keep a **core set** of tools always available, and make everything else **discoverable on demand**.
The pattern pointed to a natural split. Keep a **core set** of tools always available, and make everything else **discoverable on demand**.


Here's where our approach diverges from what you might expect. Anthropic's API offers a built-in server-side tool search. You can include `tool_search_tool_regex` or `tool_search_tool_bm25` in your tools array, and the server handles search automatically using regex pattern matching or BM25 text search.
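
To make that concrete, here's a rough sketch of what such a tools array can look like, mixing the built-in search tool, a core tool with its full schema, and a deferred tool. The exact type strings and the `defer_loading` flag follow the pattern described in this post; treat the field names as illustrative rather than a verified API contract.

```typescript
// Illustrative Anthropic tools array for server-side tool search.
// Field names and type strings are assumptions based on the pattern above.
const tools = [
  // Built-in server-side search tool (regex variant; BM25 is the alternative).
  { type: "tool_search_tool_regex", name: "tool_search_tool_regex" },

  // Core tool: full schema included in every request.
  {
    name: "read_file",
    description: "Read a file from the workspace",
    input_schema: {
      type: "object",
      properties: { path: { type: "string" } },
      required: ["path"],
    },
  },

  // Deferred tool: not expanded into the prompt until search surfaces it.
  {
    name: "create_pull_request",
    description: "Create a GitHub pull request",
    input_schema: { type: "object", properties: {} },
    defer_loading: true,
  },
];
```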

We actually shipped that first. In late 2025, we rolled out Anthropic's server-side search behind a feature flag, tested it via A/B experiment, and enabled it for Opus 4.5. It worked, but we ran into three issues:

Suggested change
We actually shipped that first. In late 2025, we rolled out Anthropic's server-side search behind a feature flag, tested it via A/B experiment, and enabled it for Opus 4.5. It worked, but we ran into three issues:
Initially, we used this approach. In late 2025, we rolled out Anthropic's server-side search behind a feature flag, tested it via A/B experiment, and enabled it for Opus 4.5. It worked, but we ran into three issues:

2. **Dynamic tool discovery.** MCP servers support [dynamic tool discovery](https://modelcontextprotocol.io/specification/2025-06-18/server/tools#tool-discovery), where the set of available tools can change mid-session. Server-side search has no awareness of these changes, but a client-side implementation can re-index tools as they appear (see the sketch after this list).
3. **Enterprise compliance.** Anthropic's server-side tool search was not Zero Data Retention compliant, a blocker for enterprise customers.
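
On the second point, a minimal sketch of what client-side re-indexing can look like when a server changes its tool list mid-session. The notification hook and embedding function here are hypothetical stand-ins for the client's MCP plumbing and embedding cache, not actual VS Code APIs.

```typescript
// Sketch: rebuild the embedding index when an MCP server's tool list changes.
// `onToolListChanged` and `embed` are hypothetical hooks for illustration.
type McpTool = { name: string; description: string };
type Embed = (text: string) => Promise<number[]>;

function watchToolChanges(
  onToolListChanged: (handler: (tools: McpTool[]) => Promise<void>) => void,
  embed: Embed,
  index: Map<string, number[]>
): void {
  // A tools/list_changed notification means tools were added or removed;
  // re-embed the new list so the tools become searchable immediately.
  onToolListChanged(async tools => {
    index.clear();
    for (const tool of tools) {
      index.set(tool.name, await embed(`${tool.name}: ${tool.description}`));
    }
  });
}
```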

So we pivoted. We replaced server-side search with a **client-side implementation** running entirely within VS Code, using **embedding-based semantic similarity**. When the model calls `tool_search`, VS Code computes an embedding of the query, compares it against pre-computed embeddings of every tool's name and description, and returns the closest matches by cosine similarity.

Suggested change
So we pivoted. We replaced server-side search with a **client-side implementation** running entirely within VS Code, using **embedding-based semantic similarity**. When the model calls `tool_search`, VS Code computes an embedding of the query, compares it against pre-computed embeddings of every tool's name and description, and returns the closest matches by cosine similarity.
So we pivoted. We replaced server-side search with a **client-side implementation** that runs entirely within VS Code and that uses **embedding-based semantic similarity**. When the model calls `tool_search`, VS Code computes an embedding of the query, compares it against pre-computed embeddings of every tool's name and description, and returns the closest matches by cosine similarity.
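
A minimal sketch of that search step: embed the query, rank every tool's pre-computed embedding by cosine similarity, and return the top matches. `embed` and `ToolEmbedding` are illustrative names standing in for the client's embedding provider and cache, not VS Code APIs.

```typescript
// Rank tools by cosine similarity between the query embedding and each
// tool's pre-computed embedding, returning the closest matches.
interface ToolEmbedding {
  name: string;
  vector: number[]; // pre-computed embedding of "name: description"
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA * normB) || 1);
}

async function searchTools(
  query: string,
  index: ToolEmbedding[],
  embed: (text: string) => Promise<number[]>,
  topK = 10
): Promise<string[]> {
  const queryVector = await embed(query);
  return index
    .map(tool => ({ name: tool.name, score: cosineSimilarity(queryVector, tool.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map(result => result.name);
}
```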


Embeddings beat an LLM at its own game, with a 24 percentage point improvement over defaults. And the client-side implementation needed fewer search invocations than the server-side regex approach to find the right tools in our evals.

The key insight: Anthropic's API doesn't care *who* does the search. It supports a "custom tool search" pattern where the client implements its own search tool and returns `tool_reference` content blocks, the same format the server-side search would produce. The API handles the rest, expanding those references into full tool schemas for the model. We get the best of both worlds: Anthropic's native deferred-tool infrastructure with our own search quality.

Suggested change
The key insight: Anthropic's API doesn't care *who* does the search. It supports a "custom tool search" pattern where the client implements its own search tool and returns `tool_reference` content blocks, the same format the server-side search would produce. The API handles the rest, expanding those references into full tool schemas for the model. We get the best of both worlds: Anthropic's native deferred-tool infrastructure with our own search quality.
The key insight was that Anthropic's API doesn't care *who* does the search. It supports a "custom tool search" pattern where the client implements its own search tool and returns `tool_reference` content blocks, the same format the server-side search would produce. The API handles the rest, expanding those references into full tool schemas for the model. As a result, we get the best of both worlds: Anthropic's native deferred-tool infrastructure with our own search quality.
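
For illustration, returning those references from the client-implemented search tool might look roughly like this. The field names follow the pattern described above and may differ from the current API; treat them as assumptions.

```typescript
// Sketch: wrap the client-side search matches as tool_reference content
// blocks, the same shape the server-side search would produce.
// Field names here are our reading of the pattern, not a verified contract.
function buildSearchToolResult(searchCallId: string, matchedToolNames: string[]) {
  return {
    type: "tool_result",
    tool_use_id: searchCallId, // id of the model's tool_search call
    content: matchedToolNames.map(name => ({
      type: "tool_reference",
      tool_name: name, // the API expands this into the full tool schema
    })),
  };
}
```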


## Teaching the model the protocol

Progressive discovery only works if the model knows the rules. We invest significant prompt engineering in making this seamless:

Suggested change
Progressive discovery only works if the model knows the rules. We invest significant prompt engineering in making this seamless:
Progressive discovery only works if the model knows the rules. We invest significantly in prompt engineering to make this seamless for you:

- We explicitly warn against anti-patterns: don't call a deferred tool without searching first, don't re-search for tools already found.
- A reminder at the end of the prompt reinforces the protocol. Models follow instructions well, but critical behaviors deserve repetition.

We also learned the hard way that some tools can't be deferred. Early on, we deferred `view_image`. The model saw the name in a flat list but never its description, and it simply didn't think to search for an image-viewing tool when encountering `.png` files. Benchmark failures led us to a simple rule: tools that represent core capabilities the model might not think to look for need to stay always-available. Telemetry, not guesswork, drives which tools make the cut.

Suggested change
We also learned the hard way that some tools can't be deferred. Early on, we deferred `view_image`. The model saw the name in a flat list but never its description, and it simply didn't think to search for an image-viewing tool when encountering `.png` files. Benchmark failures led us to a simple rule: tools that represent core capabilities the model might not think to look for need to stay always-available. Telemetry, not guesswork, drives which tools make the cut.
We also learned the hard way that some tools can't be deferred. Early on, we deferred `view_image`, so the model saw only its name in a flat list and never its description. The model simply didn't think to search for an image-viewing tool when it encountered `.png` files. Benchmark failures led us to a simple rule: tools that represent core capabilities the model might not think to look for need to stay always-available. Telemetry, not guesswork, drives which tools make the cut.
