RAG-MCP: Mitigating Prompt Bloat in LLM
Tool Selection via Retrieval-Augmented
Generation
Tiantian Gan1,2 and Qiyao Sun1,2
1 Beijing University of Post and Communications, Beijing, China
2 Queen Mary University of London, London, UK
jp2022213034@qmul.ac.uk, jp2022213402@qmul.ac.uk
Abstract. Large language models (LLMs) struggle to effectively utilize
a growing number of external tools, such as those defined by the Model
Context Protocol (MCP) [1], due to prompt bloat and selection complex-
ity. We introduce RAG-MCP, a Retrieval-Augmented Generation frame-
work that overcomes this challenge by offloading tool discovery. RAG-
MCP uses semantic retrieval to identify the most relevant MCP(s) for a
given query from an external index before engaging the LLM. Only the
selected tool descriptions are passed to the model, drastically reducing
prompt size and simplifying decision-making. Experiments, including an
MCP stress test, demonstrate that RAG-MCP significantly cuts prompt
tokens (e.g., by over 50%) and more than triples tool selection accuracy
(43.13% vs 13.62% baseline) on benchmark tasks. RAG-MCP enables
scalable and accurate tool integration for LLMs.
Keywords: Retrieval-Augmented Generation · Model Context Protocol
· Tool Selection
1 Introduction
1.1 Background and Motivation
Large Language Models (LLMs) have demonstrated remarkable capabilities in
natural dialogue, reasoning, and even code generation. However, they remain
fundamentally constrained by the knowledge encoded in their parameters and
the fixed context window available at inference time. In essence, an LLM with-
out external access is “trapped” with only its training data and cannot easily
update its knowledge or perform actions in the world [12]. To address this lim-
itation, recent efforts have focused on augmenting LLMs with external tools
and function-calling abilities [3]. By invoking tools (e.g., web search, databases,
calculators) via defined functions or APIs, an LLM can fetch up-to-date infor-
mation and execute complex operations beyond its built-in repertoire [12]. This
paradigm – often referred to as zero-shot tool use or function calling – allows
AI assistants to interface with the latest data and services, unlocking applica-
tions from real-time knowledge queries to specialized tasks in finance and travel
planning [3]. In fact, major AI providers have embraced this trend: for exam-
ple, leading LLM platforms now support plugin APIs and structured function
calls so that models like GPT-4 or Claude can invoke external services through
well-defined interfaces [12].
In the research community, a variety of approaches have been proposed to
enable and improve LLM tool use. Prompt-based strategies such as ReAct inter-
mix reasoning steps with action commands, allowing an LLM to decide when to
consult a tool in the context of a multi-turn “thought process” [15]. Model-centric
approaches have also emerged: for instance, Toolformer fine-tunes an LLM to
autonomously decide which API to call, when to call it, and how to incorporate
the result, given only a handful of demonstrations per tool [13]. Other researchers
have improved tool-use by incorporating it into training data and model tuning.
This includes blending function call demonstrations into instruction-following
datasets and exploring prompt formats that effectively describe available func-
tions to the model [3]. Such efforts have markedly enhanced zero-shot tool usage
performance. For example, fine-tuning a model on API call tasks with extensive
tool-use data can yield impressive results – the Gorilla system augmented a
7B LLaMA-based model with relevant API documentation retrieval, enabling it
to outperform even GPT-4 in generating correct API calls for a wide range of
tools [12]. An important insight from these works is that providing just-in-time
relevant context (be it through optimized prompts or retrieved documentation)
greatly boosts the accuracy of an LLM’s tool selection and use, while mecha-
nisms for the model to explicitly decide on tool use (such as special decision
tokens for “answer vs. act”) can further improve reliability [3].
Despite this progress, a new challenge arises as we scale up the number of
tools available to an LLM. Most prior studies and deployments consider a rel-
atively small set of tools or APIs, often hand-picked and easy for the model
to handle within a prompt [12]. In practice, however, the ecosystem of tools is
rapidly expanding. For instance, Anthropic’s recently introduced Model Con-
text Protocol (MCP) defines a universal, open standard for connecting AI
systems with external data sources and services. MCP enables a single assistant
to interface with many data repositories and business tools through a unified pro-
tocol, replacing fragmented one-off integrations. As a result, an advanced LLM
agent could soon have dozens of functions at its disposal – from Google Drive
and Slack connectors to GitHub, databases, maps, and more – all registered as
MCP “tools” it can call [1]. This proliferation of available tools brings significant
hurdles.
Prompt Bloat is one critical issue: providing the definitions or usage in-
structions for every possible tool in the model’s context would consume an enor-
mous number of tokens and risk overwhelming the model. It has been observed
that it is effectively impossible to describe a large collection of APIs or tools
in a single prompt as their number grows, and many APIs have overlapping
functionalities with only nuanced differences. Including too many at once not
only exhausts the context length, but can also confuse the model – the func-
tions may start to blur together. This leads directly to a second issue: decision
overhead. With a long list of tools (many of them similar in scope), the model
faces a more complex decision when choosing if and which tool to invoke. The
greater the choice, the higher the chance of error, such as selecting a suboptimal
tool or misinterpreting what a tool does. Indeed, even state-of-the-art models
can misfire in such settings: for example, in a scenario with numerous API op-
tions, GPT-4 was reported to hallucinate an API that doesn’t actually exist, and
Anthropic’s Claude picked the wrong library for the user’s request [12]. These
failure cases underscore that naively scaling up the toolset can degrade
an LLM’s performance, due to both the capacity strain on the prompt and
the ambiguity in the model’s decision process.
Fig. 1. Comparison between MCP and RAG-MCP during inference
To tackle these challenges, we propose RAG-MCP, a solution that mar-
ries Retrieval-Augmented Generation (RAG) with the Model Context Protocol
framework. The key idea of RAG-MCP is to avoid presenting all tools to the lan-
guage model at once, and instead dynamically retrieve a relevant subset of tools
based on the user’s query. In our approach, the numerous available tool descrip-
tions (MCP function schemas, usage examples, etc.) are stored in an external
memory indexed by their semantics. When a new query arrives, a dedicated re-
triever (e.g., a vector-space semantic search) first selects the top-k candidate tools
that are most likely to be useful for that query. Only these k tool descriptions are
then injected into the LLM’s prompt (or provided via the function-calling API),
greatly reducing context length and complexity. This retrieval step serves as a
form of focused context filtering, which cuts down on prompt bloat and guides
the model’s choice. The approach is analogous to how retrieval-augmented QA
systems work: rather than feed the entire Wikipedia to the model, one retrieves
only the relevant articles [6]. Here, instead of static knowledge, we retrieve ac-
tionable tool knowledge on the fly. An added benefit is extensibility – because
the tool information lives in an external index, new tools or updated APIs can
be incorporated by updating that index without retraining the LLM, ensuring
the system remains up-to-date [12]. In short, retrieval helps tame the grow-
ing toolset by providing the right tools at the right time, thereby reducing the
model’s decision burden.
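As a concrete illustration of this retrieval step, the following minimal sketch indexes a handful of tool descriptions and returns the top-k matches for a query. It uses the sentence-transformers library as a stand-in encoder, whereas our actual retriever is built on a Qwen model, and the tool names and descriptions are invented for illustration.

```python
# Minimal sketch of semantic tool retrieval: embed tool descriptions once,
# then return only the top-k candidates for each query. The encoder and the
# example tools are illustrative; the paper's retriever is Qwen-based.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

tools = {  # hypothetical MCP descriptions; real schemas are richer
    "web_search": "Search the web and return ranked result snippets.",
    "github": "Read issues, pull requests, and files from GitHub repositories.",
    "calendar": "Create, update, and query calendar events.",
}
names = list(tools.keys())
tool_vecs = encoder.encode([tools[n] for n in names], normalize_embeddings=True)

def retrieve_tools(query: str, k: int = 1) -> list[str]:
    """Return the k tool names whose descriptions best match the query."""
    q_vec = encoder.encode([query], normalize_embeddings=True)[0]
    scores = tool_vecs @ q_vec  # cosine similarity (embeddings are unit norm)
    return [names[i] for i in np.argsort(-scores)[:k]]

print(retrieve_tools("find recent news about the Model Context Protocol"))
```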
1.2 Contributions
In summary, this paper makes the following contributions:
1. RAG-MCP Framework: We introduce a novel architecture that integrates
a retrieval mechanism with LLM function calling in the MCP setting. To our
knowledge, this is one of the first frameworks to enable an LLM to handle a
large arsenal of tools by querying a tool repository for relevant options instead
of naively prompting with all tools. This design retains the flexibility of the
open MCP ecosystem while imposing structure to maintain tractability.
2. Scalable Tool Retrieval: We develop a semantic tool retrieval module that
represents each available tool’s description in a vector space and efficiently
matches user queries to the most pertinent tools. This significantly reduces
prompt size and complexity (mitigating prompt bloat) and improves decision
making by narrowing the choices. The LLM, guided by this retrieved context,
can more accurately select and use the correct external tool, even as the total
number of tools grows large. Notably, our approach allows new tools to be
added on the fly by indexing them, without requiring additional fine-tuning
of the LLM.
3. Improved Tool-Use Performance: Through comprehensive experiments,
we demonstrate that RAG-MCP effectively addresses the performance degra-
dation that occurs with naively scaling up the tool set. On a suite of tool-
augmented NLP tasks, we show that as the number of available functions
increases, a baseline LLM’s success rate in selecting and executing the cor-
rect tool drops markedly (illustrating the aforementioned challenge). How-
ever, under the RAG-MCP strategy, the model’s performance is largely re-
stored to its original level, and in some cases even exceeds the small-toolset
baseline. In particular, RAG-MCP yields substantially higher accuracy in
choosing the appropriate tool and reduces errors such as hallucinated or
mis-parameterized function calls. These results underscore the efficacy of us-
ing retrieval to scale up tool-use: the proposed method enables an LLM to
maintain high tool-selection accuracy and reliability even with a large pool
of tools, paving the way for more scalable and capable tool-augmented AI
systems.
Overall, our work demonstrates that the integration of retrieval-based con-
text management is a promising direction to counteract the challenges of tool
proliferation in LLMs. By enabling models to learn which tool to use out of many
and only providing information for those tools, RAG-MCP offers a practical
solution for the next generation of AI agents operating with extensive toolkits.
It combines the strengths of retrieval augmentation and standardized tool APIs
to ensure that more tools do not mean worse performance but rather a broader
range of skills that the model can deploy accurately and efficiently.
2 Related Work
2.1 Tool Use in LLMs
LLMs have been augmented with external tools to overcome limitations in arith-
metic, retrieval, and code execution. Toolformer demonstrates a self-supervised
method by which a model learns when and how to call APIs such as calculators
or search engines, improving zero-shot performance across tasks [13]. ReAct in-
terleaves chain-of-thought reasoning with action steps to interact with external
environments (e.g., a Wikipedia API), yielding more interpretable and accurate
multi-step solutions [15]. WebGPT fine-tunes GPT-3 in a simulated browser en-
vironment, training it to navigate, search, and cite sources for long-form Q&A,
reducing hallucinations via grounded retrieval [9]. More recently, ChatGPT
Plugins introduced a production plugin ecosystem, allowing ChatGPT to access
up-to-date information and third-party services in a controlled, safety-oriented
framework [11].
2.2 Retrieval-Augmented Generation
Retrieval-Augmented Generation (RAG) first combined parametric LLMs with
non-parametric memory in a dense vector index, retrieving relevant passages at
inference time to improve knowledge-intensive tasks [6]. Subsequent work has
extended RAG to broad NLP paradigms, including modular and advanced RAG
variants that dynamically adapt retrieval per token or per query [4]. RAG’s
decoupling of memory access and generation inspires our RAG-MCP approach,
wherein MCP discovery is treated as a retrieval subproblem, orthogonal to core
text generation.
2.3 Model Context Protocol
The Model Context Protocol standardizes LLM-to-API interactions by bundling
resource prompts, authentication, and parameter schemas into modular “MCP”
servers. MCPs act as function-call extensions, similar to OpenAI's function-calling
API, but with greater community extensibility. The rapid growth of MCP repos-
itories (4,400+ servers on mcp.so as of April 2025 [14]) underscores the need for
scalable discovery and validation mechanisms.
3 Methodology
Overview. We study how the number of available MCP servers affects an LLM’s
ability to select and invoke the correct tool (“prompt bloat”) and present RAG-MCP,
a retrieval-augmented framework that mitigates this degradation by dynamically
retrieving only the most relevant MCP for each query.
3.1 Prompt Bloat and the MCP Stress Test
Modern LLMs must often choose among many possible external tools, each de-
scribed by an MCP schema. As the count of MCPs grows, including all their
descriptions in a single prompt leads to prompt bloat: the context window be-
comes saturated with distractors, reducing the model’s capacity to distinguish
and recall the correct tool.
This phenomenon parallels the Needle-in-a-Haystack (NIAH) test, which em-
beds a random fact (the “needle”) in the middle of a long context (the “haystack”)
and measures an LLM’s ability to retrieve it under varying context lengths and
depths [6,10]. In NIAH, performance drops sharply as the haystack grows,
revealing limits of in-context retrieval.
Inspired by NIAH, we design an MCP stress test on WebSearch tasks:
for each trial, we present the model with N MCP schemas (one ground-truth
and N− 1 distractors) and ask it to select and invoke the correct WebSearch
MCP. We vary N from 1 to 11100 in 26 intervals, measuring selection accu-
racy, task success, prompt token usage, and latency. This setup quantifies how
tool-selection ability degrades with increasing MCP pool size.
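The following sketch makes the stress-test protocol concrete. It assumes a registry of MCP schema dictionaries with name and description fields and a hypothetical ask_llm_to_pick helper that returns the name of the MCP chosen by the model; neither is part of our released code.

```python
# Sketch of one stress-test configuration: N candidate schemas (1 ground
# truth + N-1 distractors), shuffled, then offered to the model for selection.
# `ask_llm_to_pick` is a hypothetical helper returning the chosen MCP name.
import random

def make_pool(ground_truth: dict, distractors: list[dict], n: int) -> list[dict]:
    pool = [ground_truth] + random.sample(distractors, n - 1)
    random.shuffle(pool)
    return pool

def selection_accuracy(tasks: list[dict], distractors: list[dict], n: int,
                       ask_llm_to_pick) -> float:
    hits = 0
    for task in tasks:  # e.g., 20 web-search tasks per value of N
        pool = make_pool(task["gt_mcp"], distractors, n)
        prompt = task["query"] + "\n\nAvailable MCPs:\n" + "\n".join(
            f"- {m['name']}: {m['description']}" for m in pool)
        hits += int(ask_llm_to_pick(prompt) == task["gt_mcp"]["name"])
    return hits / len(tasks)
```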
3.2 RAG-MCP Framework
To overcome prompt bloat, RAG-MCP applies Retrieval-Augmented Gen-
eration (RAG) principles to tool selection. Instead of flooding the LLM with
all MCP descriptions, we maintain an external vector index of all available MCP
metadata. At query time:
1. Retrieval. A lightweight LLM-based retriever (e.g., Qwen) encodes the
user’s task description and performs a semantic search over the MCP in-
dex, returning the top-k candidate MCPs most similar to the task [6].
2. Validation. For each retrieved MCP, RAG-MCP can generate a few-shot
example query and test its response to ensure basic compatibility, functioning
as a “sanity check” before invocation.
3. Invocation. Only the single best MCP description, including its tool-use
parameters, is injected into the LLM prompt or function-calling API, which
then performs planning and execution without concern for tool discovery [2];
a condensed sketch of these three steps follows this list.
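The sketch below condenses the three steps. It assumes a retrieve_tools function such as the one sketched in the Introduction, a registry mapping MCP names to their schemas, and a hypothetical llm_call wrapper around a function-calling endpoint (our experiments drive qwen-max); the sanity_check stub stands in for the validation probe.

```python
# Condensed sketch of the RAG-MCP steps. `retrieve_tools` is any semantic
# retriever over the MCP index, `registry` maps MCP names to schema dicts,
# and `llm_call(messages, tools)` is a hypothetical function-calling wrapper;
# none of these names come from the released implementation.
def sanity_check(mcp: dict) -> bool:
    """Placeholder few-shot compatibility probe; always passes in this sketch."""
    return True

def rag_mcp_answer(query: str, registry: dict, retrieve_tools, llm_call, k: int = 3):
    # 1. Retrieval: semantic search over the indexed MCP metadata.
    candidates = retrieve_tools(query, k=k)
    # 2. Validation: keep the first candidate that passes the sanity check.
    selected = next((registry[n] for n in candidates if sanity_check(registry[n])),
                    registry[candidates[0]])
    # 3. Invocation: only the selected MCP's schema reaches the model.
    messages = [{"role": "user", "content": query}]
    return llm_call(messages, tools=[selected["schema"]])
```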
This design yields several benefits:
– Reduced Prompt Size. By supplying only relevant MCP metadata, RAG-
MCP avoids context window overload even when the full tool registry is
large.
– Lower Cognitive Load. The LLM no longer needs to sift through hundreds
of distractors, improving selection accuracy and reducing hallucinations [2].
– Resource Efficiency. Unlike conventional MCP clients (e.g., Claude or
early GPT-4 integrations) that must instantiate all registered MCP servers
before interaction, RAG-MCP activates only the selected MCP, lowering
startup cost and enabling support for arbitrarily large toolsets without in-
frastructure bottlenecks [10].
– Multi-Turn Robustness. In dialogues spanning many turns, the LLM
need not re-include all MCP prompts; RAG-MCP’s retriever handles tool
recall dynamically, freeing context space for task-specific reasoning.
3.3 Three-Step Pipeline Diagram
We summarize RAG-MCP's operation in three core steps. The flowchart is shown
in Fig. 2:
1. Task Input → Retriever: The user’s natural-language task is encoded
and submitted to the retriever.
2. Retriever → MCP Selection & Validation: The retriever searches the
vector index of MCP schemas, ranks candidates by semantic similarity, and
optionally tests each via synthetic examples.
3. LLM Execution with Selected MCP: The LLM receives only the se-
lected MCP’s schema and parameters and executes the task via the function-
calling interface.
Fig. 2. RAG-MCP pipeline: (1) encode user query with Qwen-max, (2) retrieve &
validate top-k MCPs, and (3) invoke chosen MCP
By decoupling tool discovery from generation, RAG-MCP ensures that LLMs
can scale to hundreds or thousands of MCPs without suffering prompt bloat or
decision fatigue, much as RAG systems avoid overwhelming an LLM with entire
corpora by retrieving only relevant passages.
3.4 Discussion
Our methodology combines the rigor of stress testing (via the MCP stress test)
with the effectiveness of retrieval-augmented tool use. The stress test quantifies
the sharp performance drop that occurs when distractor MCPs swell the prompt,
mirroring long-context recall failures in NIAH evaluations [5]. RAG-MCP then
counteracts this by dynamically narrowing the toolset, reducing both prompt to-
kens and decision complexity, and thereby restoring—and often improving—task
success rates.
Furthermore, by using an external index, RAG-MCP remains extensible: new
MCPs can be added by indexing their metadata, without retraining the LLM.
And by selectively activating servers on demand, it sidesteps the practical limits
on simultaneous MCP instantiation faced by prior tool-augmented LLM deploy-
ments.
4 Experiments
4.1 Stress Test
Setup To quantify how an LLM’s tool-selection ability scales with the size of
the MCP pool, we conduct a stress test in which the number of candidate MCP
servers, N, is varied from 1 to 11100 in intervals, while the position of the
ground-truth MCP server is moved from the top of the candidate list to the bottom. For each value of N, we randomly select
one “ground-truth” MCP (i.e., the only server capable of satisfying the task
requirement) and N− 1 distractor MCPs drawn from our full registry of over
4,400 publicly listed servers [14]. This design ensures that exactly one in every
N candidates is relevant. We then present the model with each of 20 web-search
tasks, requiring it to (a) choose the correct MCP, (b) issue a valid query or
answer, and (c) return the final result.
Fig. 3. This figure illustrates per-trial success across MCP positions from 1 to 11100,
where yellow denotes successful selection and purple denotes failure.
Results Figure 3 plots selection accuracy and task success as N increases. We
observe a clear non-monotonic trend. These results quantitatively confirm that
while RAG-MCP greatly mitigates prompt bloat and maintains high perfor-
mance in small to moderate pools, its retrieval precision and overall throughput
degrade as the tool registry scales to thousands of MCPs.
4.2 RAG-MCP
Setup We evaluate all methods on the web search subset of MCPBench [8],
which we use as our held-out testbed. For each baseline, we perform 20 inde-
pendent trials, and we deem a baseline successful if it produces more than 10
correct answers out of those 20. Within each trial, the model may engage in up
to 10 rounds of interaction with the MCP servers in order to arrive at its final
response.
To assess answer correctness in an automated and reproducible manner, we
employ Deepseek-v3 [7] as our evaluator. Because MCP servers require exter-
nal network access—and can therefore be sensitive to latency or transient fail-
ures—we enforce a controlled network environment throughout all experiments,
ensuring no requests fail due to connectivity issues. Finally, all trials are driven
by qwen-max-0125 as our underlying base LLM.
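For clarity, this evaluation protocol can be summarized as follows, assuming hypothetical run_trial and judge helpers that wrap, respectively, the multi-round interaction with the MCP servers and the Deepseek-v3 correctness check.

```python
# Sketch of the per-baseline evaluation loop. `run_trial` wraps up to ten
# rounds of interaction with the MCP servers and returns the final answer
# plus token counts; `judge` wraps the Deepseek-v3 correctness check.
# Both helpers are hypothetical names used only for illustration.
def evaluate_baseline(tasks: list[dict], run_trial, judge, max_rounds: int = 10) -> dict:
    records = []
    for task in tasks:  # 20 independent trials per baseline
        answer, prompt_tokens, completion_tokens = run_trial(task, max_rounds=max_rounds)
        records.append({
            "correct": judge(answer, task["reference"]),
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
        })
    num_correct = sum(r["correct"] for r in records)
    return {"success": num_correct > 10, "records": records}  # >10 of 20 correct
```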
Baselines We evaluate three selection strategies in our experiments:
1. Blank Conditioning: Prompt the LLM with all N MCP descriptions at
once and ask it to choose the correct one.
2. Actual Match: Pre-filter the candidate pool using simple keyword matching
on the task description and MCP metadata, then prompt the model on this
reduced set (an illustrative sketch of such a filter follows this list).
3. RAG-MCP: Employ our vector-index retriever to semantically rank all N
MCPs and inject only the top candidate’s schema into the LLM prompt for
execution.
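Since the exact matching rule of the Actual Match baseline is not central to our results, the sketch below shows one simple keyword-overlap filter of the kind described in item 2; the rule and field names are illustrative rather than a description of our implementation.

```python
# One possible keyword-overlap pre-filter for an "Actual Match"-style
# baseline (the exact rule used in the experiments may differ; the
# registry field names are illustrative).
import re

def keyword_prefilter(task: str, registry: list[dict], max_keep: int = 10) -> list[dict]:
    task_words = set(re.findall(r"[a-z0-9]+", task.lower()))

    def overlap(mcp: dict) -> int:
        meta = (mcp["name"] + " " + mcp["description"]).lower()
        return len(task_words & set(re.findall(r"[a-z0-9]+", meta)))

    ranked = sorted(registry, key=overlap, reverse=True)
    return [m for m in ranked[:max_keep] if overlap(m) > 0]
```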
Metrics We evaluate performance using three key metrics for each baseline
method:
– Accuracy (%): Percentage of trials in which the model selected the ground-truth
MCP.
– Avg Prompt Tokens: Mean number of tokens consumed by the prompt,
including injected MCP metadata.
– Avg Completion Tokens: Mean number of tokens generated by the model
as its final output.
Judgment of the final answer is automated using a Llama-based verifier (“Llama
as Judge”) to compare model outputs against ground truth.
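These metrics reduce to straightforward averages over per-trial records; the sketch below assumes illustrative field names for each record, not names from our released code.

```python
# The three reported metrics as simple averages over per-trial records
# (field names are illustrative).
from statistics import mean

def summarize(records: list[dict]) -> dict:
    return {
        "accuracy_pct": 100 * mean(r["picked_ground_truth"] for r in records),
        "avg_prompt_tokens": mean(r["prompt_tokens"] for r in records),
        "avg_completion_tokens": mean(r["completion_tokens"] for r in records),
    }
```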
Baseline Accuracy (%) Avg Prompt Tokens Avg Completion Tokens
RAG-MCP 43.13 1084.00 78.14
Actual Match 18.20 1646.00 23.60
Blank 13.62 2133.84 162.25
Table 1. Baseline performance comparison on accuracy and token usage
Results Table 1 summarizes the performance of the evaluated baseline methods,
clearly demonstrating the effectiveness of RAG-MCP.
As the table shows, RAG-MCP achieves the highest accuracy at 43.13%,
significantly outperforming the Actual Match and Blank Conditioning meth-
ods, which scored 18.20% and 13.62%, respectively. Furthermore, RAG-MCP
notably reduces the average number of prompt tokens to 1084, reflecting a
substantial reduction compared to the other baselines, especially Blank Condi-
tioning, which requires 2133.84 tokens. While RAG-MCP shows an increase in
completion tokens (78.14) compared to Actual Match (23.60), this trade-off is
beneficial as it correlates with a higher accuracy and overall task success rate.
5 Analysis
Stress Test Analysis Figure 3 illustrates per-trial success across MCP po-
sitions from 1 to 11100, where yellow denotes successful selection and purple
denotes failure. We observe that:
– High Early-Stage Success: MCP positions below 30 exhibit predomi-
nantly yellow regions, indicating success rates above 90% when the candidate
pool is minimal.
– Mid-Range Variability: In the range of positions 31–70, clusters of purple
emerge intermittently, reflecting lower accuracy as semantic overlap among
MCP descriptions increases.
– Performance Degradation at Scale: Beyond position ~100, purple domi-
nates, signifying that retrieval precision diminishes when handling very large
tool registries.
– Residual Success Islands: Occasional yellow patches at higher positions
suggest that certain MCPs remain well-aligned to specific queries, providing
robustness even in extensive pools.
These patterns confirm that while RAG-MCP effectively curbs prompt bloat and
maintains high accuracy in small to moderate MCP pools, retrieval precision
challenges arise as the total number of MCPs grows, motivating future work on
hierarchical or adaptive retrieval mechanisms.
5.1 Analysis of RAG-MCP Results
The superior performance of RAG-MCP can be attributed to several factors:
– Focused Context Filtering: By injecting only the single most relevant
MCP schema, the model avoids the distraction caused by irrelevant tool
descriptions, resulting in clearer decision boundaries.
– Prompt Efficiency: The dramatic reduction in prompt tokens allows the
model to allocate more of its context window to reasoning about the task
itself rather than parsing extraneous metadata.
– Balanced Generation: Although RAG-MCP slightly increases completion
token usage relative to Actual Match, this overhead reflects more thorough
reasoning and verification steps, which correlate with higher accuracy.
Overall, these findings confirm that retrieval-augmented selection of MCPs
effectively tames prompt bloat and enhances an LLM’s tool-selection reliability,
making RAG-MCP a compelling solution for scalable external tool integration.
6 Conclusion
We present RAG-MCP, a simple yet powerful framework that tames large
MCP toolsets by retrieving only the most relevant schema for each query. With
focused retrieval, RAG-MCP:
– Drastically reduces prompt size, cutting token usage by over half com-
pared to feeding all tools at once.
– Boosts selection accuracy, more than tripling the success rate of naïve
and keyword-based methods under heavy load.
– Maintains extensibility, since new MCPs can be indexed on the fly with-
out retraining the model.
In essence, RAG-MCP turns a sprawling library of hundreds or thousands of
tools into a lean, on-demand toolkit. Future work will refine retrieval at extreme
scale—via hierarchical indexes or adaptive strategies—and explore multi-tool
workflows and real-world agent deployments. RAG-MCP lays the “golden core”
for scalable, reliable LLM agents that wield vast external services with precision
and efficiency.
References
1. Anthropic: Introducing the model context protocol (2024), https://www.
anthropic.com/news/model-context-protocol
2. NVIDIA Blog: What is retrieval-augmented generation, aka RAG? (2025), https://blogs.
nvidia.com/blog/what-is-retrieval-augmented-generation/
3. Chen, Y.C., Hsu, P.C., Hsu, C.J., Shiu, D.s.: Enhancing function-calling capa-
bilities in llms: Strategies for prompt formats, data integration, and multilingual
translation. arXiv preprint arXiv:2412.01130 (2024)
4. Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., Wang, H.,
Wang, H.: Retrieval-augmented generation for large language models: A survey.
arXiv preprint arXiv:2312.10997 2 (2023)
5. gkamradt: The needle in a haystack test (2024), https://github.com/gkamradt/
LLMTest_NeedleInAHaystack
6. Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küt-
tler, H., Lewis, M., Yih, W.t., Rocktäschel, T., Riedel, S., Kiela, D.: Retrieval-
augmented generation for knowledge-intensive nlp tasks. In: Larochelle, H.,
Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural
Information Processing Systems. vol. 33, pp. 9459–9474. Curran Associates,
Inc. (2020), https://proceedings.neurips.cc/paper_files/paper/2020/file/
6b493230205f780e1bc26945df7481e5-Paper.pdf
7. Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang,
C., Ruan, C., et al.: Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437
(2024)
8. Luo, Z., Shi, X., Lin, X., Gao, J.: Evaluation report on mcp servers (2025), https:
//arxiv.org/abs/2504.11094
9. Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., Hesse, C., Jain,
S., Kosaraju, V., Saunders, W., Jiang, X., Cobbe, K., Eloundou, T., Krueger,
G., Button, K., Knight, M., Chess, B., Schulman, J.: Webgpt: Browser-assisted
question-answering with human feedback (2022), https://arxiv.org/abs/2112.
09332
10. OpenAI: Openai function calling, https://platform.openai.com/docs/guides/
function-calling
11. OpenAI: ChatGPT plugins (2023), https://openai.com/index/chatgpt-plugins
12. Patil, S.G., Zhang, T., Wang, X., Gonzalez, J.E.: Gorilla: Large language model
connected with massive apis. Advances in Neural Information Processing Systems
37, 126544–126565 (2024)
13. Schick, T., Dwivedi-Yu, J., Dessi, R., Raileanu, R., Lomeli, M., Hambro,
E., Zettlemoyer, L., Cancedda, N., Scialom, T.: Toolformer: Language mod-
els can teach themselves to use tools. In: Oh, A., Naumann, T., Glober-
son, A., Saenko, K., Hardt, M., Levine, S. (eds.) Advances in Neural In-
formation Processing Systems. vol. 36, pp. 68539–68551. Curran Associates,
Inc. (2023), https://proceedings.neurips.cc/paper_files/paper/2023/file/
d842425e4bf79ba039352da0f658a906-Paper-Conference.pdf
14. ShipAny: Mcp servers (2025), https://mcp.so/
15. Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: Re-
act: Synergizing reasoning and acting in language models. International Confer-
ence on Learning Representations (ICLR) (2023), https://par.nsf.gov/biblio/
10451467
7 Acknowledgements
We gratefully acknowledge Zhiling Luo, Xiaorong Shi, Xuanrui Lin, and Jinyang Gao
for their seminal report “Evaluation Report on MCP Servers.” Their publicly re-
leased code framework and the accompanying WebSearch dataset formed the
foundation on which our own work was developed.