A Guide To Properly Testing MCP Applications
Introduction
The Model Context Protocol (MCP) is one of the most exciting technologies I've worked with. It has accelerated my development and debugging speed by allowing me to connect resources from my production environment directly within Cursor.
However, as MCP adoption grows, managing this additional information within the finite context window becomes a complex problem. Even getting the MCP Client to choose the correct tool is often much more difficult than it should be.
What is MCP?
At its core, MCP is a way for you to provide additional context to LLMs. It consists of three separate components.
- Hosts are AI tools or programs like Cursor, Gemini, or ChatGPT.
- Clients are protocol components inside the host that maintain connections to MCP servers.
- Servers are external programs that can be run locally or remotely and provide tools to the host.
MCP also introduces several core concepts to standardize interactions.
- Resources are data and content exposed by MCP servers to LLMs.
- Prompts allow for reusable templates and workflows for consistent queries.
- Sampling is a mechanism for servers to request completions from LLMs.
- Tools provide actions that LLMs can perform through MCP servers.
- Roots define the boundaries within which MCP servers can operate.
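To make these pieces concrete, below is a minimal server sketch using the official TypeScript SDK (@modelcontextprotocol/sdk). The tool name and logic are placeholders of my own, and the exact API surface can vary between SDK versions, so treat it as illustrative rather than canonical.

import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

// A toy server exposing a single tool.
const server = new McpServer({ name: "demo-server", version: "0.1.0" });

server.tool(
  "add_numbers",
  "Add two numbers and return the sum",
  { a: z.number(), b: z.number() },
  async ({ a, b }) => ({
    content: [{ type: "text", text: String(a + b) }],
  })
);

// Hosts typically launch local servers as child processes over stdio.
const transport = new StdioServerTransport();
await server.connect(transport);

A host like Cursor would launch this server, discover the add_numbers tool through its client, and expose it to the model.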
The Problem With MCP
There has been an explosion of MCP servers and clients. Unfortunately, each additional MCP server degrades performance: every new tool and its description is passed into the context window. Clients typically inject all available tools into the system prompt, and the model then decides which tool to use when you prompt it.
This leads to a number of issues: a bias toward tools at the beginning and end of the list, context overflow when too many tools are added, and overall performance degradation the more servers you connect. Some clients, like Cursor, cap you at 40 tools in total.
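To see where that context actually goes, here is a rough sketch of how a host might fold every connected server's tool list into its system prompt. listTools comes from the MCP TypeScript SDK's client; buildSystemPrompt is a hypothetical helper, not how any particular client is implemented.

import { Client } from "@modelcontextprotocol/sdk/client/index.js";

// Hypothetical helper: every tool name and description from every connected
// server ends up in the prompt, so token usage grows with each server you add.
async function buildSystemPrompt(clients: Client[]): Promise<string> {
  const lines: string[] = ["You may call the following tools:"];
  for (const client of clients) {
    const { tools } = await client.listTools();
    for (const tool of tools) {
      lines.push(`- ${tool.name}: ${tool.description ?? "(no description)"}`);
    }
  }
  return lines.join("\n");
}

The longer that string gets, the noisier tool selection becomes, which is exactly the degradation described above.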
Introducing Evals
Evals try to solve the problem of tool call selection by giving you a programmatic way to test your client or server. They simulate the entire flow from prompt to tool call selection to tool call result. Finally, they grade the response using an LLM that acts as an evaluator.
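To illustrate the grading step, here is a rough sketch of an LLM-as-judge grader built on the Vercel AI SDK, which the walkthrough below also uses. The rubric fields and prompt wording are my own illustration, not the exact criteria any particular eval package applies.

import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';

// Illustrative rubric only; real eval frameworks define their own criteria.
const rubric = z.object({
  accuracy: z.number().min(1).max(5),
  completeness: z.number().min(1).max(5),
  relevance: z.number().min(1).max(5),
  comments: z.string(),
});

// Ask an LLM to score the tool call result against the original prompt.
async function gradeResponse(prompt: string, toolCallResult: string) {
  const { object } = await generateObject({
    model: openai('gpt-4'),
    schema: rubric,
    prompt:
      'You are grading an MCP tool call.\n' +
      `User prompt: ${prompt}\n` +
      `Tool call result: ${toolCallResult}\n` +
      'Score the result against the rubric.',
  });
  return object;
}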
Evals for Hosts
Evals let hosts test how many tools can be passed into the system prompt before performance degrades, and test the system prompt itself to ensure the correct tool is chosen for the task.
Evals for Servers
Evals allow servers to test their tool call descriptions and responses and better understand how clients will interact with their tools. They can also act as an end-to-end testing platform for your server to ensure that tool calls are functioning properly.
Implementing Evals
For the walkthrough, I will be using the mcp-evals package, which I originally built internally while trying to test my own client and server. It can be run either as a GitHub Action or as a standalone script via npm. You can find the complete documentation here.
Installation
npm install mcp-evals
Creating the Eval
For the example below, I’ll be using the ArgoCD MCP server. Behind the scenes, the code provisions an MCP client and passes it the prompt that the grade function accepts as its second parameter. The model then selects a tool and calls it. Finally, we use OpenAI to grade the response.
import { openai } from '@ai-sdk/openai';
import { grade, EvalFunction, EvalConfig } from 'mcp-evals';

const listApplicationsEval: EvalFunction = {
  name: 'List Applications Evaluation',
  description: 'Evaluates the model\'s ability to list ArgoCD applications',
  run: async () => {
    // grade() spins up an MCP client, sends the prompt, lets the model select
    // and call a tool, and returns the grader's verdict as a JSON string.
    const result = await grade(openai("gpt-4"), "Show me all the ArgoCD applications in the cluster.");
    return JSON.parse(result);
  }
};

const getApplicationEval: EvalFunction = {
  name: 'Get Application Evaluation',
  description: 'Evaluates the model\'s ability to retrieve a specific ArgoCD application',
  run: async () => {
    const result = await grade(openai("gpt-4"), "Get details for the 'my-app' application in ArgoCD.");
    return JSON.parse(result);
  }
};

const createApplicationEval: EvalFunction = {
  name: 'Create Application Evaluation',
  description: 'Evaluates the model\'s ability to create a new ArgoCD application',
  run: async () => {
    const result = await grade(openai("gpt-4"), "Create a new ArgoCD application named 'test-app' pointing to 'https://github.com/test/repo' in the 'default' namespace.");
    return JSON.parse(result);
  }
};

// The exported config is what the mcp-evals runner picks up.
export const evalConfig: EvalConfig = {
  model: openai('gpt-4'),
  evals: [
    listApplicationsEval,
    getApplicationEval,
    createApplicationEval,
  ]
};
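With the config exported, the package's runner executes each eval against your server and prints the graded results. At the time of writing I run it with something like npx mcp-eval path/to/evals.ts path/to/server.ts and an OPENAI_API_KEY set, but check the documentation linked above for the exact CLI invocation and the GitHub Action setup, as the interface may change.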
Future Improvements and Conclusion
MCP is an exciting step forward for users of AI applications: it enables genuinely agentic actions inside existing AI interfaces. That said, because the spec and its developer tooling are still so new, a number of problems remain to be solved. Better approaches to tool call selection, context management, and a tool call feedback loop are sorely needed.