Tiny Agents: an MCP-powered agent in 50 lines of code

Hugging Face - Blog


Julien Chaumond · 2025-04-25 · via Hugging Face - Blog


New! (May 23, '25) If you prefer Python, check out the companion post Tiny Agents in Python.

Over the past few weeks, I've been diving into MCP (Model Context Protocol) to understand what the hype around it was all about.

My TL;DR is that it's fairly simple, but still quite powerful: MCP is a standard API to expose sets of Tools that can be hooked to LLMs.

It is fairly simple to extend an Inference Client – at HF, we have two official client SDKs: @huggingface/inference in JS, and huggingface_hub in Python – to also act as an MCP client and hook the available tools from MCP servers into the LLM inference.

But while doing that, came my second realization:

Once you have an MCP Client, an Agent is literally just a while loop on top of it.

In this short article, I will walk you through how I implemented it in TypeScript (JS), how you can adopt MCP too, and how it's going to make Agentic AI way simpler going forward.

Image credit https://x.com/adamdotdev

How to run the complete demo

If you have Node.js (with npm or pnpm), just run this in a terminal:

npx @huggingface/mcp-client

or if using pnpm:

pnpx @huggingface/mcp-client

This installs my package into a temporary folder then executes its command.

You'll see your simple Agent connect to two distinct MCP servers (running locally), load their tools, then prompt you for a conversation.

By default our example Agent connects to the following two MCP servers:

the "canonical" file system server, which gets access to your Desktop,
and the Playwright MCP server, which knows how to use a sandboxed Chromium browser for you.

Note: this is a bit counter-intuitive but currently, all MCP servers are actually local processes (though remote servers are coming soon).

Our input for this first video was:

write a haiku about the Hugging Face community and write it to a file named "hf.txt" on my Desktop

Now let us try this prompt that involves some Web browsing:

do a Web Search for HF inference providers on Brave Search and open the first 3 results

Default model and provider

In terms of model/provider pair, our example Agent uses by default:

"Qwen/Qwen2.5-72B-Instruct"
running on Nebius

This is all configurable through env variables! See:

const agent = new Agent({
    provider: process.env.PROVIDER ?? "nebius",
    model: process.env.MODEL_ID ?? "Qwen/Qwen2.5-72B-Instruct",
    apiKey: process.env.HF_TOKEN,
    servers: SERVERS,
});

Where does the code live

The Tiny Agent code lives in the mcp-client sub-package of the huggingface.js mono-repo, which is the GitHub mono-repo in which all our JS libraries reside.

https://github.com/huggingface/huggingface.js/tree/main/packages/mcp-client

The codebase uses modern JS features (notably, async generators) which make things way easier to implement, especially asynchronous events like the LLM responses. You might need to ask an LLM about those JS features if you're not yet familiar with them.
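If async generators are new to you, here is a minimal sketch of the two features the codebase leans on: yielding chunks as they arrive, and delegating to another generator with `yield*`. Nothing here comes from the actual package; `fakeLlmStream`, `twoTurns` and `collect` are made-up names for illustration.

```typescript
// Hypothetical stand-in for a streaming LLM response: yields text chunks one by one.
async function* fakeLlmStream(text: string): AsyncGenerator<string> {
    for (const word of text.split(" ")) {
        // Simulate waiting for the next chunk to arrive.
        await Promise.resolve();
        yield word;
    }
}

// `yield*` forwards every chunk of an inner generator to our own caller.
async function* twoTurns(): AsyncGenerator<string> {
    yield* fakeLlmStream("first turn");
    yield* fakeLlmStream("second turn");
}

// Consumers read chunks with `for await` as soon as they are produced.
async function collect(stream: AsyncGenerator<string>): Promise<string[]> {
    const chunks: string[] = [];
    for await (const chunk of stream) {
        chunks.push(chunk);
    }
    return chunks;
}

collect(twoTurns()).then((chunks) => console.log(chunks));
```

The caller never needs to know that the chunks came from two separate inner streams, which is what makes this pattern convenient for chaining several LLM turns.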

The foundation for this: tool calling native support in LLMs.

What is going to make this whole blogpost very easy is that the recent crop of LLMs (both closed and open) has been trained for function calling, a.k.a. tool use.

A tool is defined by its name, a description, and a JSONSchema representation of its parameters. In some sense, it is an opaque representation of any function's interface, as seen from the outside (meaning, the LLM does not care how the function is actually implemented).

const weatherTool = {
    type: "function",
    function: {
        name: "get_weather",
        description: "Get current temperature for a given location.",
        parameters: {
            type: "object",
            properties: {
                location: {
                    type: "string",
                    description: "City and country e.g. Bogotá, Colombia",
                },
            },
        },
    },
};

The canonical documentation I will link to here is OpenAI's function calling doc. (Yes... OpenAI pretty much defines the LLM standards for the whole community 😅).

Inference engines let you pass a list of tools when calling the LLM, and the LLM is free to call zero, one or more of those tools. As a developer, you run the tools and feed their result back into the LLM to continue the generation.
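To make that round trip concrete, here is a small sketch of the developer's side: take a tool call emitted by the LLM, run a local function, and wrap the result as a "tool" message. The message shapes are a simplified version of the OpenAI-style chat-completion format. `localTools`, `runToolCall` and the canned weather value are assumptions made up for this example.

```typescript
// Simplified OpenAI-style shapes for a tool call and the message that answers it.
interface ToolCall {
    id: string;
    type: "function";
    function: { name: string; arguments: string }; // arguments is a JSON-encoded string
}
interface ToolMessage {
    role: "tool";
    tool_call_id: string;
    content: string;
}

// Hypothetical local implementations, keyed by tool name.
type ToolImpl = (args: Record<string, unknown>) => string;
const localTools = new Map<string, ToolImpl>([
    ["get_weather", (args) => `It is 18°C in ${String(args.location)}`], // canned value
]);

// Run one tool call requested by the LLM and wrap the outcome as a "tool" message.
function runToolCall(call: ToolCall): ToolMessage {
    const impl = localTools.get(call.function.name);
    let content: string;
    if (!impl) {
        content = `Error: unknown tool ${call.function.name}`;
    } else {
        try {
            content = impl(JSON.parse(call.function.arguments));
        } catch (err) {
            // Report bad JSON or tool failures to the LLM instead of crashing.
            content = `Error: ${err instanceof Error ? err.message : String(err)}`;
        }
    }
    return { role: "tool", tool_call_id: call.id, content };
}

// The kind of call an LLM might emit for the weather tool defined above.
const exampleCall: ToolCall = {
    id: "call_1",
    type: "function",
    function: { name: "get_weather", arguments: '{"location":"Bogotá, Colombia"}' },
};
console.log(runToolCall(exampleCall));
```

The `tool_call_id` is what lets the LLM match each result to the call it made, so it has to be copied over verbatim.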

Note that in the backend (at the inference engine level), the tools are simply passed to the model in a specially-formatted chat_template, like any other message, and then parsed out of the response (using model-specific special tokens) to expose them as tool calls. See an example in our chat-template playground.
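As a rough illustration of that parsing step, here is a sketch that assumes the model wraps each call in `<tool_call>` tags around a JSON object. That convention is an assumption for this example: the real markers are model-specific, and the inference engine normally does this for you. `parseToolCalls` is a made-up helper.

```typescript
interface ParsedToolCall {
    name: string;
    arguments: Record<string, unknown>;
}

// ASSUMPTION: tool calls appear as <tool_call>{"name": ..., "arguments": ...}</tool_call>.
function parseToolCalls(rawOutput: string): { text: string; toolCalls: ParsedToolCall[] } {
    const pattern = /<tool_call>\s*([\s\S]*?)\s*<\/tool_call>/g;
    const toolCalls: ParsedToolCall[] = [];
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(rawOutput)) !== null) {
        try {
            const parsed = JSON.parse(match[1]);
            if (parsed && typeof parsed.name === "string") {
                toolCalls.push({ name: parsed.name, arguments: parsed.arguments ?? {} });
            }
        } catch {
            // Malformed JSON inside the tags: skip this block rather than crash.
        }
    }
    // Whatever is left outside the tags is ordinary assistant text.
    const text = rawOutput.replace(pattern, "").trim();
    return { text, toolCalls };
}

const rawModelOutput =
    'Let me check.\n<tool_call>\n{"name": "get_weather", "arguments": {"location": "Paris, France"}}\n</tool_call>';
console.log(parseToolCalls(rawModelOutput));
```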

Implementing an MCP client on top of InferenceClient

Now that we know what a tool is in recent LLMs, let us implement the actual MCP client.

The official doc at https://modelcontextprotocol.io/quickstart/client is fairly well-written. You only have to replace any mention of the Anthropic client SDK with any other OpenAI-compatible client SDK. (There is also an llms.txt you can feed into your LLM of choice to help you code along).

As a reminder, we use HF's InferenceClient for our inference client.

The complete McpClient.ts code file is here if you want to follow along using the actual code 🤓

Our McpClient class has:

an Inference Client (works with any Inference Provider, and @huggingface/inference supports both remote and local endpoints)
a set of MCP client sessions, one for each connected MCP server (yes, we want to support multiple servers)
and a list of available tools that is going to be filled from the connected servers and just slightly re-formatted.

export class McpClient {
    protected client: InferenceClient;
    protected provider: string;
    protected model: string;
    private clients: Map<ToolName, Client> = new Map();
    public readonly availableTools: ChatCompletionInputTool[] = [];

    constructor({ provider, model, apiKey }: { provider: InferenceProvider; model: string; apiKey: string }) {
        this.client = new InferenceClient(apiKey);
        this.provider = provider;
        this.model = model;
    }
    
    // [...]
}

To connect to an MCP server, the official @modelcontextprotocol/sdk/client TypeScript SDK provides a Client class with a listTools() method:

async addMcpServer(server: StdioServerParameters): Promise<void> {
    const transport = new StdioClientTransport({
        ...server,
        env: { ...server.env, PATH: process.env.PATH ?? "" },
    });
    const mcp = new Client({ name: "@huggingface/mcp-client", version: packageVersion });
    await mcp.connect(transport);

    const toolsResult = await mcp.listTools();
    debug(
        "Connected to server with tools:",
        toolsResult.tools.map(({ name }) => name)
    );

    for (const tool of toolsResult.tools) {
        this.clients.set(tool.name, mcp);
    }

    this.availableTools.push(
        ...toolsResult.tools.map((tool) => {
            return {
                type: "function",
                function: {
                    name: tool.name,
                    description: tool.description,
                    parameters: tool.inputSchema,
                },
            } satisfies ChatCompletionInputTool;
        })
    );
}

StdioServerParameters is an interface from the MCP SDK that will let you easily spawn a local process: as we mentioned earlier, currently, all MCP servers are actually local processes.

For each MCP server we connect to, we slightly re-format its list of tools and add them to this.availableTools.
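The bookkeeping can be tried out without any MCP server. The sketch below mirrors the two pieces of state (a tool-name lookup and a flat tool list) using plain data. `ToolRegistry`, the server labels and the tool names are illustrative assumptions, not part of the package. One deliberate difference: a `Map` silently overwrites an existing key on `set()`, so two servers exposing the same tool name would shadow each other; this sketch chooses to throw instead.

```typescript
// Minimal shapes: an MCP tool descriptor and the chat-completion tool format.
interface McpTool {
    name: string;
    description?: string;
    inputSchema: Record<string, unknown>;
}
interface ChatTool {
    type: "function";
    function: { name: string; description?: string; parameters: Record<string, unknown> };
}

// Hypothetical registry tracking which server owns which tool.
class ToolRegistry {
    readonly owners = new Map<string, string>(); // tool name -> server label
    readonly availableTools: ChatTool[] = [];

    register(serverLabel: string, tools: McpTool[]): void {
        for (const tool of tools) {
            if (this.owners.has(tool.name)) {
                // Design choice of this sketch: fail loudly on a name collision.
                throw new Error(`Tool name collision: ${tool.name}`);
            }
            this.owners.set(tool.name, serverLabel);
            // The "slight re-format": wrap the MCP descriptor in the chat-completion shape.
            this.availableTools.push({
                type: "function",
                function: { name: tool.name, description: tool.description, parameters: tool.inputSchema },
            });
        }
    }
}

const registry = new ToolRegistry();
const emptySchema = { type: "object", properties: {} };
registry.register("filesystem", [{ name: "read_file", inputSchema: emptySchema }]);
registry.register("browser", [{ name: "browser_navigate", inputSchema: emptySchema }]);
console.log(registry.owners.get("browser_navigate"), registry.availableTools.length);
```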

How to use the tools

Easy, you just pass this.availableTools to your LLM chat-completion, in addition to your usual array of messages:

const stream = this.client.chatCompletionStream({
    provider: this.provider,
    model: this.model,
    messages,
    tools: this.availableTools,
    tool_choice: "auto",
});

tool_choice: "auto" is the parameter you pass for the LLM to generate zero, one, or multiple tool calls.

When parsing or streaming the output, the LLM will generate some tool calls (i.e. a function name, and some JSON-encoded arguments), which you (as a developer) need to execute. The MCP client SDK once again makes that very easy; it has a client.callTool() method:

const toolName = toolCall.function.name;
const toolArgs = JSON.parse(toolCall.function.arguments);

const toolMessage: ChatCompletionInputMessageTool = {
    role: "tool",
    tool_call_id: toolCall.id,
    content: "",
    name: toolName,
};

// Get the appropriate session for this tool
const client = this.clients.get(toolName);
if (client) {
    const result = await client.callTool({ name: toolName, arguments: toolArgs });
    toolMessage.content = result.content[0].text;
} else {
    toolMessage.content = `Error: No session found for tool: ${toolName}`;
}
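The snippet above reads only the first content item and assumes it is text. An MCP tool result can carry several content items, and they are not all text, so a slightly more defensive conversion might look like the sketch below. `toolResultToString` and the placeholder wording are assumptions for illustration, and the content type is a simplified view rather than the SDK's own typings.

```typescript
// Simplified view of the content items an MCP tool result may carry.
type ToolResultContent =
    | { type: "text"; text: string }
    | { type: "image"; data: string; mimeType: string }
    | { type: "resource"; resource: Record<string, unknown> };

// Join all text parts; mark non-text parts with a placeholder instead of dropping them silently.
function toolResultToString(content: ToolResultContent[]): string {
    if (content.length === 0) {
        return "(empty tool result)";
    }
    return content
        .map((item) => (item.type === "text" ? item.text : `[${item.type} content omitted]`))
        .join("\n");
}

console.log(
    toolResultToString([
        { type: "text", text: "Saved hf.txt" },
        { type: "image", data: "aGVsbG8=", mimeType: "image/png" },
    ])
);
```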

Finally, you add the resulting tool message to your messages array and feed it back into the LLM.
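One detail worth spelling out when streaming: a tool call does not arrive in one piece. With OpenAI-style streaming, each chunk carries fragments keyed by an `index`, and the JSON arguments come in pieces that must be concatenated before parsing. Here is a simplified sketch of that accumulation; `accumulateToolCalls` and the sample deltas are made up for this example.

```typescript
// Simplified shape of a streamed tool-call fragment (OpenAI-style streaming deltas).
interface ToolCallDelta {
    index: number;
    id?: string;
    function?: { name?: string; arguments?: string };
}
interface FinalToolCall {
    id: string;
    type: "function";
    function: { name: string; arguments: string };
}

// Merge fragments by `index`: id and name typically arrive once, arguments arrive piecewise.
function accumulateToolCalls(deltas: ToolCallDelta[]): FinalToolCall[] {
    const calls: FinalToolCall[] = [];
    for (const delta of deltas) {
        if (!calls[delta.index]) {
            calls[delta.index] = { id: "", type: "function", function: { name: "", arguments: "" } };
        }
        const call = calls[delta.index];
        if (delta.id) call.id = delta.id;
        if (delta.function?.name) call.function.name = delta.function.name;
        if (delta.function?.arguments) call.function.arguments += delta.function.arguments;
    }
    return calls;
}

const sampleDeltas: ToolCallDelta[] = [
    { index: 0, id: "call_1", function: { name: "get_weather", arguments: "" } },
    { index: 0, function: { arguments: '{"loca' } },
    { index: 0, function: { arguments: 'tion":"Paris"}' } },
];
const [assembled] = accumulateToolCalls(sampleDeltas);
// Only parse once the stream for this call is complete.
console.log(assembled.function.name, JSON.parse(assembled.function.arguments));
```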

Our 50-lines-of-code Agent 🤯

Now that we have an MCP client capable of connecting to arbitrary MCP servers to get lists of tools and capable of injecting them and parsing them from the LLM inference, well... what is an Agent?

Once you have an inference client with a set of tools, then an Agent is just a while loop on top of it.

In more detail, an Agent is simply a combination of:

a system prompt
an LLM Inference client
an MCP client to hook a set of Tools into it from a bunch of MCP servers
some basic control flow (see below for the while loop)

The complete Agent.ts code file is here.

Our Agent class simply extends McpClient:

export class Agent extends McpClient {
    private readonly servers: StdioServerParameters[];
    protected messages: ChatCompletionInputMessage[];

    constructor({
        provider,
        model,
        apiKey,
        servers,
        prompt,
    }: {
        provider: InferenceProvider;
        model: string;
        apiKey: string;
        servers: StdioServerParameters[];
        prompt?: string;
    }) {
        super({ provider, model, apiKey });
        this.servers = servers;
        this.messages = [
            {
                role: "system",
                content: prompt ?? DEFAULT_SYSTEM_PROMPT,
            },
        ];
    }
}

By default, we use a very simple system prompt inspired by the one shared in the GPT-4.1 prompting guide.

Even though this comes from OpenAI 😈, this sentence in particular applies to more and more models, both closed and open:

We encourage developers to exclusively use the tools field to pass tools, rather than manually injecting tool descriptions into your prompt and writing a separate parser for tool calls, as some have reported doing in the past.

Which is to say, we don't need to provide painstakingly formatted lists of tool use examples in the prompt. The tools: this.availableTools param is enough.

Loading the tools on the Agent is literally just connecting to the MCP servers we want (in parallel because it's so easy to do in JS):

async loadTools(): Promise<void> {
    await Promise.all(this.servers.map((s) => this.addMcpServer(s)));
}

We add two extra tools (outside of MCP) that can be used by the LLM for our Agent's control flow:

const taskCompletionTool: ChatCompletionInputTool = {
    type: "function",
    function: {
        name: "task_complete",
        description: "Call this tool when the task given by the user is complete",
        parameters: {
            type: "object",
            properties: {},
        },
    },
};
const askQuestionTool: ChatCompletionInputTool = {
    type: "function",
    function: {
        name: "ask_question",
        description: "Ask a question to the user to get more info required to solve or clarify their problem.",
        parameters: {
            type: "object",
            properties: {},
        },
    },
};
const exitLoopTools = [taskCompletionTool, askQuestionTool];

When calling any of these tools, the Agent will break its loop and give control back to the user for new input.

The complete while loop

Behold our complete while loop.🎉

The gist of our Agent's main while loop is that we simply iterate with the LLM alternating between tool calling and feeding it the tool results, and we do so until the LLM starts to respond with two non-tool messages in a row.

This is the complete while loop:

let numOfTurns = 0;
let nextTurnShouldCallTools = true;
while (true) {
    try {
        yield* this.processSingleTurnWithTools(this.messages, {
            exitLoopTools,
            exitIfFirstChunkNoTool: numOfTurns > 0 && nextTurnShouldCallTools,
            abortSignal: opts.abortSignal,
        });
    } catch (err) {
        if (err instanceof Error && err.message === "AbortError") {
            return;
        }
        throw err;
    }
    numOfTurns++;
    const currentLast = this.messages.at(-1)!;
    if (
        currentLast.role === "tool" &&
        currentLast.name &&
        exitLoopTools.map((t) => t.function.name).includes(currentLast.name)
    ) {
        return;
    }
    if (currentLast.role !== "tool" && numOfTurns > MAX_NUM_TURNS) {
        return;
    }
    if (currentLast.role !== "tool" && nextTurnShouldCallTools) {
        return;
    }
    if (currentLast.role === "tool") {
        nextTurnShouldCallTools = false;
    } else {
        nextTurnShouldCallTools = true;
    }
}
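To see the exit conditions in action without an LLM, here is a simplified, synchronous re-creation of the loop driven by a scripted fake model. It is a sketch under stated assumptions, not the package's code: it omits streaming, abort handling and the `exitIfFirstChunkNoTool` early exit, and `Reply`, `processSingleTurn`, `runAgent` and the `write_file` tool name are invented for illustration.

```typescript
type Message =
    | { role: "user" | "assistant"; content: string; toolCalls?: { name: string }[] }
    | { role: "tool"; name: string; content: string };

// A scripted "LLM": each entry is what the model replies on that turn (hypothetical stand-in).
type Reply = { text: string } | { toolCall: string };

const EXIT_LOOP_TOOLS = ["task_complete", "ask_question"];
const MAX_NUM_TURNS = 10;

// One turn: append the assistant reply, plus a tool message if a tool was called.
function processSingleTurn(messages: Message[], reply: Reply): void {
    if ("toolCall" in reply) {
        messages.push({ role: "assistant", content: "", toolCalls: [{ name: reply.toolCall }] });
        messages.push({ role: "tool", name: reply.toolCall, content: `result of ${reply.toolCall}` });
    } else {
        messages.push({ role: "assistant", content: reply.text });
    }
}

function runAgent(userInput: string, script: Reply[]): { messages: Message[]; turns: number } {
    const messages: Message[] = [{ role: "user", content: userInput }];
    let numOfTurns = 0;
    let nextTurnShouldCallTools = true;
    while (true) {
        // When the script runs out, the fake model just answers with plain text.
        const reply: Reply = numOfTurns < script.length ? script[numOfTurns] : { text: "(nothing more to do)" };
        processSingleTurn(messages, reply);
        numOfTurns++;
        const last = messages[messages.length - 1];
        // Same three exit checks as the loop above, in the same order.
        if (last.role === "tool" && EXIT_LOOP_TOOLS.indexOf(last.name) !== -1) break;
        if (last.role !== "tool" && numOfTurns > MAX_NUM_TURNS) break;
        if (last.role !== "tool" && nextTurnShouldCallTools) break;
        nextTurnShouldCallTools = last.role !== "tool";
    }
    return { messages, turns: numOfTurns };
}

// A tool call, then two plain-text replies in a row: the loop stops after the second one.
const haikuRun = runAgent("write a haiku to hf.txt", [
    { toolCall: "write_file" },
    { text: "I wrote the haiku." },
    { text: "Anything else?" },
]);
console.log(haikuRun.turns, haikuRun.messages.map((m) => m.role));
```

Note that the turn limit is only checked when the last message is not a tool result, so in this sketch (as in the loop above) an uninterrupted chain of tool calls is not cut off by `MAX_NUM_TURNS`.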

Next steps

There are many cool potential next steps once you have a running MCP Client and a simple way to build Agents 🔥

Experiment with other models
- mistralai/Mistral-Small-3.1-24B-Instruct-2503 is optimized for function calling
- Gemma 3 27B and the Gemma 3 QAT models are a popular choice for function calling, though they would require us to implement tool parsing, as they do not use native tools (a PR would be welcome!)
Experiment with all the Inference Providers:
- Cerebras, Cohere, Fal, Fireworks, Hyperbolic, Nebius, Novita, Replicate, SambaNova, Together, etc.
- each of them has different optimizations for function calling (also depending on the model) so performance may vary!
Hook local LLMs using llama.cpp or LM Studio

Pull requests and contributions are welcome! Again, everything here is open source! 💎❤️
