Cohere on Hugging Face Inference Providers 🔥

Vaibhav Srivastav, ben burtenshaw, merve, Célina Hanouti, Alejan · 2025-04-16 · via Hugging Face - Blog

We're thrilled to share that Cohere is now a supported Inference Provider on the HF Hub! This also makes Cohere the first model creator to share and serve its models directly on the Hub.

Cohere is committed to building and serving models purpose-built for enterprise use cases. Its comprehensive suite of secure AI solutions, from cutting-edge Generative AI to powerful Embeddings and Ranking models, is designed to tackle real-world business challenges. Additionally, Cohere Labs, Cohere’s in-house research lab, supports fundamental research and seeks to change the spaces where research happens.

Starting now, you can run serverless inference to the following models via Cohere and Inference Providers:

Light up your projects with Cohere and Cohere Labs today!

Cohere Models

Cohere and Cohere Labs bring a swathe of their models to Inference Providers that excel at specific business applications. Let’s explore some in detail.

CohereLabs/c4ai-command-a-03-2025 🔗

Optimized for demanding enterprises that require fast, secure, and high-quality AI. Its 256k context length (2x most leading models) can handle much longer enterprise documents. Other key features include Cohere’s advanced retrieval-augmented generation (RAG) with verifiable citations, agentic tool use, enterprise-grade security, and strong multilingual performance (support for 23 languages).

CohereLabs/aya-expanse-32b 🔗

Focuses on state-of-the-art multilingual support, applying the latest research on multilingual pre-training. Supports Arabic, Chinese (simplified & traditional), Czech, Dutch, English, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Korean, Persian, Polish, Portuguese, Romanian, Russian, Spanish, Turkish, Ukrainian, and Vietnamese with 128K context length.

CohereLabs/c4ai-command-r7b-12-2024 🔗

Ideal for low-cost or low-latency use cases, bringing state-of-the-art performance in its class of open-weight models across real-world tasks. This model offers a context length of 128k. It delivers a powerful combination of multilingual support, citation-verified retrieval-augmented generation (RAG), reasoning, tool use, and agentic behavior. Also supports 23 languages.

CohereLabs/aya-vision-32b 🔗

32-billion parameter model with advanced capabilities optimized for a variety of vision-language use cases, including OCR, captioning, visual reasoning, summarization, question answering, code, and more. It expands multimodal capabilities to 23 languages spoken by over half the world's population.

How it works

You can use Cohere models directly on the Hub either on the website UI or via the client SDKs.

You can find all the examples mentioned in this section on the Cohere documentation page.

In the website UI

You can search for Cohere models by filtering by the inference provider in the model hub.

From the Model Card, you can select the inference provider and run inference directly in the UI.

From the client SDKs

Let’s walk through using Cohere models from client SDKs. We’ve also made a colab notebook with these snippets, in case you want to try them out right away.

from Python, using huggingface_hub

The following example shows how to use Command R7B with Cohere as your inference provider. You can use a Hugging Face token for automatic routing through Hugging Face, or your own Cohere API key if you have one.

Install huggingface_hub v0.30.0 or later:

pip install -U "huggingface_hub>=0.30.0"

Use the huggingface_hub Python library to call Cohere endpoints by setting the provider parameter.

from huggingface_hub import InferenceClient

client = InferenceClient(
    provider="cohere",
    api_key="xxxxxxxxxxxxxxxxxxxxxxxx",
)

messages = [
        {
            "role": "user",
            "content": "How to make extremely spicy Mayonnaise?"
        }
]

completion = client.chat.completions.create(
    model="CohereLabs/c4ai-command-r7b-12-2024",
    messages=messages,
    temperature=0.7,
    max_tokens=512,
)

print(completion.choices[0].message)
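The placeholder key above is best kept out of source code. Below is a minimal sketch of one way to pick the key from the environment. It assumes the Hugging Face token is exported as HF_TOKEN and that a Cohere key, if you have one, is exported under the hypothetical name COHERE_API_KEY; the helper resolve_api_key is illustrative and not part of huggingface_hub.

```python
import os


def resolve_api_key(env=None):
    """Return (key, mode) for the inference client.

    Prefers a Cohere key (direct request, billed by Cohere) and falls back
    to a Hugging Face token (routed request, billed through the Hub).
    """
    env = os.environ if env is None else env

    # COHERE_API_KEY is an assumed variable name used only for this sketch.
    cohere_key = env.get("COHERE_API_KEY")
    if cohere_key:
        return cohere_key, "direct"

    hf_token = env.get("HF_TOKEN")
    if hf_token:
        return hf_token, "routed"

    raise RuntimeError("Set HF_TOKEN or COHERE_API_KEY before creating the client")
```

The returned key can then be passed as api_key when constructing InferenceClient(provider="cohere", ...), exactly as in the example above.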

Aya Vision, Cohere Labs’ multilingual, multimodal model, is also supported. You can include images encoded in base64 as follows:

import base64

# Read the image file and encode it as a base64 data URL
image_path = "img.jpg"
with open(image_path, "rb") as f:
    base64_image = base64.b64encode(f.read()).decode("utf-8")
image_url = f"data:image/jpeg;base64,{base64_image}"

from huggingface_hub import InferenceClient

client = InferenceClient(
    provider="cohere",
    api_key="xxxxxxxxxxxxxxxxxxxxxxxx",
)

messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What's in this image?"
                },
                {
                    "type": "image_url",
                    "image_url": {"url": image_url},
                },
            ]
        }
]

completion = client.chat.completions.create(
    model="CohereLabs/aya-vision-32b",
    messages=messages,
    temperature=0.7,
    max_tokens=512,
)

print(completion.choices[0].message)
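The snippet above hard-codes the image/jpeg media type. If your images come in several formats, the data URL can be built by a small helper. This is a sketch using only the Python standard library; to_data_url is a hypothetical name, and it guesses the media type from the file extension rather than inspecting the bytes.

```python
import base64
import mimetypes


def to_data_url(image_bytes, filename):
    """Encode raw image bytes as a base64 data URL.

    The media type is guessed from the filename extension, so a wrongly
    named file yields a wrong media type.
    """
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Cannot determine an image media type for {filename!r}")
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime};base64,{encoded}"
```

The result can be used as the "url" value of the image_url entry in the messages list.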

from JS using @huggingface/inference

import { HfInference } from "@huggingface/inference";

const client = new HfInference("xxxxxxxxxxxxxxxxxxxxxxxx");

const chatCompletion = await client.chatCompletion({
    model: "CohereLabs/c4ai-command-a-03-2025",
    messages: [
        {
            role: "user",
            content: "How to make extremely spicy Mayonnaise?"
        }
    ],
    provider: "cohere",
    max_tokens: 512
});

console.log(chatCompletion.choices[0].message);

From OpenAI client

Here's how you can call Command A using Cohere as the inference provider via the OpenAI client library.

from openai import OpenAI

client = OpenAI(
    base_url="https://router.huggingface.co/cohere/compatibility/v1",
    api_key="xxxxxxxxxxxxxxxxxxxxxxxx",
)

messages = [
        {
            "role": "user",
            "content": "How to make extremely spicy Mayonnaise?"
        }
]

completion = client.chat.completions.create(
    model="command-a-03-2025",
    messages=messages,
    temperature=0.7,
)

print(completion.choices[0].message)

Tool Use with Cohere Models

Cohere’s models bring state-of-the-art agentic tool use to Inference Providers so let’s explore that in detail. Both the Hugging Face Hub client and the OpenAI client are compatible with tools via inference providers, so the above examples can be expanded.

First, we need to define the tools the model can use. Below we define get_flight_info, which calls an API for the latest flight information between two locations. The tool definition is rendered into the prompt by the model’s chat template, which we can also explore in the model card (🎉 open source).

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_flight_info",
            "description": "Get flight information between two cities or airports",
            "parameters": {
                "type": "object",
                "properties": {
                    "loc_origin": {
                        "type": "string",
                        "description": "The departure airport, e.g. MIA",
                    },
                    "loc_destination": {
                        "type": "string",
                        "description": "The destination airport, e.g. NYC",
                    },
                },
                "required": ["loc_origin", "loc_destination"],
            },
        },
    }
]

Next, we pass messages to the inference client so the model can use the tools when relevant. For the sake of clarity, the example below writes out the assistant’s tool call (in tool_calls) and the tool’s response by hand.


messages = [
    {"role": "developer", "content": "Today is April 30th"},
    {
        "role": "user",
        "content": "When is the next flight from Miami to Seattle?",
    },
    {
        "role": "assistant",
        "tool_calls": [
            {
                "function": {
                    "arguments": '{ "loc_destination": "Seattle", "loc_origin": "Miami" }',
                    "name": "get_flight_info",
                },
                "id": "get_flight_info0",
                "type": "function",
            }
        ],
    },
    {
        "role": "tool",
        "name": "get_flight_info",
        "tool_call_id": "get_flight_info0",
        "content": "Miami to Seattle, May 1st, 10 AM.",
    },
]

Finally, the tools and messages are passed to the create method.

from huggingface_hub import InferenceClient

client = InferenceClient(
    provider="cohere",
    api_key="xxxxxxxxxxxxxxxxxxxxxxxx",
)

completion = client.chat.completions.create(
    model="CohereLabs/c4ai-command-r7b-12-2024",
    messages=messages,
    tools=tools,
    temperature=0.7,
    max_tokens=512,
)

print(completion.choices[0].message)
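In a real application the tool message is not written by hand: your code runs the function the model asked for and appends the result. The following is a minimal sketch of that step. It works on plain dicts shaped like the tool_calls entry in the messages above (the client may return objects with attributes instead), and both the local get_flight_info implementation and the TOOL_REGISTRY name are hypothetical stand-ins.

```python
import json


def get_flight_info(loc_origin, loc_destination):
    # Hypothetical stand-in for a call to a real flight information API.
    return f"{loc_origin} to {loc_destination}, May 1st, 10 AM."


# Maps tool names from the tool definitions to local Python callables.
TOOL_REGISTRY = {"get_flight_info": get_flight_info}


def run_tool_call(tool_call):
    """Execute one tool call and return the matching 'tool' role message."""
    function = tool_call["function"]
    name = function["name"]
    if name not in TOOL_REGISTRY:
        raise KeyError(f"Model requested an unknown tool: {name}")

    # The arguments arrive as a JSON-encoded string, not a dict.
    arguments = json.loads(function["arguments"])
    result = TOOL_REGISTRY[name](**arguments)

    return {
        "role": "tool",
        "name": name,
        "tool_call_id": tool_call["id"],
        "content": result,
    }
```

Appending the returned message to messages and calling create again lets the model answer the user with the tool result in context.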

Billing

For direct requests, i.e. when you use a Cohere key, you are billed directly on your Cohere account.

For routed requests, i.e. when you authenticate via the Hub, you'll only pay the standard Cohere API rates. There's no additional markup from us; we just pass through the provider costs directly. (In the future, we may establish revenue-sharing agreements with our provider partners.)

Important Note ‼️ PRO users get $2 worth of Inference credits every month. You can use them across providers. 🔥

Subscribe to the Hugging Face PRO plan to get access to Inference credits, ZeroGPU, Spaces Dev Mode, 20x higher limits, and more.
