Choosing Models for an Agentic Chat App on Amazon Bedrock
When building an agentic chat application on Amazon Bedrock, one of the first hard decisions is model selection.
This article is not a rigorous benchmark or formal evaluation. It is simply a set of practical notes from experimenting with multiple Bedrock models while building a personal agentic chat application. Pricing, supported features, and regional behavior change frequently, so you should always validate with official documentation and your own workload before making production decisions.
The app I’m currently building is a serverless agent that gets invoked from Slack. It receives user messages and dynamically calls tools such as memory, task management, calendar integration, web extraction, and custom skills.
So this is not just a simple chatbot.
user message
-> model decides tool usage
-> tool execution
-> model observes result
-> sometimes replans
-> final Slack response
In this setup, model pricing alone is not enough. Tool call stability, Japanese UX quality, retry rate, fallback frequency, and output token volume all matter a lot.
My conclusion, at least for now, is that Moonshot AI’s Kimi K2.5 works best as the primary model.
Sonnet Is Expensive
Claude Sonnet is the baseline reference point.
Claude Sonnet 4.5 costs $3 per 1M input tokens and $15 per 1M output tokens. Claude Haiku 4.5 is much cheaper at $1 / $5, so while Sonnet provides reassuring quality, the cost becomes significant for agentic chat workloads where output tokens can grow quickly.
Agentic chat systems often invoke the model multiple times for a single user message. Tool schemas, tool results, conversation history, and system prompts all inflate token usage compared to ordinary Q&A applications.
Because of that, I positioned Sonnet like this from the beginning:
Claude Sonnet:
fallback on failure
escalator for high-value users
difficult multi-step reasoning
For the main model, I needed something cheaper than Sonnet while still being more reliable for agentic behavior than lightweight models.
Haiku Is Cheap, but Slightly Weak
Claude Haiku 4.5 is attractive from a pricing perspective. If your architecture benefits heavily from prompt caching, it can become extremely cost efficient for applications with large system prompts and repeated tool schemas.
Bedrock prompt caching reduces input token cost and latency by caching repeated prompt prefixes.
Still, in my own testing, Haiku felt slightly too weak to serve as the main model.
It works well for simple classification, lightweight extraction, and short summaries. But I had concerns about tool selection, replanning stability, Japanese response quality, and multi-step reliability.
So Haiku feels better suited as a helper model rather than the primary agent model.
Claude Haiku:
routing
lightweight classification
lightweight extraction
first-pass processing
MiniMax M2.5 Is Cheap and Agent-Friendly — but Japanese UX Is Weak
MiniMax M2.5 was one of the strongest candidates.
According to the Bedrock model card, MiniMax M2.5 is positioned as an “agent-native frontier model” optimized for reasoning efficiency, task decomposition, complex workflows, and agentic scaffolding. It supports a 196K context window and 8K maximum output tokens.
The pricing is also extremely competitive.
In the Tokyo region:
| Model | Approximate Cost for 1,000 Calls |
|---|---|
| MiniMax M2.5 | ~$4.32 |
| Mistral Large 3 | ~$6.70 |
| Kimi K2.5 | ~$9.36 |
On paper, MiniMax M2.5 is very attractive. It also supports Bedrock Agents, Flows, and structured outputs.
However, after actually using it, I felt that the Japanese UX and customer-facing conversational quality were slightly off. It may work well for internal planning or orchestration, but I was not fully comfortable exposing it directly to users in Slack conversations.
MiniMax is probably one of the strongest cost-performance options available today, but I ultimately excluded it as the main chat model.
Gemma Is Extremely Cheap, but Better for First-Pass Processing
The Gemma 3 family was also considered.
In the Tokyo region, Gemma 3 pricing is extremely low:
- Gemma 3 27B: $0.28 / $0.46
- Gemma 3 12B: $0.11 / $0.35
- Gemma 3 4B: $0.05 / $0.10
At those prices, Gemma becomes very useful for:
- classification
- lightweight RAG answers
- short summaries
- routing
- first-pass response generation
However, my target workload was an agentic chat main model. Since even Haiku already felt slightly weak for that role, Gemma was difficult to justify as the primary agent.
Nemotron 3 Super 120B
At one point I also evaluated NVIDIA Nemotron 3 Super 120B.
According to the Bedrock model card, Nemotron 3 Super is a 120B-parameter open hybrid MoE model with 12B active parameters. It targets complex multi-agent applications and supports a 256K context window with 32K output tokens.
Pricing is surprisingly low:
- $0.18 / 1M input tokens
- $0.78 / 1M output tokens
Even cheaper than MiniMax.
On paper, it looked extremely compelling.
However, in my own testing, on-demand invocation latency in the Tokyo region was sometimes very slow, and even short responses occasionally timed out. Meanwhile, in us-east-1, forced tool calls and short responses often completed in around 2–3 seconds.
So I would not conclude that Nemotron itself is fundamentally slow. Regional infrastructure and routing likely have a large impact.
Since my target use case is a customer-facing chat application deployed in Tokyo, I decided not to use it as the main model.
Nemotron 3 Super:
strong pricing and specs
tool use works
but latency in ap-northeast-1 felt risky
Mistral Large 3 Is Good, but Not Decisive
Mistral Large 3 was also a very realistic option.
According to the Bedrock model card, Mistral Large 3 is a 675B-parameter model optimized for coding, reasoning, and multilingual tasks. It supports a 256K context window and 32K output tokens.
In Bedrock Runtime, it supports Agents, Flows, structured outputs, and prompt caching.
Pricing in Tokyo:
- $0.61 / 1M input tokens
- $1.82 / 1M output tokens
Considerably cheaper than Kimi K2.5.
My practical experience with it was not bad at all. But in this specific agentic chat workload, Kimi K2.5 consistently felt more stable.
Also, while the official model card says prompt caching is supported, I occasionally saw Bedrock reject requests when using cachePoint in my own setup.
Mistral offers a very good balance between cost and quality, but Kimi ultimately ranked higher for my use case.
Why I Ended Up Choosing Kimi K2.5
In the end, I chose moonshotai.kimi-k2.5 as the main model.
The reason is simple:
Among all the models I tested, it provided the best balance of agentic behavior stability and Japanese UX quality.
According to the Bedrock model card, Kimi K2.5 offers improved reasoning, coding, and multilingual capabilities. It supports a 256K context window, 16K output tokens, and multimodal image input.
Within Bedrock Runtime, it supports:
- response streaming
- Guardrails
- Prompt Management
- Flows
- Agents
- structured outputs
Pricing in Tokyo:
- $0.72 / 1M input tokens
- $3.60 / 1M output tokens
More expensive than MiniMax or Mistral, but still significantly cheaper than Sonnet.
When selecting models, failure rate matters as much as raw token pricing.
Even if a model is cheap, frequent tool selection failures, malformed JSON, retries, or Sonnet fallbacks can easily increase the total effective cost.
In agentic systems especially, a single bad decision can cascade into failed tool calls and unnecessary replanning.
That is why my final evaluation of Kimi K2.5 became:
Not the cheapest model, but the most stable main model.
No Prompt Cache Support for Kimi K2.5 on Bedrock
One unfortunate limitation is prompt caching.
The Bedrock model card for Kimi K2.5 lists support for Agents, Flows, and structured outputs, but does not currently mention prompt caching.
The Bedrock prompt caching documentation explicitly lists which models support cache checkpoints and where they can be inserted (system, messages, or tools). Claude models and some others are listed there, but Kimi K2.5 currently has weak evidence for Bedrock-side prompt cache support.
Moonshot’s direct API does show cache-hit pricing for Kimi K2.5.
However, that does not automatically mean the same cache behavior or pricing applies through Bedrock.
Reducing Cost with Payload Slimming and Flex Tier
Once Kimi K2.5 became the primary model, the next challenge was cost optimization.
Especially output tokens.
The first thing that matters is payload slimming.
That means minimizing:
- system prompts
- tool schemas
- tool results
- conversation history
- RAG excerpts
In agentic chat systems, tool schemas and tool results can dramatically inflate input token usage.
Some practical optimizations:
- limit
maxTokensdepending on workload - avoid exposing long intermediate reasoning
- trim tool results down to only required fields
- avoid injecting every tool schema every time
- cache repeated FAQ answers, search results, and tool results on the application side
These optimizations matter regardless of which model you choose.
I also started experimenting with Bedrock Flex tier.
Bedrock provides Standard, Flex, Priority, and Reserved service tiers. Flex is intended for workloads that can tolerate slightly more variable latency in exchange for lower cost.
AWS documentation specifically mentions:
- model evaluation
- summarization
- agentic workflows
Moonshot Flex pricing on Bedrock is advertised at roughly a 50% discount compared to Standard.
That means Kimi K2.5 in Tokyo becomes approximately:
| Tier | Input | Output |
|---|---|---|
| Standard | $0.72 | $3.60 |
| Flex | ~$0.36 | ~$1.80 |
Initially, I planned to use Standard for interactive chat and Flex only for asynchronous tasks, evaluations, summaries, and background processing.
However, after trying Kimi K2.5 on Flex, the latency for lightweight Slack interactions felt much better than expected.
This is not a rigorous benchmark, and behavior may differ under heavy load or long tool loops.
Still, for small-scale personal projects or serverless agents, starting with Flex for the main response path actually feels realistic.
My current setup looks roughly like this:
main interactive responses:
moonshotai.kimi-k2.5 / Flex
async processing, summaries, evaluations:
moonshotai.kimi-k2.5 / Flex
failure handling and difficult reasoning:
Claude Sonnet fallback
lightweight classification and routing:
cheaper helper models
Explaining Security Concerns Around Chinese Models
When using Chinese-origin models like Kimi K2.5 or MiniMax M2.5, security concerns often appear internally.
The important point is not to argue that “Chinese models are safe.”
Instead, the distinction between:
- direct API usage
- Bedrock-managed usage
must be explained clearly.
According to Amazon Bedrock documentation, model providers cannot access Bedrock logs or customer prompts/completions.
That means using Kimi or MiniMax through Bedrock has a very different risk profile compared to directly calling vendor APIs.
The explanation I found most practical was:
We are not sending data directly to a Chinese model provider.
The models are executed within Amazon Bedrock’s managed environment.
Customer prompts and completions are not shared with the model provider through Bedrock.
Therefore, the main operational concerns become:
IAM
logging
Guardrails
RAG access control
tool-call permissions
Final Architecture
My final conclusion currently looks like this:
main model:
moonshotai.kimi-k2.5
interactive tier:
currently testing Flex
fallback to Standard if latency becomes problematic
cost-sensitive tier:
Flex
fallback:
Claude Sonnet
helper models:
MiniMax / Gemma / Nemotron for specialized workloads
Model roles ended up being:
| Model | Evaluation | Role |
|---|---|---|
| Claude Sonnet | Excellent but expensive | fallback / escalator |
| Claude Haiku | Cheap but slightly weak | routing / extraction |
| MiniMax M2.5 | Cheap and agent-oriented | not ideal for Japanese-facing UX |
| Gemma 3 | Extremely cheap | first-pass processing |
| Nemotron 3 Super | Cheap, non-Chinese, tool-capable | latency concerns in Tokyo |
| Mistral Large 3 | Strong balance | good, but less stable than Kimi |
| Kimi K2.5 | Strong Japanese UX and tool stability | main model |
Closing Thoughts
This whole exploration started from a simple question:
“Sonnet is expensive. Is there a cheaper main model for agentic chat?”
MiniMax M2.5 was extremely attractive in terms of pricing and agent-oriented behavior, but the Japanese customer-facing UX did not fully work for me.
Mistral Large 3 offered an excellent balance overall, but Kimi K2.5 consistently felt more stable.
Nemotron 3 Super 120B looked fascinating from a pricing and specification perspective, but latency in the Tokyo region made it difficult to trust for customer-facing chat.
Haiku can become highly cost efficient with prompt caching, but it still felt slightly weak for my main agent workload.
As a result, I settled on:
- Kimi K2.5 as the main model
- Sonnet as fallback
- Flex tier and payload slimming for cost optimization
For my own use case, Kimi K2.5 was not the absolute cheapest model.
But once retry rates, UX quality, and operational stability were included in the calculation, it delivered the best effective cost.
Going forward, I want to build more formal evaluations around:
- conversational quality
- tool call success rate
- retry frequency
- latency
- Japanese UX
- token cost
Rather than endlessly adding more candidate models, I want to keep pruning the stack into something operationally simple and reliable.




















