Smart LLM router — one call, the right model.
Async and fully typed, across Anthropic, Gemini, xAI Grok, and OpenAI — routing each request to the right model tier by difficulty. Stop hardcoding model choices and stop re-writing the tool-use / structured-output / web-search / retry plumbing for every provider.
maslul (Hebrew מסלול, "route / lane") is a small library that does exactly two things:
routing (pick a model tier per request, or pin one) and provider normalization (one
Request/Response shape for every SDK). No server, no CLI, no heavy ML deps — providers
live behind extras, and the core is stdlib-only.
import asyncio from maslul import Router, Request, Message router = Router.from_toml("maslul.toml") # tiers + classifier + providers, from config async def main() -> None: resp = await router.complete(Request(messages=[Message(role="user", content="Hello!")])) print(resp.text, "·", resp.level_used, "·", resp.usage.output_tokens, "tokens") asyncio.run(main())
Install
pip install "maslul[anthropic,gemini,grok]" # or just the providers you use
Each provider's SDK lives behind an extra, so import maslul pulls in none of them — you
only install what you route to. maslul[anthropic] → anthropic; maslul[gemini] →
google-genai; maslul[grok] → xai-sdk; maslul[openai] → openai.
How it compares
maslul is a library, not a gateway — you embed the routing brain in your app, you don't run a proxy in front of it.
| maslul | RouteLLM | LiteLLM | |
|---|---|---|---|
| Shape | async library you embed (no server) | research framework / trained router | unified SDK + proxy server |
| Routing | difficulty tiers + swappable strategies (route_default / classify / classify_and_answer / verify_cascade) + injectable bypass / classifier / verifier hooks |
a trained strong-vs-weak router | manual config / fallback lists, load-balancing |
| Providers | Anthropic · Gemini · Grok · OpenAI, normalized | model-agnostic (you wire models) | 100+ providers |
| Tools / structured / vision | one normalized loop for all | — | per-provider |
| Web search | one flag, every provider → Response.sources |
— | per-provider |
| Caching | exact + semantic (in-process) | — | exact + semantic (proxy) |
| Typing / footprint | fully typed, py.typed; stdlib core, SDKs behind extras |
research code | larger; server to operate |
Choose maslul when you want a typed async library you embed — difficulty routing with your own
strategy + hooks, and one Request/Response over several providers (tools, structured output,
vision, web search, retries, cost cache) — without standing up a gateway. Reach for LiteLLM
when you want a provider proxy across 100+ models, or RouteLLM when you specifically want a
trained router.
The routing brain
flowchart LR
R["complete(req)"] --> M{"model= pin?"}
M -- yes --> RUN["run that model"]
M -- no --> L{"level= pin?"}
L -- yes --> RUN
L -- no --> B{"bypass_predicate?"}
B -- "tier" --> RUN
B -- "None" --> H{"hard_signal?<br/>(media · code · long · intent verbs)"}
H -- "yes" --> HARD["HARD tier"] --> RUN
H -- "no" --> S["strategy<br/>route_default · classify ·<br/>classify_and_answer · verify_cascade"] --> RUN
RUN --> X["tool loop · web search ·<br/>retry / fallback · usage breakdown"]
Routing
Difficulty is not readable from surface features — a short prompt can be very hard, a long
paste trivial — so maslul never applies a short ⇒ simple rule. You choose how each request is
routed, in this precedence order:
from maslul import Level await router.complete(req, model="anthropic:claude-opus-4-8") # 0. pin an exact model await router.complete(req, level=Level.HARD) # 1. pin a difficulty tier await router.complete(req) # 2-4. let the router decide
When you don't pin, the routing brain runs: a deterministic bypass (your fast-path, e.g. greetings → SIMPLE) → a hard-signal detector (intent verbs, code, attachments, long context → HARD, up-only) → the configured strategy for the ambiguous middle:
| Strategy | Cost for the middle | What it does |
|---|---|---|
ROUTE_DEFAULT |
0 calls | Default-to-capable (default_level). Best for low volume. |
CLASSIFY |
1 classify + 1 answer | A cheap dedicated classifier model labels the level (cached + budget-guarded), then dispatch. |
CLASSIFY_AND_ANSWER |
1 call | The classifier model answers directly, or emits an escalation sentinel to bump to a stronger tier. |
VERIFY_CASCADE |
1 cheap + verify | Answer cheap, run your verifier, escalate if it rejects — catches silent under-escalation. |
All three injection points are yours to supply:
def my_classifier(req): # your own difficulty call (sync or async); None defers to the strategy return Level.SIMPLE if is_trivial(req) else None def my_verifier(req, resp): # VERIFY_CASCADE: True keeps the cheap answer, False escalates return "I don't know" not in resp.text router = Router.from_toml("maslul.toml", classifier=my_classifier, verifier=my_verifier)
One shape for every capability
The same Request/Response works across all three providers:
from maslul import Request, Message, ToolDef, ToolCall, MediaPart # Tools — the router runs a provider-agnostic tool-use loop async def get_weather(call: ToolCall) -> str: return f"18°C in {call.input['city']}" req = Request( messages=[Message(role="user", content="Weather in Paris?")], tools=[ToolDef(name="get_weather", description="Current weather for a city.", input_schema={"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]})], tool_executor=get_weather, ) # Structured output — response_format → resp.structured (parsed) req = Request(messages=[Message(role="user", content="Extract name + age")], response_format={"type": "object", "properties": {"name": {"type": "string"}, "age": {"type": "integer"}}}) # Vision — images / PDFs req = Request(messages=[Message(role="user", content="What's in this image?")], media=[MediaPart(mime_type="image/png", data=png_bytes)]) # Web search — one flag, grounded on ANY provider (Anthropic web_search / Gemini Google Search / # Grok Agent Tools); citations land in resp.sources regardless of which model answers. req = Request(messages=[Message(role="user", content="Latest news on X?")], web_search=True)
Resilience & observability
def on_usage(resp): # per-model token breakdown for monitoring for rec in resp.usage_records: metrics.incr(f"{rec.provider}:{rec.model}", rec.usage.output_tokens) router = Router.from_toml("maslul.toml", on_complete=on_usage)
Transient errors (RateLimited, Timeout) retry with exponential backoff; on persistent failure
the request falls back to the next-higher tier — which may be a different provider, giving you
cross-provider failover for free. AuthError fails fast. Hooks: on_route (the RoutingDecision),
on_complete (the final Response with usage_records), on_error (each failed attempt).
Build a router with missing_provider="degrade" and any tier whose provider isn't configured
(e.g. a Grok tier with no XAI_API_KEY) falls back to the nearest available tier instead of
erroring — so one config runs across deploys that have different keys.
Cost cache
A [maslul.cache] config returns a prior Response instead of calling a model — exact (identical
request) or semantic (nearest request above a cosine threshold, using an embedder you inject, since
maslul ships no embeddings). A hit comes back with cached=True and zeroed usage, so monitoring
sees the saving. Tool-using requests are never cached.
[maslul.cache] mode = "semantic" # off | exact | semantic max_entries = 1000 ttl_seconds = 86400 similarity_threshold = 0.95
router = Router.from_toml("maslul.toml", embed=my_async_embed) # embed only needed for semantic
Configuration
A TOML file (or a plain dict — Router(config={...})):
[maslul] strategy = "route_default" # route_default | classify | classify_and_answer | verify_cascade default_level = "hard" # default-to-capable for the ambiguous middle min_tokens_to_classify = 40 # CLASSIFY budget guard request_timeout = 60 # per-call seconds (optional) max_retries = 2 fallback = true # escalate to a higher tier on persistent failure [maslul.tiers.simple] provider = "gemini" model = "gemini-2.5-flash-lite" [maslul.tiers.medium] model = "anthropic:claude-haiku-4-5" # or the provider:model shorthand [maslul.tiers.hard] model = "anthropic:claude-sonnet-4-6" [maslul.classifier] # required for the classify strategies model = "anthropic:claude-haiku-4-5" [maslul.providers.anthropic] api_key_env = "ANTHROPIC_API_KEY" # secrets by env-var name, never inlined [maslul.providers.gemini] vertex_project = "my-gcp-project" # Vertex AI + Application Default Credentials (no key) vertex_location = "global" [maslul.providers.grok] api_key_env = "XAI_API_KEY"
Pointing a capability at a different model or provider is a one-line config change — no code
deploy. Providers can also be injected directly (Router(config, providers={...})) for tests or
custom wiring.
Providers
| Provider | SDK (extra) | Auth |
|---|---|---|
anthropic |
anthropic |
ANTHROPIC_API_KEY |
gemini |
google-genai |
Vertex AI + ADC (vertex_project), or a Gemini Developer API key |
grok |
xai-sdk |
XAI_API_KEY |
openai |
openai |
OPENAI_API_KEY |
Status
Beta (0.2.x), fully typed (py.typed), async-first. Routing, tool use, structured output,
vision, web search across all three providers (web_search=True), the four strategies, and
retry/fallback resilience are implemented and exercised against live APIs.
License
MIT © Ilia Tankelevich




















