Cut your LLM API costs by 40-70% with zero code changes.
A semantic caching layer that sits between your app and AI providers (OpenAI, Groq, etc.). When you ask a similar question twice, it returns the cached answer instantly instead of calling the API again.
🎯 What Problem Does This Solve?
You're building an AI app and your API bill is $500/month. 40-70% of that is for repeat questions:
- "What is RAG?" asked 100 times = 100 API calls
- "How do I reset my password?" asked 50 times = 50 API calls
With AI Gateway: Those 150 calls become 2 calls (one for each unique question). You save $200-350/month.
🚀 Deploy in 60 Seconds (3 Options)
Option 1: Railway (Recommended - Includes Redis)
Steps:
- Click the button above
- Sign in with GitHub
- Enter your API key (Groq or OpenAI)
- Click "Deploy"
- Done! Your gateway is live at https://your-app.up.railway.app
What you get:
- ✅ Hosted gateway (no server management)
- ✅ Redis included (persistent cache)
- ✅ Auto-scaling
- ✅ HTTPS enabled
- ✅ $5/month free credit
Option 2: Render (One-Click Deploy)
Steps:
- Click the button
- Sign in with GitHub
- Add environment variable: UPSTREAM_API_KEY=your_key
- Click "Create Web Service"
- Done!
Note: Render does not include Redis by default, so add a Redis instance separately in the Render dashboard.
Option 3: Docker (Self-Hosted)
Prerequisites:
- Docker installed
- Docker Compose installed
- A Groq or OpenAI API key
Steps:
    # 1. Clone the repo
    git clone https://github.com/Arnab758/ai-gateway.git
    cd ai-gateway

    # 2. Set your API key
    export UPSTREAM_API_KEY=gsk_your_groq_key_here

    # 3. Start everything (gateway + Redis)
    docker compose up -d

    # 4. Verify it's running
    curl http://localhost:8080/health
    # Expected response: {"status":"ok"}
That's it! Your gateway is now running at http://localhost:8080
📖 How to Use
Basic Usage (cURL)
    # Send a request through the gateway (-i prints the response headers)
    curl -i -X POST http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -H "X-Gateway-Token: my-app" \
      -H "Authorization: Bearer sk-your-openai-or-groq-key" \
      -d '{
        "model": "gpt-4",
        "messages": [{"role": "user", "content": "What is RAG?"}]
      }'

    # Send the SAME request again
    # Response headers will show: X-Gateway-Cache: HIT
    # You just saved money! 💰
Python Example
    import requests

    # Your gateway URL (from Railway/Render/Docker)
    GATEWAY_URL = "https://your-app.up.railway.app"
    API_KEY = "sk-your-key"

    response = requests.post(
        f"{GATEWAY_URL}/v1/chat/completions",
        headers={
            "Content-Type": "application/json",
            "X-Gateway-Token": "my-app",
            "Authorization": f"Bearer {API_KEY}"
        },
        json={
            "model": "gpt-4",
            "messages": [{"role": "user", "content": "What is RAG?"}]
        }
    )
    print(response.json())
Node.js Example
    const response = await fetch('https://your-app.up.railway.app/v1/chat/completions', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'X-Gateway-Token': 'my-app',
        'Authorization': 'Bearer sk-your-key'
      },
      body: JSON.stringify({
        model: 'gpt-4',
        messages: [{ role: 'user', content: 'What is RAG?' }]
      })
    });

    const data = await response.json();
    console.log(data);
🎮 Try the Interactive Demo
No API key needed! See how caching works:
- Type a prompt and click "Send" (simulation mode)
- Or enter your API key and click "Test Real API" (real caching with Redis)
- Try sending the same prompt twice to see cache hits!
🔥 Key Features
- Semantic Caching - Matches similar questions, not just exact duplicates
  - "What is RAG?" = "Explain RAG" = "RAG definition"
- Multi-Tenant - Each customer gets their own isolated cache
- 4-Tier Matching:
  - Exact match (100% identical)
  - Template match ("weather in London" = "weather in Paris")
  - Semantic match (similar meaning)
  - Word overlap (partial matches)
- Redis + In-Memory Fallback - Works with or without Redis
- Request Deduplication - 100 concurrent identical requests = 1 API call
- Rate Limiting - Prevent abuse per tenant
- Circuit Breaker - Automatically stops calling if provider is down
- Cost Tracking - See how much you saved
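To make the matching tiers more concrete, here is a minimal sketch of two of them: the exact-match tier and the word-overlap tier. It is not the gateway's actual code. The function names (`normalize`, `jaccard`, `lookup`) are hypothetical, and using Jaccard similarity with a 0.75 threshold is an assumption based on the `jaccard_threshold` value reported by `/stats`.

```python
import re


def normalize(prompt: str) -> str:
    """Lowercase and drop punctuation so trivially different prompts compare equal."""
    return " ".join(re.findall(r"[a-z0-9]+", prompt.lower()))


def jaccard(a: str, b: str) -> float:
    """Word-overlap score: shared words divided by all distinct words."""
    words_a, words_b = set(normalize(a).split()), set(normalize(b).split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def lookup(prompt: str, cache: dict, threshold: float = 0.75):
    """Try the exact tier first, then fall back to word overlap.

    `cache` maps normalized prompts to stored responses (hypothetical layout).
    Returns (tier, response), where tier is "exact", "overlap", or "miss".
    """
    key = normalize(prompt)
    if key in cache:
        return "exact", cache[key]
    best = max(cache, key=lambda k: jaccard(key, k), default=None)
    if best is not None and jaccard(key, best) >= threshold:
        return "overlap", cache[best]
    return "miss", None


cache = {normalize("How do I reset my password?"): "Use the reset link."}
print(lookup("how do i reset my password", cache))
print(lookup("How do I reset my password now?", cache))
print(lookup("What is RAG?", cache))
```

The template and semantic tiers would sit between these two in the order listed above. They are left out here because they depend on details this README does not describe.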
📊 Real-World Example
Scenario: Customer support chatbot with 10,000 users
Without AI Gateway:
- 10,000 users ask 100 common questions each
- 1,000,000 API calls/month
- Cost: $500/month (at $0.0005/call)
With AI Gateway:
- First time each of the 100 questions is asked: 100 API calls (cache miss)
- Remaining 999,900 requests for the same questions: 0 API calls (cache hit)
- Total: 100 API calls/month
- Cost: $0.05/month
- Savings: $499.95/month (99.99%)
Even with 30% unique questions:
- 300,000 API calls
- Cost: $150/month
- Savings: $350/month (70%)
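The arithmetic behind this scenario can be reproduced with a few lines of Python. This is only a sketch of the calculation, using the numbers from the scenario above. `monthly_cost` is a hypothetical helper, and the miss rate is simply the share of requests that still reach the provider.

```python
def monthly_cost(total_requests: int, miss_rate: float, price_per_call: float) -> float:
    """Cost of the requests that miss the cache and reach the provider."""
    upstream_calls = round(total_requests * miss_rate)
    return upstream_calls * price_per_call


TOTAL_REQUESTS = 10_000 * 100  # 10,000 users x 100 questions each
PRICE_PER_CALL = 0.0005        # dollars per call, as in the scenario

baseline = monthly_cost(TOTAL_REQUESTS, 1.0, PRICE_PER_CALL)     # no cache: every request is a miss
with_cache = monthly_cost(TOTAL_REQUESTS, 0.30, PRICE_PER_CALL)  # 30% unique questions
savings = baseline - with_cache

print(f"baseline=${baseline:.2f} cached=${with_cache:.2f} "
      f"saved=${savings:.2f} ({savings / baseline:.0%})")
```

Plug in your own request volume, per-call price, and expected share of unique questions to estimate savings for your workload.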
🛠️ Configuration
Edit gateway.yaml to customize:
    cache:
      redis_url: "redis://localhost:6379"  # Or your Redis URL
      vector:
        enabled: true
        similarity_threshold: 0.85  # 85% similar = cache hit
      ttl_hours: 24  # Cache entries expire after 24 hours
    rate_limiter:
      enabled: true
      max_requests: 60  # Per minute per tenant
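To give a feel for what `similarity_threshold: 0.85` means, the sketch below compares two embedding vectors with cosine similarity and treats a score at or above the threshold as a cache hit. This README does not say which similarity metric the gateway uses, so cosine similarity is an assumption here, and `cosine_similarity` and `is_cache_hit` are hypothetical names.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_cache_hit(query_vec, cached_vec, threshold: float = 0.85) -> bool:
    """A cached entry counts as a hit when it is at least `threshold` similar."""
    return cosine_similarity(query_vec, cached_vec) >= threshold


# Toy 2-dimensional vectors; real embeddings have many more dimensions.
print(is_cache_hit([1.0, 0.1], [1.0, 0.0]))  # nearly the same direction
print(is_cache_hit([1.0, 0.0], [1.0, 1.0]))  # 45 degrees apart
```

A lower threshold produces more cache hits but also raises the chance of returning an answer to a question that only looks similar.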
📡 API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /v1/chat/completions | POST | Main proxy endpoint with caching |
| /health | GET | Health check |
| /stats | GET | Cache statistics |
| /metrics | GET | Prometheus metrics |
🔍 Monitoring
Check Cache Stats
curl http://localhost:8080/stats
Response:
{
"uptime": 1234567890,
"cache": {
"local_index_entries": 150,
"vector_dimensions": 128,
"vector_threshold": 0.85,
"jaccard_threshold": 0.75,
"template_enabled": true,
"dedup_enabled": true,
"ttl_hours": 24
}
}
Response Headers
Every response includes cache information:
X-Gateway-Cache: HIT # or MISS
X-Gateway-Similarity: 0.95 # 95% similar (if HIT)
X-Gateway-Time-Saved: 1234ms # Time saved (if HIT)
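If you want to log hits and misses from your own app, a small helper like the one below can turn these headers into a dict. It is a sketch, not part of the gateway: `parse_cache_headers` is a hypothetical name, and it assumes the header formats match the samples above (a plain float for similarity, an integer followed by `ms` for time saved).

```python
def parse_cache_headers(headers) -> dict:
    """Turn the gateway's cache headers into a small dict.

    `headers` is any mapping of header name to value, for example
    `response.headers` from the requests library.
    """
    status = headers.get("X-Gateway-Cache", "MISS")
    similarity = headers.get("X-Gateway-Similarity")
    time_saved = headers.get("X-Gateway-Time-Saved")

    saved_ms = None
    if time_saved is not None:
        # Assumes the "1234ms" format shown above; strip the unit if present.
        digits = time_saved[:-2] if time_saved.endswith("ms") else time_saved
        saved_ms = int(digits)

    return {
        "hit": status.upper() == "HIT",
        "similarity": float(similarity) if similarity is not None else None,
        "time_saved_ms": saved_ms,
    }


info = parse_cache_headers({
    "X-Gateway-Cache": "HIT",
    "X-Gateway-Similarity": "0.95",
    "X-Gateway-Time-Saved": "1234ms",
})
print(info)
```

With the Python example from earlier, you would call `parse_cache_headers(response.headers)` after each request.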
🐛 Troubleshooting
Problem: "Redis connection failed"
Solution: Redis is optional! The gateway will fall back to in-memory cache automatically. For production, add Redis:
- Railway: Add Redis from the "New" button
- Render: Add Redis via "New" → "Database" → "Redis"
- Docker: Already included in docker-compose.yml
Problem: "All upstream providers unavailable"
Cause: You're hitting the rate limits of your provider's free tier (Groq/OpenAI)
Solutions:
- Wait 1-2 minutes and try again
- Upgrade to a paid tier ($0.002/request vs free limits)
- Add your own API key with higher limits
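The "wait and try again" step can also be automated on the client side. The sketch below retries a call with exponential backoff. It is an illustration only: `call_with_backoff` is a hypothetical helper, and `RuntimeError` stands in for whatever error your client raises when the gateway reports that upstream providers are unavailable.

```python
import time


def call_with_backoff(send, max_attempts: int = 4, base_delay: float = 1.0, sleep=time.sleep):
    """Call send() until it succeeds or attempts run out, doubling the wait each time."""
    for attempt in range(max_attempts):
        try:
            return send()
        except RuntimeError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ... with the defaults


# Demo with a fake request that fails twice and then succeeds.
attempts = {"count": 0}


def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("All upstream providers unavailable")
    return "ok"


waits = []
result = call_with_backoff(flaky_request, sleep=waits.append)  # record waits instead of sleeping
print(result, waits)
```

For free-tier limits that reset after a minute or two, a larger `base_delay` is probably more useful than many quick retries.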
Problem: "Rate limit exceeded"
Cause: Too many requests from one tenant
Solution: Increase rate limits in gateway.yaml:
    rate_limiter:
      max_requests: 120  # Increase from 60
      window_minutes: 1
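To show how `max_requests` and `window_minutes` interact, here is a minimal per-tenant fixed-window limiter. The gateway's actual algorithm is not documented in this README, so the fixed-window approach is an assumption, and `FixedWindowLimiter` is a hypothetical class written only for illustration.

```python
import time


class FixedWindowLimiter:
    """Allow at most `max_requests` per tenant in each `window_minutes` window."""

    def __init__(self, max_requests: int = 60, window_minutes: float = 1, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_minutes * 60
        self.clock = clock
        self._windows = {}  # tenant -> (window_start, request_count)

    def allow(self, tenant: str) -> bool:
        now = self.clock()
        start, count = self._windows.get(tenant, (now, 0))
        if now - start >= self.window_seconds:
            start, count = now, 0  # the old window expired: start a fresh one
        if count >= self.max_requests:
            self._windows[tenant] = (start, count)
            return False  # this is where "Rate limit exceeded" would be returned
        self._windows[tenant] = (start, count + 1)
        return True


# Demo with a tiny limit and a fake clock so the result does not depend on timing.
fake_now = [0.0]
limiter = FixedWindowLimiter(max_requests=2, window_minutes=1, clock=lambda: fake_now[0])
print([limiter.allow("my-app") for _ in range(3)])
fake_now[0] = 61.0  # one minute later the window resets
print(limiter.allow("my-app"))
```

Each tenant has its own counter, so one busy tenant does not use up another tenant's allowance.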
Problem: Cache not hitting
Cause: The prompts differ more than the configured similarity thresholds allow
Solution: Lower the similarity thresholds in gateway.yaml:
    cache:
      vector:
        similarity_threshold: 0.75  # Lower from 0.85
      jaccard:
        threshold: 0.65  # Lower from 0.75
🏗️ Architecture
Your App → AI Gateway → [Cache Check] → Redis
↓
[Cache HIT] → Return cached response (instant, $0)
↓
[Cache MISS] → Call LLM Provider → Cache response → Return
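The flow in the diagram can be sketched in a few lines of Python. This is a simplified illustration, not the gateway's implementation: `Gateway` and `call_provider` are hypothetical names, a plain dict stands in for Redis, and only exact matching on a normalized prompt is shown.

```python
class Gateway:
    """Cache check first; call the provider only on a miss, then store the result."""

    def __init__(self, call_provider):
        self.call_provider = call_provider  # function that sends the prompt upstream
        self.cache = {}                     # stand-in for Redis
        self.upstream_calls = 0

    def complete(self, prompt: str):
        key = prompt.strip().lower()
        if key in self.cache:
            return self.cache[key], "HIT"   # served from cache, no provider call
        self.upstream_calls += 1
        response = self.call_provider(prompt)
        self.cache[key] = response          # cache the response for the next request
        return response, "MISS"


# Demo with a fake provider so the sketch runs without an API key.
gateway = Gateway(call_provider=lambda prompt: f"answer to: {prompt}")
first = gateway.complete("What is RAG?")
second = gateway.complete("what is rag?")
print(first[1], second[1], gateway.upstream_calls)
```

The second request is served from the cache, so the provider is called only once.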
🤝 Contributing
Contributions are welcome! Please:
- Fork the repo
- Create a feature branch
- Make your changes
- Submit a pull request
📄 License
MIT License - feel free to use this commercially!
🙋 Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Demo: Live Demo
⭐ Star History
If this project helps you, please give it a star! It helps others find it.
Built with ❤️ for the AI community
Questions? Open an issue and I'll respond within 24 hours.