GitHub - Arnab758/ai-gateway

Cut your LLM API costs by 40-70% with zero code changes.

A semantic caching layer that sits between your app and AI providers (OpenAI, Groq, etc.). When you ask a similar question twice, it returns the cached answer instantly instead of calling the API again.

🎯 What Problem Does This Solve?

You're building an AI app and your API bill is $500/month. 40-70% of that is for repeat questions:

"What is RAG?" asked 100 times = 100 API calls
"How do I reset my password?" asked 50 times = 50 API calls

With AI Gateway: Those 150 calls become 2 calls (one for each unique question). If 40-70% of your traffic consists of repeats like these, you save $200-350 of that $500 monthly bill.

🚀 Deploy in 60 Seconds (3 Options)

Option 1: Railway (Recommended - Includes Redis)

Steps:

Click the button above
Sign in with GitHub
Enter your API key (Groq or OpenAI)
Click "Deploy"
Done! Your gateway is live at https://your-app.up.railway.app

What you get:

✅ Hosted gateway (no server management)
✅ Redis included (persistent cache)
✅ Auto-scaling
✅ HTTPS enabled
✅ $5/month free credit

Option 2: Render (One-Click Deploy)

Steps:

Click the button
Sign in with GitHub
Add environment variable: UPSTREAM_API_KEY=your_key
Click "Create Web Service"
Done!

Note: You'll need to add a Redis add-on separately in the Render dashboard.

Option 3: Docker (Self-Hosted)

Prerequisites:

Docker installed
Docker Compose installed
A Groq or OpenAI API key

Steps:

# 1. Clone the repo
git clone https://github.com/Arnab758/ai-gateway.git
cd ai-gateway

# 2. Set your API key
export UPSTREAM_API_KEY=gsk_your_groq_key_here

# 3. Start everything (gateway + Redis)
docker compose up -d

# 4. Verify it's running
curl http://localhost:8080/health

# Expected response: {"status":"ok"}

That's it! Your gateway is now running at http://localhost:8080

📖 How to Use

Basic Usage (cURL)

# Send a request through the gateway
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Gateway-Token: my-app" \
  -H "Authorization: Bearer sk-your-openai-or-groq-key" \
  -d '{
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "What is RAG?"}]
  }'

# Send the SAME request again
# Response headers will show: X-Gateway-Cache: HIT
# You just saved money! 💰

Python Example

import requests

# Your gateway URL (from Railway/Render/Docker)
GATEWAY_URL = "https://your-app.up.railway.app"
API_KEY = "sk-your-key"

response = requests.post(
    f"{GATEWAY_URL}/v1/chat/completions",
    headers={
        "Content-Type": "application/json",
        "X-Gateway-Token": "my-app",
        "Authorization": f"Bearer {API_KEY}"
    },
    json={
        "model": "gpt-4",
        "messages": [{"role": "user", "content": "What is RAG?"}]
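    },
    timeout=30,  # Fail fast if the gateway is unreachable
)

response.raise_for_status()  # Raise on 4xx/5xx instead of parsing an error body
print(response.headers.get("X-Gateway-Cache"))  # "HIT" or "MISS"
print(response.json())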

Node.js Example

const response = await fetch('https://your-app.up.railway.app/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'X-Gateway-Token': 'my-app',
    'Authorization': 'Bearer sk-your-key'
  },
  body: JSON.stringify({
    model: 'gpt-4',
    messages: [{ role: 'user', content: 'What is RAG?' }]
  })
});

const data = await response.json();
console.log(data);

🎮 Try the Interactive Demo

No API key needed! See how caching works:

👉 Open Live Demo

Type a prompt and click "Send" (simulation mode)
Or enter your API key and click "Test Real API" (real caching with Redis)
Try sending the same prompt twice to see cache hits!

🔥 Key Features

Semantic Caching - Matches similar questions, not just exact duplicates
- "What is RAG?" = "Explain RAG" = "RAG definition"
Multi-Tenant - Each customer gets their own isolated cache
4-Tier Matching:
1. Exact match (100% identical)
2. Template match ("weather in London" = "weather in Paris")
3. Semantic match (similar meaning)
4. Word overlap (partial matches)
Redis + In-Memory Fallback - Works with or without Redis
Request Deduplication - 100 concurrent identical requests = 1 API call
Rate Limiting - Prevent abuse per tenant
Circuit Breaker - Automatically stops calling if provider is down
Cost Tracking - See how much you saved
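
The Request Deduplication feature above can be illustrated with a small sketch. This is not the gateway's actual code: the `Deduplicator` class and `fake_upstream` function are hypothetical names, and the sketch only assumes that concurrent requests with the same key share one in-flight upstream call.

```python
import asyncio
from typing import Awaitable, Callable, Dict


class Deduplicator:
    """Coalesce concurrent calls that share a key into one upstream call."""

    def __init__(self) -> None:
        self._in_flight: Dict[str, asyncio.Future] = {}

    async def run(self, key: str, fetch: Callable[[], Awaitable[str]]) -> str:
        existing = self._in_flight.get(key)
        if existing is not None:
            # Another caller already started this request: wait for its result.
            return await existing
        task = asyncio.ensure_future(fetch())
        self._in_flight[key] = task
        try:
            return await task
        finally:
            # Forget the key once the call finishes so later requests start fresh.
            self._in_flight.pop(key, None)


async def demo() -> tuple:
    calls = 0

    async def fake_upstream() -> str:  # hypothetical stand-in for the LLM provider
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)
        return "RAG is retrieval-augmented generation."

    dedup = Deduplicator()
    results = await asyncio.gather(
        *(dedup.run("What is RAG?", fake_upstream) for _ in range(100))
    )
    return calls, results


calls, results = asyncio.run(demo())
print(calls, len(set(results)))  # 1 1
```

All 100 callers receive the same response, while the stand-in provider is invoked once.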

📊 Real-World Example

Scenario: Customer support chatbot with 10,000 users

Without AI Gateway:

10,000 users ask 100 common questions each
1,000,000 API calls/month
Cost: $500/month (at $0.0005/call)

With AI Gateway:

First 100 questions: 100 API calls (cache miss)
Remaining 999,900 requests for the same questions: 0 API calls (cache hit)
Total: 100 API calls/month
Cost: $0.05/month
Savings: $499.95/month (99.99%)

Even with 30% unique questions:

300,000 API calls
Cost: $150/month
Savings: $350/month (70%)
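
The arithmetic in this scenario can be reproduced with a few lines of Python. The `monthly_cost` helper is a hypothetical name used for illustration; it assumes a flat price per call and that every non-unique request is a cache hit, which is the best case.

```python
def monthly_cost(total_requests: int, unique_fraction: float, price_per_call: float) -> dict:
    """Estimate monthly spend with and without caching.

    Assumes every repeated request is served from cache; real hit rates
    depend on the TTL and the similarity thresholds.
    """
    baseline = total_requests * price_per_call
    upstream_calls = round(total_requests * unique_fraction)
    cached = upstream_calls * price_per_call
    savings = baseline - cached
    return {
        "upstream_calls": upstream_calls,
        "baseline_cost": round(baseline, 2),
        "cached_cost": round(cached, 2),
        "savings": round(savings, 2),
        "savings_pct": round(100 * savings / baseline, 2) if baseline else 0.0,
    }


# 1,000,000 requests/month at $0.0005 per call, 30% of them unique
print(monthly_cost(1_000_000, 0.30, 0.0005))
```

Changing `unique_fraction` to `0.0001` reproduces the 100-call case.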

🛠️ Configuration

Edit gateway.yaml to customize:

cache:
  redis_url: "redis://localhost:6379"  # Or your Redis URL
  vector:
    enabled: true
    similarity_threshold: 0.85  # 85% similar = cache hit
  ttl_hours: 24  # Cache entries expire after 24 hours

rate_limiter:
  enabled: true
  max_requests: 60  # Per minute per tenant

📡 API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/v1/chat/completions` | POST | Main proxy endpoint with caching |
| `/health` | GET | Health check |
| `/stats` | GET | Cache statistics |
| `/metrics` | GET | Prometheus metrics |

🔍 Monitoring

Check Cache Stats

curl http://localhost:8080/stats

Response:

{
  "uptime": 1234567890,
  "cache": {
    "local_index_entries": 150,
    "vector_dimensions": 128,
    "vector_threshold": 0.85,
    "jaccard_threshold": 0.75,
    "template_enabled": true,
    "dedup_enabled": true,
    "ttl_hours": 24
  }
}

Response Headers

Every response includes cache information:

X-Gateway-Cache: HIT          # or MISS
X-Gateway-Similarity: 0.95    # 95% similar (if HIT)
X-Gateway-Time-Saved: 1234ms  # Time saved (if HIT)
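
A client can turn these headers into typed values. The sketch below assumes the header formats shown above (`HIT`/`MISS`, a decimal similarity, and a millisecond value ending in `ms`); `parse_cache_headers` is a hypothetical helper, not part of the gateway.

```python
from typing import Mapping, Optional, TypedDict


class CacheInfo(TypedDict):
    hit: bool
    similarity: Optional[float]
    time_saved_ms: Optional[int]


def parse_cache_headers(headers: Mapping[str, str]) -> CacheInfo:
    # HTTP header names are case-insensitive, so normalise before lookup.
    lower = {k.lower(): v.strip() for k, v in headers.items()}
    similarity = lower.get("x-gateway-similarity")
    saved = lower.get("x-gateway-time-saved")
    if saved and saved.endswith("ms"):
        saved = saved[:-2]  # drop the unit suffix, e.g. "1234ms" -> "1234"
    return {
        "hit": lower.get("x-gateway-cache", "").upper() == "HIT",
        "similarity": float(similarity) if similarity else None,
        "time_saved_ms": int(saved) if saved else None,
    }


info = parse_cache_headers({
    "X-Gateway-Cache": "HIT",
    "X-Gateway-Similarity": "0.95",
    "X-Gateway-Time-Saved": "1234ms",
})
print(info)
```

With the `requests` library, `response.headers` can be passed to this function directly.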

🐛 Troubleshooting

Problem: "Redis connection failed"

Solution: Redis is optional! The gateway will fall back to in-memory cache automatically. For production, add Redis:

Railway: Add Redis from the "New" button
Render: Add Redis from "New" → "Database" → "Redis"
Docker: Already included in docker-compose.yml

Problem: "All upstream providers unavailable"

Cause: You're hitting rate limits on the free tier of your provider (Groq or OpenAI)

Solutions:

Wait 1-2 minutes and try again
Upgrade to paid tier ($0.002/request vs free limits)
Add your own API key with higher limits
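
The advice to wait before retrying fits the Circuit Breaker feature listed under Key Features. The gateway's real thresholds are not documented here, so the sketch below is a generic circuit breaker under assumed parameters: `failure_threshold` and `cooldown_seconds` are hypothetical names and values.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Stop calling a failing provider until a cooldown has passed."""

    def __init__(self, failure_threshold: int, cooldown_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = failure_threshold
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True
        # After the cooldown, let a trial request through (half-open state).
        return self._clock() - self._opened_at >= self._cooldown

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self._threshold:
            self._opened_at = self._clock()  # open (or re-open) the circuit


now = [0.0]  # fake clock so the example runs instantly
breaker = CircuitBreaker(failure_threshold=3, cooldown_seconds=90, clock=lambda: now[0])
for _ in range(3):
    breaker.record_failure()
blocked = breaker.allow_request()   # False: circuit is open
now[0] = 90.0
retry_ok = breaker.allow_request()  # True: cooldown elapsed
print(blocked, retry_ok)
```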

Problem: "Rate limit exceeded"

Cause: Too many requests from one tenant

Solution: Increase rate limits in gateway.yaml:

rate_limiter:
  max_requests: 120  # Increase from 60
  window_minutes: 1
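
To see how `max_requests` and `window_minutes` interact, here is a minimal fixed-window limiter keyed by tenant. The source does not say which algorithm the gateway uses, so treat this as an illustration under that assumption; `FixedWindowLimiter` is a hypothetical name.

```python
import time
from typing import Callable, Dict, Tuple


class FixedWindowLimiter:
    """Allow at most max_requests per tenant in each window."""

    def __init__(self, max_requests: int, window_minutes: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max = max_requests
        self._window = window_minutes * 60.0
        self._clock = clock
        # tenant -> (window start time, requests counted in that window)
        self._counters: Dict[str, Tuple[float, int]] = {}

    def allow(self, tenant: str) -> bool:
        now = self._clock()
        start, count = self._counters.get(tenant, (now, 0))
        if now - start >= self._window:
            start, count = now, 0  # window elapsed: start a new one
        if count >= self._max:
            self._counters[tenant] = (start, count)
            return False
        self._counters[tenant] = (start, count + 1)
        return True


now = [0.0]  # fake clock so the example runs instantly
limiter = FixedWindowLimiter(max_requests=2, window_minutes=1, clock=lambda: now[0])
decisions = [limiter.allow("my-app") for _ in range(3)]
now[0] = 60.0  # one minute later the window resets
decisions.append(limiter.allow("my-app"))
print(decisions)  # [True, True, False, True]
```

Raising `max_requests` lets more requests through per window; each tenant has its own counter.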

Problem: Cache not hitting

Cause: Prompts are too different

Solution: Lower the similarity threshold in gateway.yaml:

cache:
  vector:
    similarity_threshold: 0.75  # Lower from 0.85
  jaccard:
    threshold: 0.65  # Lower from 0.75
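
The effect of lowering the word-overlap threshold can be checked by hand. The sketch below uses the textbook Jaccard index over lowercase word sets; the gateway's actual tokenization is not documented here, so the `jaccard` function is an assumption for illustration.

```python
import re


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity: |A ∩ B| / |A ∪ B| over lowercase word sets."""
    words_a = set(re.findall(r"[a-z0-9]+", a.lower()))
    words_b = set(re.findall(r"[a-z0-9]+", b.lower()))
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


cached = "How do I reset my password?"
incoming = "How can I reset my password?"
score = jaccard(cached, incoming)  # 5 shared words out of 7 distinct words
for threshold in (0.75, 0.65):
    print(threshold, "HIT" if score >= threshold else "MISS")
# 0.75 MISS
# 0.65 HIT
```

A lower threshold raises the hit rate but also the chance of returning a cached answer to a question that only looks similar, so change it in small steps.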

🏗️ Architecture

Your App → AI Gateway → [Cache Check] → Redis
                ↓
            [Cache HIT] → Return cached response (instant, $0)
                ↓
            [Cache MISS] → Call LLM Provider → Cache response → Return
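
The flow in the diagram can be sketched as a cache lookup followed by an upstream call on a miss. This simplified version uses exact-match keys and an in-memory dictionary instead of Redis; `TTLCache`, `handle`, and the `provider` stub are hypothetical names, not the gateway's API.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """In-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key: str, value: str) -> None:
        self._store[key] = (self._clock() + self._ttl, value)


def handle(prompt: str, cache: TTLCache,
           call_provider: Callable[[str], str]) -> Tuple[str, str]:
    """Return (cache_status, response) following the HIT/MISS flow."""
    cached = cache.get(prompt)
    if cached is not None:
        return "HIT", cached
    response = call_provider(prompt)  # MISS: one upstream call
    cache.set(prompt, response)       # store it for later requests
    return "MISS", response


cache = TTLCache(ttl_seconds=24 * 3600)  # mirrors ttl_hours: 24
provider = lambda p: f"answer to: {p}"   # stand-in for the LLM provider
first = handle("What is RAG?", cache, provider)
second = handle("What is RAG?", cache, provider)
print(first[0], second[0])  # MISS HIT
```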

🤝 Contributing

Contributions are welcome! Please:

Fork the repo
Create a feature branch
Make your changes
Submit a pull request

📄 License

MIT License - feel free to use this commercially!

🙋 Support

Issues: GitHub Issues
Discussions: GitHub Discussions
Demo: Live Demo

⭐ Star History

If this project helps you, please give it a star! It helps others find it.

Built with ❤️ for the AI community

Questions? Open an issue and I'll respond within 24 hours.
