Hosting MCP Gateway Registry on AWS ECS: A Practical Blueprint for Enterprise Agentic AI Systems

AI agents are no longer just demo applications that answer questions.

They are slowly becoming systems that can take action: search customer records, update opportunities, generate quotes, create tickets, check inventory, read contracts, trigger workflows, and interact with business applications.

That is where the real enterprise problem begins.

When an AI agent only chats, the risk is limited. But when an agent starts using tools, APIs, and enterprise systems, we need a much stronger operating model. We need to know what the agent can access, who approved that access, what data it can touch, and how we can monitor every action.

This is exactly where an MCP Gateway and Registry becomes important.

The MCP Gateway Registry gives us a central place to register MCP servers, discover available tools, manage authentication, control access, and observe how agents interact with enterprise capabilities.

In this post, I will walk through how to host an MCP Gateway Registry on AWS using ECS Fargate, based on the Terraform AWS ECS deployment model from the MCP Gateway Registry project. The post is based on the repo https://github.com/agentic-community/mcp-gateway-registry/tree/main, and all credit goes to the repo contributors.

Why This Problem Matters

In early AI agent projects, the architecture usually starts simple.

One agent connects to one or two tools.

For example:

Sales Agent
   |
   |-- Salesforce MCP Server
   |-- Knowledge Base MCP Server

This works well for a proof of concept.

But after some time, more teams start building agents.

The sales team wants Salesforce and quote tools.
The support team wants ticketing and knowledge base tools.
The finance team wants billing and contract tools.
The delivery team wants Jira, project reports, and document search tools.
The leadership team wants reporting and analytics agents.

Very quickly, the environment starts looking like this:

Agent 1 ---> MCP Server A
Agent 1 ---> MCP Server B
Agent 2 ---> MCP Server A
Agent 2 ---> MCP Server C
Agent 3 ---> MCP Server D
Agent 4 ---> MCP Server B
Agent 5 ---> MCP Server E

At this stage, the issue is no longer just technical integration.

The real problems are:

Who owns each MCP server?
Which agent is allowed to use which server?
What permissions does each tool have?
How do we prevent duplicate MCP servers?
How do we audit tool usage?
How do we onboard new tools safely?
How do we remove old or risky tools?
How do we monitor failures?
How do we stop agents from accessing sensitive systems without approval?

If we do not solve this early, the MCP layer can become another uncontrolled integration layer.

And in enterprise systems, uncontrolled integration always becomes a risk.

What an MCP Gateway Registry Actually Does

An MCP Gateway Registry acts as a control plane between AI agents and MCP servers.

Instead of letting every agent directly connect to every MCP server, we introduce a managed gateway and registry layer.

The architecture becomes cleaner:

AI Agents / Developers / Applications
              |
              v
      MCP Gateway and Registry
              |
              v
        Approved MCP Servers
              |
              v
      Enterprise Applications

This gives us a much better operating model.

The registry helps maintain information about available MCP servers:

Server name
Owner
Description
Capabilities
Available tools
Security scopes
Environment
Version
Health status
Approval status
Discovery metadata

The gateway helps control and route access:

Authentication
Authorization
Tool discovery
Request routing
Policy enforcement
Logging
Monitoring
Access control

This is important because enterprise agents should not randomly discover and use tools. They should use approved tools with approved scopes through a governed access path.

Why Hosting This on AWS ECS Makes Sense

There are multiple ways to host an MCP Gateway Registry.

You can run it on virtual machines.
You can deploy it on Kubernetes.
You can run it on ECS.
You can even start with a simple Docker Compose deployment for local testing.

But for an enterprise-grade AWS deployment, ECS Fargate is a very practical option.

It gives us a managed container runtime without the operational overhead of managing EC2 worker nodes or a full Kubernetes control plane.

For this type of gateway, ECS Fargate gives a good balance between simplicity and production readiness.

Key benefits include:

No EC2 server management
Container-based deployment
Built-in integration with IAM
Easy logging through CloudWatch
Service-level health checks
Integration with Application Load Balancer
Auto-scaling support
Good fit for Terraform automation
Lower operational complexity than Kubernetes

In my view, unless an organization already has a mature EKS platform and Kubernetes operating model, ECS Fargate is a better first choice for hosting this kind of control-plane service.

Kubernetes gives more flexibility, but it also adds more operational responsibility. For many teams, that is not needed on day one.

Target AWS Architecture

A production-style AWS architecture for MCP Gateway Registry can look like this:

Users / Agents / Developers
          |
          v
Route 53 Custom Domain
          |
          v
CloudFront
          |
          v
AWS WAF
          |
          v
Application Load Balancer
          |
          v
ECS Fargate Services
   |          |           |
Registry   Auth Server   Keycloak
   |          |           |
   |          |           v
   |          |      Aurora PostgreSQL
   |
   v
Amazon DocumentDB

Supporting Services:
- AWS Secrets Manager
- CloudWatch Logs
- CloudWatch Alarms
- ECR
- IAM
- ACM
- Optional Prometheus and Grafana

This is not just about running containers.

This architecture gives us:

Secure external access
Managed container hosting
Central authentication
Registry persistence
Secret management
Observability
Certificate management
Custom domain support
Infrastructure automation

That is the difference between a demo deployment and an enterprise deployment.

Core AWS Components

1. Amazon ECS Fargate

ECS Fargate runs the containerized services.

The deployment can include multiple services such as:

MCP Gateway Registry
Authentication server
Keycloak
MCP gateway service
Sample MCP servers
Sample agents
Observability components

Each service runs as an ECS task.

In production, I would recommend separating these into clear services rather than bundling too much into one container. This gives better control over scaling, logging, deployments, and troubleshooting.

For example:

Registry service       --> Handles MCP server metadata and discovery
Auth service           --> Handles authentication flow
Keycloak service       --> Identity and access management
Sample MCP services    --> Optional, mostly for demo or validation

For production, sample agents and sample MCP servers should be disabled or deployed only in a non-production environment.

2. Application Load Balancer

The Application Load Balancer exposes the ECS services through HTTPS endpoints.

It routes each request to the correct ECS target group.

For example:

/registry  --> Registry service
/auth      --> Auth service
/keycloak  --> Keycloak service

Or, in a cleaner production model:

registry.company.com  --> Registry service
auth.company.com      --> Auth service
kc.company.com        --> Keycloak

This domain-based separation is better for enterprise usage because it improves clarity, security boundaries, and operational ownership.

3. CloudFront

CloudFront can sit in front of the ALB.

For production, this is useful because it gives:

Global edge access
Better TLS handling
Additional protection layer
Integration point for WAF
Cleaner public access pattern
Potential performance benefits

For internal-only deployments, CloudFront may not always be required. But if the registry is accessed by distributed teams, external developers, or cloud-hosted agents, CloudFront becomes useful.

4. AWS WAF

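I would strongly recommend using AWS WAF in front of internet-facing endpoints. In practice, WAF is not a separate network hop: it is a web ACL associated with the CloudFront distribution or the ALB.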

The MCP gateway is a sensitive entry point because it controls access to tools. So it should not be exposed casually.

Useful WAF controls include:

Rate limiting
AWS managed rule groups
IP restrictions
Bot protection
Geo restrictions if required
SQL injection protection
Cross-site scripting protection

This is especially important if agents, developers, or external systems access the gateway over the internet.

5. Route 53 and ACM

Route 53 manages DNS records.

ACM provides SSL/TLS certificates.

This gives us clean URLs such as:

registry.company.com
auth.company.com
kc.company.com

For enterprise adoption, this matters more than people think. Clean domain names make the platform feel like a real internal product rather than a temporary engineering setup.

6. Amazon Aurora PostgreSQL

Aurora PostgreSQL is used for Keycloak data.

Keycloak needs a relational database to store identity-related information, including:

Users
Realms
Clients
Roles
Sessions
Identity provider configuration
Authentication settings

Using Aurora gives better reliability than running a database inside a container.

For production, I would avoid containerized databases for this type of platform. Identity is too important to treat casually.

7. Amazon DocumentDB

DocumentDB is used by the registry layer.

This is where MCP server and agent metadata can be stored.

Example records may include:

MCP server name
MCP server URL
Tool list
Tool descriptions
Security scopes
Server health
Owner team
Environment
Version
Approval state
Risk classification

Over time, this registry becomes the enterprise catalog for agent-accessible capabilities.

This is very valuable.

It allows teams to search and discover what tools already exist instead of rebuilding the same MCP servers again and again.

8. AWS Secrets Manager

Secrets Manager should be used for:

Database credentials
Keycloak admin credentials
JWT secrets
Client secrets
Service credentials
API keys

No production credential should be hardcoded inside Terraform files, Docker images, or environment files stored in Git.

This is basic, but it is often missed in early AI platform projects.

9. CloudWatch Logs and Alarms

Every ECS service should write logs to CloudWatch.

At minimum, we should monitor:

Container startup failures
Authentication failures
Registry API errors
Tool discovery failures
Database connection errors
ECS task restarts
ALB 4xx errors
ALB 5xx errors
High latency
Memory pressure
CPU pressure

But for an MCP gateway, infrastructure logs are not enough.

We also need agent activity logs.

For example:

Which agent requested tool discovery?
Which MCP server was selected?
Which tool was invoked?
Which scope was used?
Was the request allowed or denied?
What was the response status?
How long did the tool call take?
Was sensitive data involved?

This is where the MCP gateway starts becoming a governance system, not just a routing layer.

Deployment Options

The Terraform setup supports different deployment modes.

Option 1: CloudFront Only

This is useful for a quick POC.

You do not need a custom domain. You get a CloudFront-generated URL.

This is suitable for:

Internal demo
Engineering validation
Architecture exploration
Short-term sandbox

This is not my preferred option for production, but it is a good way to start quickly.

Option 2: Custom Domain Only

In this model, Route 53 and ACM are used, but CloudFront is not enabled.

You get URLs like:

registry.company.com
kc.company.com

This is better than a randomly generated URL, but it may not give enough edge protection if exposed publicly.

This can work well for private/internal deployments.

Option 3: CloudFront + Custom Domain

This is the best production model.

Traffic flows like this:

User / Agent
    |
    v
Custom Domain
    |
    v
CloudFront
    |
    v
WAF
    |
    v
Application Load Balancer
    |
    v
ECS Fargate Service

This gives a stronger production posture.

My recommendation:

Use CloudFront + Route 53 + WAF for production.
Use CloudFront-only mode for demos.
Use custom-domain-only mode solely for controlled internal environments.

Practical Deployment Flow

The deployment flow can be divided into clear stages.

Stage 1: Prepare AWS Account

Before starting, we should decide:

AWS region
VPC strategy
Domain name
Environment name
Access model
CIDR restrictions
Secrets strategy
Terraform state backend

For production, I would not deploy this into a random shared AWS account.

Better model:

Separate AWS account for dev
Separate AWS account for staging
Separate AWS account for production

At minimum, use separate environments and separate Terraform state.

Stage 2: Build and Push Images to ECR

The services need to be built as Docker images and pushed to Amazon ECR.

A simplified flow:

export AWS_REGION=us-east-1
make build-push

The result is a set of ECR image URIs.

Example:

123456789012.dkr.ecr.us-east-1.amazonaws.com/mcp-gateway-registry:v1.0.0
123456789012.dkr.ecr.us-east-1.amazonaws.com/mcp-gateway-auth:v1.0.0
123456789012.dkr.ecr.us-east-1.amazonaws.com/mcp-gateway:v1.0.0

For production, avoid using the latest tag.

Use versioned immutable tags.

Bad:

mcp-gateway-registry:latest

Better:

mcp-gateway-registry:v1.0.3

Best:

mcp-gateway-registry:v1.0.3-build-20260524

This helps with rollback, audit, and release traceability.
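
The tagging convention above can be checked in a pipeline before anything is pushed. Below is a minimal Python sketch, assuming the tag format from the examples (v<major>.<minor>.<patch> with an optional -build-<YYYYMMDD> suffix). The function names are my own and are not part of the project.

```python
import re
from datetime import date

# Assumed convention from the examples above: v1.0.3 or v1.0.3-build-20260524.
RELEASE_TAG = re.compile(r"v\d+\.\d+\.\d+(-build-\d{8})?")


def build_tag(version: str, build_date: date) -> str:
    """Combine a semantic version and a build date into one tag."""
    return f"v{version}-build-{build_date:%Y%m%d}"


def is_release_tag(tag: str) -> bool:
    """Accept versioned tags only; mutable tags such as 'latest' fail."""
    return RELEASE_TAG.fullmatch(tag) is not None
```

A check like this only guards the pipeline side. ECR repositories also offer a tag immutability setting, which prevents an existing tag from being overwritten on the registry side.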

Stage 3: Configure Terraform Variables

The terraform.tfvars file is where we configure the deployment.

Important values include:

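# Region for all resources
aws_region = "us-east-1"

# Deployment mode: CloudFront plus a custom Route 53 domain
enable_cloudfront  = true
enable_route53_dns = true

base_domain = "company.com"

# Session cookie settings for the services under the base domain
session_cookie_domain = ".company.com"
session_cookie_secure = true

# Restrict inbound access to known source addresses
ingress_cidr_blocks = [
  "YOUR_OFFICE_IP/32",
  "YOUR_VPN_IP/32"
]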

Database and admin passwords should be handled carefully.

In a strong production model, these should come from a secure secret injection process rather than being manually placed in local files.
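
One way to keep passwords out of terraform.tfvars is to pass them as environment variables, since Terraform reads any input variable named TF_VAR_<name> from the environment. The sketch below is only an illustration: the variable name used in the usage note is hypothetical, and the secret values are assumed to come from whatever secret store the pipeline already uses.

```python
import os
import subprocess
from typing import Dict, Optional


def terraform_env(
    secrets: Dict[str, str],
    base_env: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Return a copy of the environment with each secret exposed as TF_VAR_<name>."""
    env = dict(os.environ if base_env is None else base_env)
    for name, value in secrets.items():
        env[f"TF_VAR_{name}"] = value
    return env


def run_terraform_plan(secrets: Dict[str, str]) -> int:
    """Run 'terraform plan' with the secrets injected; returns the exit code."""
    result = subprocess.run(["terraform", "plan"], env=terraform_env(secrets))
    return result.returncode
```

A call such as run_terraform_plan({"keycloak_admin_password": value}) would then supply a Terraform variable of that (hypothetical) name without writing it to disk. Values passed this way can still end up in Terraform state, which is one more reason to keep state remote and encrypted.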

Stage 4: Initialize Terraform

Run:

terraform init -upgrade

For production, Terraform state should be stored remotely.

Recommended backend:

S3 bucket for state
DynamoDB table for locking (newer Terraform versions can instead use S3-native lock files)
KMS encryption
Restricted IAM access

Do not use local state for production.

Local state is acceptable for learning, but not for enterprise infrastructure.

Stage 5: Create Certificates First

ACM certificates often require DNS validation.

That is why the deployment may need a first targeted apply for certificates.

Conceptually:

terraform apply \
  -target=aws_acm_certificate.keycloak \
  -target=aws_acm_certificate.registry \
  -target=aws_acm_certificate_validation.keycloak \
  -target=aws_acm_certificate_validation.registry

This allows certificates to be created and validated before the rest of the infrastructure depends on them.

Stage 6: Deploy Full Infrastructure

After certificate validation:

terraform apply

This deploys the full stack:

Networking
Security groups
ECS cluster
ECS services
ALB
Target groups
CloudFront
Route 53 records
Aurora PostgreSQL
DocumentDB
Secrets
CloudWatch logs
IAM roles
Optional observability stack

At this point, the infrastructure is created, but the application may still need initialization.

Stage 7: Run Post-Deployment Setup

Post-deployment setup is very important.

This step usually performs:

Terraform output extraction
DNS validation
ECS service health checks
Keycloak realm setup
Client setup
Admin user setup
DocumentDB collection initialization
Registry indexes
Scope setup
Service restart
Endpoint validation

This step converts infrastructure into a usable platform.

Without this, the containers may be running, but the gateway may not be fully ready.
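
The endpoint validation part of this step can be sketched as a small smoke check. This is a minimal example using only the Python standard library; the URLs in the usage note are hypothetical, and "healthy" is assumed to mean any 2xx response.

```python
from typing import Callable, Iterable, List, Optional
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def http_status(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status code for url, or None if it cannot be reached."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as error:  # the server answered with a 4xx/5xx status
        return error.code
    except (URLError, OSError):  # DNS failure, refused connection, timeout
        return None


def failing_endpoints(
    urls: Iterable[str],
    fetch: Callable[[str], Optional[int]] = http_status,
) -> List[str]:
    """Return the URLs that did not answer with a 2xx status."""
    failures = []
    for url in urls:
        status = fetch(url)
        if status is None or not 200 <= status < 300:
            failures.append(url)
    return failures
```

For example, failing_endpoints(["https://registry.company.com/health", "https://kc.company.com/health"]) would return the endpoints that still need attention. The fetch parameter exists so the check can be tested without network access.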

How the Gateway Should Be Used After Hosting

Once deployed, teams can start registering MCP servers.

A good MCP server registration should include:

Server name
Business capability
Owner team
Technical owner
Environment
Base URL
Supported tools
Required scopes
Risk level
Data classification
Health check endpoint
Approval status
Version

For example:

Name: Salesforce Opportunity MCP Server
Owner: Sales Platform Team
Environment: Production
Tools:
- searchOpportunity
- updateOpportunityStage
- getAccountDetails
Scopes:
- salesforce.read
- salesforce.opportunity.update
Risk: High
Data: Customer and revenue data
Approval: Required

This level of metadata is important.

Without it, the registry becomes just another technical catalog. With it, the registry becomes a real enterprise control plane.
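
To make this concrete, the registration metadata can be modeled as a small record with a validation step that runs before a server is accepted. This is a sketch under my own assumptions: the field names and rules are illustrative and do not reflect the project's actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class McpServerRegistration:
    """Hypothetical registration record; field names are illustrative only."""
    name: str
    owner_team: str
    environment: str
    risk_level: str
    tools: List[str] = field(default_factory=list)
    scopes: List[str] = field(default_factory=list)
    approved: bool = False


def registration_problems(reg: McpServerRegistration) -> List[str]:
    """List the reasons a registration should be rejected (empty means OK)."""
    problems = []
    if not reg.owner_team.strip():
        problems.append("owner team is missing")
    if not reg.tools:
        problems.append("no tools declared")
    if not reg.scopes:
        problems.append("no scopes declared")
    if reg.risk_level.lower() == "high" and not reg.approved:
        problems.append("high-risk server has no approval")
    return problems
```

With the Salesforce example above, a high-risk registration without approval would be returned with one problem and could be held back until the review is complete.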

Enterprise Governance Model

For enterprise usage, I would define a clear lifecycle for MCP servers.

Suggested MCP Server Lifecycle

Draft
   |
Submitted for Review
   |
Security Review
   |
Approved for Dev
   |
Approved for Production
   |
Monitored
   |
Deprecated
   |
Retired

Every MCP server should have an owner.

Every high-risk tool should have approval.

Every production MCP server should have monitoring.

Every deprecated server should have a retirement date.

This may sound heavy, but it is necessary once agents start touching real systems.
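
The lifecycle above can be enforced with a very small state check. The sketch below assumes a strictly linear lifecycle in which a server may only move to the next stage; a real workflow would probably also allow a review to send a server back to Draft.

```python
# Stages in the order suggested above.
LIFECYCLE = [
    "Draft",
    "Submitted for Review",
    "Security Review",
    "Approved for Dev",
    "Approved for Production",
    "Monitored",
    "Deprecated",
    "Retired",
]


def can_transition(current: str, target: str) -> bool:
    """Allow only a move to the immediately following lifecycle stage."""
    if current not in LIFECYCLE or target not in LIFECYCLE:
        return False
    return LIFECYCLE.index(target) == LIFECYCLE.index(current) + 1
```

The point of the check is that nothing reaches "Approved for Production" without passing through "Security Review" first.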

Access Control Model

The gateway should not allow all agents to use all MCP servers.

That is a weak design.

A better model is scope-based access.

Example:

Agent: Sales Copilot
Allowed scopes:
- salesforce.read
- quote.read
- product.search

Not allowed:
- discount.approve
- contract.delete
- customer.export

Another example:

Agent: Deal Desk Agent
Allowed scopes:
- quote.read
- quote.update
- discount.request
- contract.read

Requires approval:
- discount.approve
- final_quote.submit

This is how we prevent agents from becoming over-permissioned.

One of the biggest risks in agentic AI systems will be excessive tool permission. If we give one agent too many tools and too much authority, it becomes hard to control behavior and impact.
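
The two examples above can be expressed as a policy table with a default-deny decision function. This is a minimal sketch: the agent keys and the shape of the table are my own assumptions, not the gateway's actual policy format.

```python
from typing import Dict, Set

# Policy table mirroring the two examples above; agent keys are illustrative.
POLICIES: Dict[str, Dict[str, Set[str]]] = {
    "sales-copilot": {
        "allowed": {"salesforce.read", "quote.read", "product.search"},
        "needs_approval": set(),
    },
    "deal-desk-agent": {
        "allowed": {"quote.read", "quote.update", "discount.request", "contract.read"},
        "needs_approval": {"discount.approve", "final_quote.submit"},
    },
}


def decide(agent: str, scope: str) -> str:
    """Return 'allow', 'needs_approval', or 'deny' (the default)."""
    policy = POLICIES.get(agent)
    if policy is None:
        return "deny"  # unknown agents get nothing
    if scope in policy["allowed"]:
        return "allow"
    if scope in policy["needs_approval"]:
        return "needs_approval"
    return "deny"  # anything not listed is denied
```

The design choice that matters is the default: a scope that is not explicitly listed is denied, so a new tool never becomes available to an agent by accident.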

Observability for Agentic Systems

Traditional application monitoring is not enough here.

We need both system observability and agent observability.

System Observability

Track:

CPU
Memory
Container restarts
Task failures
ALB errors
Request latency
Database connections
Authentication errors

Agent and Tool Observability

Track:

Agent ID
User ID
Tool requested
MCP server used
Scope used
Decision outcome
Policy result
Execution latency
Failure reason
Data classification
External system touched

For example, a useful audit log may look like this:

{
  "agent": "sales-copilot",
  "user": "john@company.com",
  "mcp_server": "salesforce-opportunity-server",
  "tool": "updateOpportunityStage",
  "scope": "salesforce.opportunity.update",
  "decision": "allowed",
  "timestamp": "2026-05-24T10:15:00Z",
  "latency_ms": 450,
  "status": "success"
}

This type of logging becomes extremely important when something goes wrong.

If an agent updates the wrong opportunity or calls a pricing tool incorrectly, we should be able to reconstruct exactly what happened.
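
A record like the one above can be produced by a small helper that the gateway calls after every tool invocation. This is a sketch only: the function name and signature are hypothetical, and the field names simply follow the example log.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def audit_record(
    agent: str,
    user: str,
    mcp_server: str,
    tool: str,
    scope: str,
    decision: str,
    latency_ms: int,
    status: str,
    now: Optional[datetime] = None,
) -> str:
    """Serialize one tool call as a single JSON line with a UTC timestamp."""
    moment = now if now is not None else datetime.now(timezone.utc)
    # 'now' is assumed to be timezone-aware; it is normalized to UTC here.
    timestamp = moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return json.dumps({
        "agent": agent,
        "user": user,
        "mcp_server": mcp_server,
        "tool": tool,
        "scope": scope,
        "decision": decision,
        "timestamp": timestamp,
        "latency_ms": latency_ms,
        "status": status,
    })
```

Writing each record as one JSON line to standard output is usually enough on ECS, because the container log driver forwards it to CloudWatch Logs, where the fields can be queried.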

CI/CD Model

For production, deployment should not be manual.

A good CI/CD pipeline should look like this:

Developer raises PR
        |
Code review
        |
Build Docker images
        |
Run unit tests
        |
Run container security scan
        |
Push image to ECR
        |
Terraform plan
        |
Manual approval for production
        |
Terraform apply
        |
Run post-deployment setup
        |
Smoke test
        |
Notify platform team

This keeps the deployment controlled and auditable.

For rollback, the team should be able to redeploy a previous image tag quickly.

Recommended Environment Strategy

I would recommend at least three environments.

Development
Staging
Production

Development

Used for engineering testing.

Can have relaxed settings.

Sample MCP servers allowed
Lower database capacity
CloudFront-only mode acceptable
Limited monitoring

Staging

Used for pre-production validation.

Should be close to production.

Custom domain
WAF enabled
Production-like IAM
Production-like secrets
Observability enabled

Production

Used for real enterprise agents.

Should be hardened.

Separate AWS account
CloudFront + WAF
Private subnets
Strict ingress
Immutable images
Centralized logs
Audit trail
Backup enabled
Approval workflow

Production Hardening Checklist

Before calling this production-ready, I would validate the following.

Remote Terraform state enabled
Terraform state encrypted
DynamoDB locking enabled
Separate AWS accounts or environments
Secrets stored in Secrets Manager
No secrets in Git
CloudFront enabled
WAF enabled
Ingress restricted
Keycloak admin access restricted
ECS tasks in private subnets
ALB security groups reviewed
Aurora backups enabled
DocumentDB backups enabled
CloudWatch alarms configured
Container image scanning enabled
Immutable image tags used
IAM least privilege applied
Audit logging enabled
MCP server ownership defined
Tool scopes defined
Production approval process defined
Runbook created
Rollback process tested

The most common mistake is to stop after the Terraform deployment succeeds.

That only means infrastructure exists.

It does not mean the platform is secure, governed, observable, or ready for production.

Operational Runbook

For a serious enterprise setup, the platform team should maintain a simple runbook.

The runbook should answer:

How do we onboard a new MCP server?
How do we approve a production MCP server?
How do we revoke access?
How do we rotate secrets?
How do we check service health?
How do we debug registry failures?
How do we debug authentication failures?
How do we roll back a release?
How do we retire an old MCP server?
How do we investigate suspicious tool usage?

This is where platform maturity comes in.

An MCP gateway is not a one-time deployment. It becomes part of the agentic AI platform.

Where This Fits in an Enterprise Agent Architecture

In a broader enterprise agentic AI architecture, the MCP Gateway Registry sits between orchestration and enterprise tools.

A practical model:

User Interface
      |
      v
Agent Orchestrator
      |
      v
Policy / Guardrail Layer
      |
      v
MCP Gateway Registry
      |
      v
MCP Servers
      |
      v
Enterprise Systems

The orchestrator decides what needs to be done.

The policy layer checks whether the action is allowed.

The MCP gateway provides controlled tool discovery and access.

The MCP server performs the actual system interaction.

This separation is important.

Do not put all responsibilities into one big agent.

That becomes hard to scale, hard to debug, and dangerous to govern.

My Practical Recommendation

For a real enterprise deployment, I would host the MCP Gateway Registry with this setup:

AWS ECS Fargate for services
CloudFront in front
AWS WAF enabled
Route 53 custom domains
ACM certificates
Application Load Balancer
Private subnets for ECS tasks
Aurora PostgreSQL for Keycloak
DocumentDB for registry metadata
Secrets Manager for credentials
CloudWatch for logs and alarms
Optional Grafana and Prometheus for deeper observability
S3 backend for Terraform state
DynamoDB for Terraform locking
CI/CD for image build and deployment
Immutable ECR image tags
Strict admin access
Scope-based authorization
Audit logs for all tool usage

For a POC, I would keep it simple.

For production, I would not compromise on security, logging, and access control.

Key Lessons Learned

The biggest lesson is this:

Hosting the MCP Gateway Registry is not only an infrastructure activity. It is the beginning of an operating model for enterprise agents.

If agents are going to use real tools, then organizations need:

Tool ownership
Tool approval
Tool discovery
Tool scopes
Tool observability
Tool lifecycle management
Tool risk classification

Without this, agentic AI systems may work technically but fail operationally.

And in enterprises, operational failure is usually what blocks adoption.

Final Thought

MCP is making tool integration more standard for AI agents. That is a very important shift.

But standardization also creates scale.

And once we scale the number of agents and tools, we need governance.

That is why an MCP Gateway Registry should be treated as a core platform capability, not as a side component.

It gives engineering teams a structured way to expose tools.
It gives security teams a way to control access.
It gives platform teams a way to monitor usage.
It gives business teams more confidence that agents are not directly and blindly touching enterprise systems.

In my view, this is one of the important building blocks for production-grade agentic AI systems.

The future will not be one agent directly connected to many tools.

The future will be governed agent ecosystems, where tools are registered, discoverable, monitored, secured, and lifecycle-managed through a central control plane.
