


















In early 2023, Slack faced a foundational challenge: serving Large Language Models (LLMs) at enterprise scale with the security, reliability, and performance our customers expect. Over three years, we evolved from basic infrastructure to orchestrating a sophisticated multi-cloud architecture. We didn’t just want shiny new models; we needed a system resilient to regional outages and GPU scarcity. Our journey moved through four distinct phases, shifting from reactive infrastructure management to proactive, multi-vendor orchestration.
When we built the initial stages of Slack AI, Amazon SageMaker was the natural starting point. It was a managed ML serving platform that offered the key things we were looking for: security, FedRAMP compliance, model availability, and control. We were able to leverage a sophisticated escrow virtual private cloud (VPC) strategy to establish a strict zero-knowledge environment: our data remained private to Slack, and the provider’s proprietary model weights remained inaccessible to us.
To maximize uptime for a global user base, we deployed these containers across multiple AWS regions. This required our teams to manage the operational lifecycle, including cross-region IAM roles, balanced routing across model endpoints, proactive capacity planning, and auto-scaling logic.
While SageMaker provided the necessary security, the overhead was immense. We faced three primary taxes:
By early 2024, we mitigated these via On-Demand Capacity Reservations (ODCR) and proactive, cron-based scaling. However, this reinforced a hard truth: we were spending too many engineering cycles on plumbing. To scale, we needed automated capacity, not manual coordination.
As the AI ecosystem and feature usage accelerated, newer and higher quality models emerged quickly. While we were maintaining a custom serving solution on SageMaker, AWS was heavily prioritizing Amazon Bedrock, its purpose-built managed LLM service.
Hosting Anthropic models via an escrow VPC led to a “catch-up” cycle. Model iterations and optimizations often debuted on Bedrock weeks or months before SageMaker availability. For Slack, where staying at the bleeding edge of model quality is a competitive necessity, this gap became a significant driver for our next architectural evolution.
By mid-2024, Amazon Bedrock had matured significantly. It had achieved FedRAMP Moderate compliance and promised the same security posture that we required. The decision to migrate was a strategic pivot, as it offered three immediate advantages:
In the Bedrock ecosystem, capacity is measured in Model Units (MUs). Each MU provides a deterministic amount of throughput, measured in tokens per minute. Shifting from GPU instances to MUs allowed us to abstract away the hardware and focus entirely on raw throughput. To minimize migration risk, we prioritized provisioned throughput infrastructure first, leaving on-demand infrastructure as a fast follow.
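To make the shift from instances to throughput concrete, here is a minimal sizing sketch. The function name, the 20% headroom default, and the per-MU throughput figure in the usage note are illustrative assumptions; the actual tokens-per-minute rating of an MU depends on the model and is not stated here.

```python
def model_units_needed(peak_tokens_per_minute: int,
                       tokens_per_minute_per_mu: int,
                       headroom_percent: int = 20) -> int:
    """Return how many MUs cover the peak load plus a safety margin.

    All inputs are assumptions supplied by the caller; integer math is used
    so the ceiling division is exact.
    """
    if tokens_per_minute_per_mu <= 0:
        raise ValueError("tokens_per_minute_per_mu must be positive")
    if peak_tokens_per_minute < 0 or headroom_percent < 0:
        raise ValueError("peak load and headroom must be non-negative")
    # Scale by (100 + headroom) / 100, then round up to whole MUs.
    numerator = peak_tokens_per_minute * (100 + headroom_percent)
    denominator = 100 * tokens_per_minute_per_mu
    return -(-numerator // denominator)
```

For example, with a hypothetical rating of 100,000 tokens per minute per MU, a peak of 1,000,000 tokens per minute and 20% headroom works out to 12 MUs.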
We executed the transition through a multi-stage migration strategy:
The migration to Bedrock delivered immediate, compounding wins for our engineering teams and our customers:
This solidified a core Slack AI engineering principle: measure first, migrate gradually, and monitor continuously.
While Provisioned Throughput was a massive leap forward for predictable, consistent workloads, it wasn’t well suited to all of our workloads. We encountered two primary efficiency hurdles:
These challenges led us to our next evolution: finding a way to balance the reliability of provisioned capacity with the economic and technical flexibility of On-Demand scaling.
With high confidence in Bedrock and mature monitoring, we moved to close the final efficiency and quality gap. Historical analysis revealed that feature usage fluctuated with business hours, leaving some idle capacity overnight.
Rather than maintaining a static footprint for 24/7 peak capacity, moving to on-demand infrastructure allowed us to solve the idle capacity problem. It gave us the architectural agility to support highly variable workloads without the friction of manual over-provisioning. For features with a 10x variance between peak and off-peak hours, the efficiency gains were substantial. More importantly, it removed the technical bottleneck we faced in Phase 2: because we were no longer locked into multi-month commitments, we regained the freedom to migrate features to different models. This meant that as soon as a more performant model dropped and passed our internal quality and metrics bars, we could pivot our infrastructure to support it within a day, rather than waiting months for a contract to expire.
We didn’t simply flip a switch and move everything to On-Demand. To balance efficiency with a premium user experience, we implemented a Hybrid Routing strategy. We kept high-volume, latency-sensitive features on dedicated capacity (Provisioned Throughput) to ensure a consistent “snappy” feel. Simultaneously, we moved asynchronous, bursty workloads – like nightly Recaps – to On-Demand capacity. To bridge the gap, we engineered a Spillover Pattern: if a sudden surge pushed us beyond our reserved limits, excess requests automatically “spilled over” to on-demand endpoints, ensuring we never dropped a request due to capacity ceilings.
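The Spillover Pattern described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the production implementation: `ProvisionedPool`, `choose_endpoint`, and the per-minute token budget are hypothetical names, and a real router would also reset the counter each minute and account for actual rather than estimated token usage.

```python
from dataclasses import dataclass


@dataclass
class ProvisionedPool:
    """Tracks token usage against a reserved per-minute budget (hypothetical)."""
    tokens_per_minute_limit: int
    tokens_used_this_minute: int = 0

    def try_reserve(self, tokens: int) -> bool:
        """Reserve tokens if they fit in the remaining budget."""
        if self.tokens_used_this_minute + tokens > self.tokens_per_minute_limit:
            return False
        self.tokens_used_this_minute += tokens
        return True


def choose_endpoint(pool: ProvisionedPool, estimated_tokens: int) -> str:
    """Prefer reserved capacity; spill over to on-demand when it is exhausted."""
    if pool.try_reserve(estimated_tokens):
        return "provisioned"
    return "on_demand"
```

A request that does not fit in the reserved budget is routed to on-demand instead of being rejected, while smaller requests that still fit continue to use provisioned capacity.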
Shifting to On-Demand traded rigid pre-planning for architectural agility, eliminating manual capacity management. By utilizing Bedrock’s ability to route across different US regions based on real-time availability, we were able to find capacity dynamically while adhering to our regional data boundaries. However, this flexibility introduced a new set of variables that we had to solve for:
To mitigate these risks, we didn’t just accept the trade-offs – we built a more intelligent AI Platform abstraction. We developed a model hierarchy for every AI feature, allowing our system to automatically fall back to different models if the primary model reached a degraded state. Examples of degradation include elevated time-to-first-token latency, throttling errors, and a downward trend in customer feedback.
This hierarchy was a game-changer for model quality and reliability. If a specific model was underperforming or hitting limits in one region, the platform would reroute the request in real-time to another healthy endpoint. From the customer’s perspective, the experience remained seamless; they continued to receive high-quality results without ever knowing a complex failover had occurred behind the scenes.
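A per-feature model hierarchy with automatic fallback might look something like the sketch below. The names `ModelDegraded` and `call_with_fallback` are hypothetical, and each model client is represented as a plain callable; a real platform would derive the degraded state from health signals such as latency and throttling rather than from a single exception.

```python
from typing import Callable, List, Sequence, Tuple


class ModelDegraded(Exception):
    """Raised by a (hypothetical) model client that is throttled or unhealthy."""


def call_with_fallback(
    prompt: str,
    hierarchy: Sequence[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Try each (name, client) in priority order; return (name, response)."""
    errors: List[str] = []
    for name, client in hierarchy:
        try:
            return name, client(prompt)
        except ModelDegraded as exc:
            # Record the failure and move on to the next model in the hierarchy.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all models in the hierarchy failed: " + "; ".join(errors))
```

The caller only sees a response and the name of the model that produced it, which keeps the failover invisible to the feature code.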
While this internal fallback logic significantly increased our service resilience, it also highlighted two strategic gaps. First, no matter how many failovers we engineered within a single cloud, we remained susceptible to any potential provider-wide outage. Second, the AI landscape is moving with incredible velocity and remains highly fragmented. The state-of-the-art model for a specific task – whether it’s summarization, reasoning, or high-speed extraction – can change in a matter of weeks, and these leading models are often exclusive to specific cloud providers. Relying on any single vendor meant we might be artificially limiting our access to the highest-quality technology available. To ensure Slack AI always provides the best possible experience, we need the flexibility to go wherever the best models are while simultaneously meeting our security, compliance, and privacy standards.
As Slack AI scaled to millions of users, we realized that true enterprise-grade reliability and a “best-of-breed” model strategy required looking beyond any single provider. This realization was the primary catalyst for our latest evolution: the move to a Multi-Cloud architecture.
We recognized that providing a world-class AI experience required the best of every ecosystem. By early 2026, we officially expanded our footprint to include Google Cloud Platform (GCP) Vertex AI, not just as a failover for redundancy, but as a strategic engine to accelerate product innovation through access to a broader catalog of state-of-the-art models. Our goal is simple: ensure Slack remains the most intelligent place to get work done. This move wasn’t made for the sake of complexity; it was a strategic shift driven by four key factors:
Building a production-ready GCP integration was a massive cross-functional effort. It required tight synchronization across teams such as Security, Risk and Compliance, Trust and Integrity, AI Quality, Legal, and Cloud Providers to ensure our data boundaries remained ironclad across the board. Expanding to GCP Vertex AI turned our infrastructure into a strategic engine for product innovation. Rather than being limited to any single provider’s catalog, we can now granularly match specific features to the models best suited for them – balancing factors like context window, latency, and reasoning capabilities. To make this a reality, we solved cold start engineering hurdles by implementing secretless authentication and an API Normalization layer that translates disparate provider signals into a unified language for our application logic.
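As a rough illustration of what an API normalization layer does, the sketch below maps provider-specific error codes onto one shared vocabulary. Everything in it is hypothetical: `provider_a`, `provider_b`, and their raw codes are placeholders, not the real error names returned by any cloud provider.

```python
from enum import Enum
from typing import Dict, Optional


class Signal(Enum):
    """Unified outcome vocabulary consumed by application logic."""
    OK = "ok"
    THROTTLED = "throttled"
    UNAVAILABLE = "unavailable"
    UNKNOWN_ERROR = "unknown_error"


# Hypothetical raw codes per provider; real providers use their own names.
_CODE_MAP: Dict[str, Dict[str, Signal]] = {
    "provider_a": {
        "THROTTLED_EXC": Signal.THROTTLED,
        "SERVICE_DOWN": Signal.UNAVAILABLE,
    },
    "provider_b": {
        "quota_exceeded": Signal.THROTTLED,
        "backend_unavailable": Signal.UNAVAILABLE,
    },
}


def normalize(provider: str, raw_code: Optional[str]) -> Signal:
    """Translate a provider-specific code into a unified Signal."""
    if raw_code is None:
        return Signal.OK
    # Unmapped providers or codes degrade to a generic error, not a crash.
    return _CODE_MAP.get(provider, {}).get(raw_code, Signal.UNKNOWN_ERROR)
```

With a mapping like this, routing and retry logic can branch on `Signal.THROTTLED` without knowing which provider produced it.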
The core technical challenge was building a system that abstracted away provider complexity. By enhancing our abstraction layer into an Intelligent Routing Layer, we ensured that users receive the fastest, highest-quality response available. If one model or provider slows down, the system instantly reroutes the request to a better-performing alternative, making the underlying complexity completely invisible to the user while maintaining a seamless experience. It contains:
Running a multi-cloud footprint at our scale is a major technical undertaking. It’s a conscious trade-off: we gain immense flexibility, but managing our systems requires a much more sophisticated approach:
While multi-cloud increases operational overhead, the trade-off is a superior service. We have removed single points of failure, improved quality benchmarks by matching features to specific model strengths, and gained the strategic leverage to adopt new innovations the moment they hit the market.
We arrived at a multi-cloud architecture not for the sake of complexity, but to enhance Slack’s standards for product innovation and reliability. Looking back, five themes stand out:
The biggest hurdles in scaling AI aren’t just technical; they also include legal, risk, compliance, and security related tasks. Achieving deep alignment between these teams and engineering is what allowed us to scale to millions of users without compromising our trust standards.
As seen in our Phase 2 move, the most critical decision wasn’t which model to use, but how we built the logic around it. Agility and speed to market are our primary competitive edge.
Managed services mature monthly. Because we remained provider-agnostic, we can now adopt breakthroughs in latency or reasoning without a total rewrite.
Internal failovers aren’t enough. Our move in Phase 4 to a multi-provider stack ensures Slack stays online even during any potential platform-wide cloud disruption.
An LLM service that is “up” but slow is effectively broken. By treating signals such as p90 latency spikes and negative feedback trends as soft failures, our routing layer ensures users have a snappy experience.
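One way to treat a p90 latency spike as a soft failure is a sliding-window monitor like the sketch below. The class name, window size, and threshold are assumptions chosen for illustration, and the percentile uses the simple nearest-rank method; a production router would likely combine this with other signals.

```python
from collections import deque


class LatencyMonitor:
    """Flags an endpoint as soft-failed when its recent p90 latency is too high."""

    def __init__(self, p90_threshold_ms: float, window: int = 100):
        self.p90_threshold_ms = p90_threshold_ms
        # Keep only the most recent `window` samples.
        self.samples = deque(maxlen=window)

    def record(self, latency_ms: float) -> None:
        self.samples.append(latency_ms)

    def p90(self) -> float:
        """Nearest-rank 90th percentile of the current window."""
        if not self.samples:
            raise ValueError("no latency samples recorded")
        ordered = sorted(self.samples)
        rank = (9 * len(ordered) + 9) // 10  # integer ceil(0.9 * n)
        return ordered[rank - 1]

    def is_soft_failure(self) -> bool:
        """True when the endpoint responds, but too slowly to count as healthy."""
        return bool(self.samples) and self.p90() > self.p90_threshold_ms
```

A router could check `is_soft_failure()` before each request and prefer another endpoint while the flag is set, even though the slow endpoint is still returning successful responses.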
The future of enterprise AI is multi-cloud, multi-model, and dynamically orchestrated. By prioritizing portability and staying close to the market, we haven’t just built a way to use AI – we’ve built a platform that harnesses the best the industry has to offer the moment it arrives. We’re looking forward to seeing what we build next!
Interested in taking on interesting projects, making people’s work lives easier, or just building some pretty cool forms? We’re hiring! 💼