





















“What goes around comes around; every challenge demands its own solution.” — Adapted from Guiguzi, an ancient Chinese philosopher
To be honest, the WaferLLM project has indeed come to an end. It was accepted at OSDI, but rather than pure joy, my feelings are more complex. As a skeptic, I still don’t believe that LLMs and the Transformer architecture are the ultimate answer, nor do I think the scaling law curve can continue indefinitely. Standing at this technological inflection point, we need to stay clear-headed more than ever—complacency means stagnation.
The current AI development is like climbing a peak that no one has ever conquered. The path ahead is filled with both unknown challenges and unique breakthrough possibilities. This understanding led me to choose this quote from “Guiguzi” at the beginning:
“Transformation follows circular patterns, each with its own dynamics, repeatedly seeking balance, adapting strategies to circumstances.“
This ancient wisdom about technological evolution perfectly mirrors our research: technological development is like interlocking gears, facing different phases of changing dynamics, requiring continuous exploration of fundamental principles and strategic adjustments based on actual circumstances. WaferLLM’s design philosophy perfectly embodies this concept of the “circle.”
So what is Wafer?
Simply put, Wafer-Scale Chip is a technology that uses an entire wafer as a single integrated circuit, rather than cutting the wafer into multiple smaller chips.
Looking at the evolution of chip area, from single-core CPUs to multi-core CPUs, and then to specialized accelerators like GPUs and TPUs, what appears as stacking of computing units is actually an ongoing battle between computational demands and physical limitations. As Moore’s Law gradually fails and single-chip area is locked in the 400-800 square millimeter range, this contradiction becomes increasingly acute under the dual constraints of lithography technology and yield rates.
Wafer-scale integration technology breaks through this deadlock, increasing chip area by two orders of magnitude. Take the Cerebras WSE-2 as an example: with its ultra-large size of 215mm×215mm (46,225 square millimeters), it achieves a hundred-fold area breakthrough compared to traditional GPU chips, integrating 2.6 trillion transistors and 850,000 AI computing cores, breaking through the “area bottleneck.”
From Table 1 in our paper, we can see that wafer-scale chips have several significant advantages over traditional system-level packaging:

In short, we’re trying to follow the traditional multi-core approach and further increase chip area. The above talking points actually come from Wikipedia and other easily accessible resources. It seems like we’re just using better chips and reaping hardware benefits.
But is that really the case?
But if we’ve designed chips with larger areas that release more computing power, what’s the cost? After all, there’s no free lunch. Behind every breakthrough lies new constraints.
The cost here is: most of our previous algorithm designs will no longer be applicable.
First, let’s look at how we previously programmed and designed models.
In traditional chip structures, whether CPU or GPU, for a single accelerator, our design approach is closer to Uniform. When designing, we concentrate computing cores together, and with the distance from computing cores, we set up multi-level caches for more efficient memory access.


Now, turning from hardware to models. Let’s use the simplest DNN example, MNIST digit1 recognition :
1 | |
We’ve probably seen model structure diagrams like this countless times 2,

When we program and design models, our thinking objects are operators and modules, and we practically never consider data placement issues. We don’t think about where each matrix, h1, h2 is placed on the GPU. Most of the time when we think, we only care about what data is in GPU RAM. So when models with special computation-to-memory access ratios appear, like the decode phase in Llama architecture LLM models.
Let’s bring out the LLM decoding code again, using a numpy version of a simple decoding module as an example, referencing the llama3.np project . We can see codes3 like :
1 | |
In our model design process, we also rarely consider data placement for each part.
More specifically, for example:
xq, xk, xv, our norm_x actually needs to be kept as close to the computing units as possiblez3, placing it directly in the original location of z1 seems to bring many improvementsOf course, there are more optimization points, which I won’t elaborate on here. Our current solution, as you’ve probably guessed, is Triton. Indeed, we can customize very high-performance CUDA operators and manually manage the complex multi-level memory just mentioned. But Triton provides limited but sufficient programming abstractions, solving what we care about most and what has the most important impact on performance, namely L1 Cache behavior. For LLMs, this step is undoubtedly the most critical for performance, and the Flash Attention series of work discovered the performance bottleneck here and proposed high-performance optimizations.
On GPUs, we’re already unable to handle complex memory management. So when it comes to LLM inference on Wafer, the main challenge is that chip memory becomes even more complex, making the problem even harder to solve. Let’s first look at the structure of the Cerebras chip 4.

Cerebras’s Wafer Scale Engine (WSE) adopts a unique 2D mesh architecture design, where the entire chip consists of tens of thousands of processing elements (PEs) arranged in a grid pattern on a single silicon wafer. Each PE contains three key components: a compute kernel responsible for actual computation, a small local memory for data storage, and a router for handling communication needs. These PEs are interconnected through a high-performance Network-on-Chip, forming a high-bandwidth, low-latency communication network that enables efficient data flow between different processing units.
So we can see that in such a complex memory structure, even how to cut and arrange model weights becomes a question worth exploring. Not to mention the more complex KV Cache management.
However, in our WaferLLM work, we proposed a complete and feasible solution to all the above problems.
On a Wafer Chip, how many PEs should we use to compute each operator in the model? To write code on a Wafer Chip, the first question we think about is how to map. If there are too few PEs, our parallelism is insufficient and computation becomes the main bottleneck; if we use too many PEs, not only does the number of communications increase, but the communication distance also grows, introducing more communication overhead. Due to current architecture limitations, computation and communication cannot be 100% completely hidden, so the total time overhead has an approximately linear relationship with both. That is, we need to ensure that neither computational overhead nor communication overhead becomes too high.
Therefore, before proposing our algorithms and solutions, we need more precise metrics to guide and evaluate our algorithm design. To address the unique hardware characteristics of Wafer Chips, we proposed the PLMR model, which keenly captures the four attributes we care about most.
PLMR is an acronym for four key hardware attributes:
The PLMR model was proposed based on an in-depth analysis of why existing AI systems cannot fully utilize performance on Wafer Chips. Using the PLMR model, we can analyze why existing AI systems struggle to fully utilize Wafer Chips:
The PLMR model provides the following guidance for designing wafer-scale LLM systems:
For LLMs to work on accelerators like Wafer Chips, we need to fully leverage the parallel capabilities of numerous PEs. In both Prefill and Decode phases, we face various challenges.
First, in the Prefill phase, we extensively use GEMM, i.e., matrix-matrix multiplication. Traditional matrix multiplication cannot satisfy PLMR constraints.

To determine a scalable distributed GEMM suitable for the PLMR model, we defined the following metrics:
Analyzing existing distributed GEMM methods and showing how MeshGEMM satisfies these metrics:
Our design involves two key steps:
Cyclic shifting enables MeshGEMM to satisfy the M and R attributes by limiting communication to two neighbors and minimizing memory usage. It ensures the correctness of GEMM results, following a data movement scheme similar to Cannon.
For communication, we want to further minimize critical path length to satisfy the L attribute. Our key idea is to introduce the INTERLEAVE operation to find the logical-to-physical mapping relationship.
The INTERLEAVE algorithm determines the sending and receiving neighbor indices based on the core’s index value:


This complex pseudocode is easier to understand with visualization.
In the animation above, we first see that in Cannon’s algorithm, there’s an extremely long communication distance step.
To avoid ultra-long communication links, we thought of a ring data structure where the distance between any two neighbors is 1. If we flatten the ring onto a one-dimensional space, we achieve the interleave operation, implementing a movement scheme with a maximum link of 2. This greatly optimizes system communication efficiency.
Time complexity is reduced from O(n) to O(1).
Our discussion based on one-dimensional arrays naturally extends to two-dimensional grids. We perform interleave operations along both X and Y axes. In MeshGEMM, any single operation limits communication overhead to two hops.
The subsequent operations are not much different from Cannon’s algorithm. We adopt the same process of alternating movement and computation. The main steps of the MeshGEMM algorithm are:
A_sub and B_sub along two dimensions, forming N×N blocks distributed across cores. Each core receives one block of A_sub and B_sub. MeshGEMM then uses INTERLEAVE to initialize each core’s neighbor positions.C_sub = A_sub × B_sub + C_subA_sub along X-axis and B_sub along Y-axis to get new A'_sub and B'_sub for the next computation (as shown in ③ of Figure 7)C_subThe completion time of distributed GEMV mainly depends on an Allgather operation, which aggregates partial results from all selected cores and broadcasts the aggregated results back to all cores. Similar to GEMM, we analyze the same metrics.

MeshGEMV is the only method that fully complies with the PLMR model:
Pipeline Allreduce: Commonly used in TPU cluster systems and Cerebras. It limits routing resource usage to O(1) per core (satisfies R). However, its longest aggregation path is from tail to head core, as shown by the red line, spanning O(N) critical path (violates L).
Ring Allreduce: Commonly used in GPU cluster systems as the default configuration. It limits routing resource usage to O(1) (satisfies R). However, it spans O(N) hops on the critical path (violates L).
2-Way K-Tree Allreduce: We build a balanced K-tree reducing from two directions; its longest aggregation path is from head or tail core to the tree root core. The critical path is $O(N^{1/k} × K)$, which can address L. The maximum communication paths per root core is O(K), which can satisfy R constraints by adjusting K.
MeshGEMV algorithm main steps:
B_sub along two dimensions, forming N×N blocks distributed across computing cores. For vector A, MeshGEMV splits along the vector length, forming N blocks distributed on one axis and replicating A on the other axis. Each core receives one block of A_sub and B_sub. Then determine which cores form a group at each stage based on the K-tree structure for efficient aggregation results.A_sub × B_sub, computing their respective partial sums C_sub.C_subC_sub from all K-tree root coresAfter handling operators, we next begin to layout the entire model, where Prefill and Decode have some different challenges:
First, we proposed two different partitioning schemes.
Prefill partitioning scheme:

Decode partitioning scheme:

In Prefill partitioning: We partition the two dimensions of matrices along the X and Y axes of the PE array, achieving finer-grained, million-level parallelism than existing methods. The figure above shows the partitioning method for Self Attention and Feed Forward in the Prefill phase.
In Decode partitioning: When tensor dimensions are insufficient to achieve the high parallelism required for Decode, we replicate vectors in the orthogonal direction to data arrangement in LLMs. This method improves parallelism and ensures load balancing across all cores while avoiding additional communication operations, trading redundant storage for communication.
To eliminate matrix transposition, in Prefill, we designed transpose-free distributed GEMM. We proposed transpose-free operators, changing the communication direction, using transposed distributed GEMM (dist-GEMM-T) to compute Q@K^T during Prefill, avoiding the costly matrix transpose operation on NoC.
In the Decode process, since the operator bottleneck is in Allgather, redesigning the algorithm doesn’t bring benefits, so we pre-optimize model weight layout to avoid matrix transposition. Pre-optimizing model weight layout for Decode, directly reading transposed matrices onto the PE array, can eliminate matrix transposition in the MeshGEMV phase. Although this introduces overhead of rearranging weights between Prefill and Decode phases, with the super-strong communication capability on the NoC network, this overhead is almost negligible compared to generating one token.
KV cache management on PLMR devices is also not simple, requiring storing large amounts of data on distributed cores while adhering to local memory constraints (M) and allocating KV cache computation to achieve high parallelism (P).
Simply put: We implemented an adaptive KV Cache storage scheme on 2D Mesh. Through dynamic balancing, on-chip memory utilization is more充分.

Our findings and solutions include:
Through WaferLLM, we demonstrated the huge potential of Wafer Chips in LLM inference. We conducted comprehensive evaluations on Cerebras WSE-2, comparing with multiple state-of-the-art systems. Experimental results show that WaferLLM achieved significant breakthroughs in system performance, operator optimization, and energy efficiency.
We first evaluated the end-to-end performance comparison of WaferLLM with representative systems, including the distributed memory architecture T10 system and the shared memory architecture Ladder system.
Performance improvement over T10 system:
Although T10 considers memory constraints (M) and routing resource limitations (R) of PLMR devices, it cannot handle the core architecture of mesh NoC interconnects, thus unable to address different hop distances (L) or scale to millions of cores (P).
Performance improvement over Ladder system:
The Ladder system is designed for shared memory architectures and cannot adapt to PLMR device characteristics, resulting in inability to partition LLMs on millions of cores (P), expensive long-range NoC communication (L), inability to handle local memory constraints (M), and limited routing resources (R).


MeshGEMM Performance:
MeshGEMV Performance:
Scalability Analysis: Our operators demonstrate excellent scalability across different core configurations. For large-scale matrix operations like GEMM 8K, computation becomes bandwidth-constrained rather than latency-constrained. Increasing core count can boost aggregate network bandwidth, resolving performance bottlenecks.
Shift-based KV cache management achieves huge improvements over traditional concatenation-based methods (like PagedAttention):
| Model | Concatenation Method (PagedAttention) | Shift Method (WaferLLM) | Improvement |
|---|---|---|---|
| LLaMA3-8B | 382 tokens | 137,548 tokens | 360x |
| LLaMA2-13B | 16 tokens | 6,168 tokens | 385x |
This significant improvement stems from the balanced core utilization achieved by the shift method, addressing the data skew problem caused by concatenation methods.
We conducted a fair comparison between WaferLLM (based on Cerebras WSE-2) and NVIDIA A100 running vLLM. Both are manufactured using TSMC 7nm process.
GEMV Operation Comparison:
This reflects the advantages of wafer-scale devices through massive on-chip memory bandwidth and wafer-scale connections (connecting on-chip memory) compared to GPU’s PCB-level connections (connecting off-chip HBM).
Complete LLM Inference Comparison:
| Model | WaferLLM(WSE-2) | vLLM(A100) | Performance Improvement | Energy Efficiency Improvement |
|---|---|---|---|---|
| LLaMA3-8B | 2,480 tokens/s | 78.36 tokens/s | 31.6x | 1.4x |
| LLaMA2-13B | 1,848 tokens/s | 47.86 tokens/s | 38.6x | 1.7x |
Inference Speed Breakthrough:
While WaferLLM achieved significant performance improvements, we also observed some current limitations:
Performance degradation from GEMV to complete LLM: The 22x energy efficiency advantage of GEMV drops to 1.7x in complete LLM inference, mainly due to:
Hardware maturity impact: As a second-generation product, WSE-2 cores cannot fully overlap memory access and computation, edge core utilization is insufficient, and long-range NoC communication overhead still exists.
Software stack limitations: Cerebras’s current software stack has limited optimization compared to NVIDIA CUDA, affecting overall performance.
Despite these limitations, WaferLLM still achieved order-of-magnitude performance and energy efficiency improvements. We expect performance to further improve as wafer-scale AI computing continues to mature and these limitations are gradually resolved.
Having covered the dry theoretical parts of the paper, let’s circle back to the opening question: What is Wafer?
Technically, it’s wafer-scale integration, an implementation approach for larger hardware; but actually from a systems perspective, it’s a rebalancing of computation, communication, and memory access ratios. Our past distributed system designs will once again be active on the Wafer stage.
Wafer’s emergence challenges our existing understanding of computing paradigms. Its unique PLMR characteristics require us to redesign algorithms and systems at the software level. MeshGEMM and MeshGEMV are just the beginning; there’s still vast research space for future optimizations for sparse matrices, convolutions, and higher-order tensor operations.
Wafer’s potential lies not just as a tool for AI acceleration, but possibly as a glimpse of future computing architecture—a new paradigm that merges “distributed” and “on-chip.” We can boldly speculate that as AI chips reach a fever pitch, all the Tensor Parallel, Data Parallel, Pipeline Parallel, and even popular Expert Parallel approaches we’ve thought about will equally return to their essence in the face of new architectures. Our thinking returns to the most basic and essential systems problem: “scheduling of computation, communication, and memory access.”
Every step forward in technology is a process of “repeatedly seeking balance.” Facing new “dynamics,” we must not only “adapt strategies to circumstances” but also remember the “essence.” Continuously exploring the balance between computation and communication, hardware and software, theory and practice.
So, looking back at my blog posts from recent months, I have many feelings. After DeepSeek appeared, I became numb and self-defeating for a while, but thinking from a systems perspective, I found that most solutions to problems in this world are quite far-fetched. Whether vLLM or sglang, they all propose solutions in special environments and assumptions. And the systems-level problems of “computation, communication, memory access” seem completely unsolved.
After completing the Wafer project, I may have truly touched the boundary of the MLSystem field: abstracting complex hardware structures and balancing various resource overheads. And this is the System I love most.
此内容由惯性聚合(RSS阅读器)自动聚合整理,仅供阅读参考。 原文来自 — 版权归原作者所有。