FlashMoE: Fast Distributed MoE in a Single Kernel

Cornell University
NeurIPS 2025

*Now at Stanford University

Comparison of FlashMoE with state-of-the-art techniques that either do not overlap communication and computation (left, top) or overlap them only partially (left, middle). FlashMoE is a persistent kernel that fuses all computation and communication of the MoE operator (left, bottom). FlashMoE implements device-initiated computation (gate, expert FFN, scale) and communication tasks (right).

Abstract

The computational sparsity of Mixture-of-Experts (MoE) models enables sub-linear growth in compute cost as model size increases, offering a scalable path to training massive neural networks. However, existing implementations suffer from low GPU utilization, significant latency overhead, and a fundamental inability to exploit task locality, primarily due to CPU-managed scheduling, host-initiated communication, and frequent kernel launches. To overcome these limitations, we develop FlashMoE, a fully GPU-resident MoE operator that fuses expert computation and inter-GPU communication into a single persistent GPU kernel. FlashMoE enables fine-grained pipelining of the dispatch, compute, and combine phases, eliminating kernel-launch overhead and reducing idle gaps. Unlike existing work, FlashMoE replaces bulk-synchronous collectives with one-sided, device-initiated, inter-GPU (R)DMA transfers, eliminating bloated or redundant network payloads in sparsely activated layers. Evaluated on an 8-H100 GPU node with MoE models of up to 128 experts and 16K-token sequences, FlashMoE achieves up to 5.7× higher throughput than state-of-the-art baselines, along with higher GPU utilization, lower latency, and better overlap efficiency, despite using FP32 while the baselines use FP16. FlashMoE shows that principled co-design of GPU kernels and hardware is key to unlocking the performance ceiling of large-scale distributed ML. We provide code at https://github.com/osayamenja/FlashMoE.
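
To make the single-kernel idea concrete, the sketch below shows a minimal persistent CUDA kernel in which thread blocks repeatedly claim gate, expert-FFN, and scale/combine tasks from a device-resident queue instead of being relaunched per phase. The task structure, queue layout, and kernel name are hypothetical simplifications for illustration only, not the FlashMoE implementation; see the repository for the real kernel.

// Illustrative sketch only: a persistent kernel whose thread blocks pull MoE
// sub-tasks (gate, expert FFN, scale/combine) from a device-resident queue
// until the queue drains. Task names and the queue layout are hypothetical.
#include <cuda_runtime.h>

enum TaskKind { GATE = 0, EXPERT_FFN = 1, SCALE_COMBINE = 2 };

struct Task {
    int kind;     // one of TaskKind
    int tile_id;  // which tile of tokens this task covers (illustrative)
};

__global__ void persistent_moe_kernel(const Task* queue, int* head, int n_tasks) {
    __shared__ int slot;
    // Each thread block loops, claiming tasks with an atomic on a shared head
    // index; no kernel relaunch separates dispatch, compute, and combine.
    while (true) {
        if (threadIdx.x == 0) slot = atomicAdd(head, 1);
        __syncthreads();
        if (slot >= n_tasks) return;            // queue drained: block exits

        Task t = queue[slot];
        switch (t.kind) {
            case GATE:          /* compute routing scores for tile t.tile_id */ break;
            case EXPERT_FFN:    /* run expert GEMMs for tile t.tile_id       */ break;
            case SCALE_COMBINE: /* scale expert outputs and combine them     */ break;
        }
        __syncthreads();  // guard the shared slot before the next claim
    }
}

int main() {
    // Error checking omitted for brevity.
    const int n_tasks = 6;
    Task h_tasks[n_tasks] = {{GATE, 0}, {GATE, 1}, {EXPERT_FFN, 0},
                             {EXPERT_FFN, 1}, {SCALE_COMBINE, 0}, {SCALE_COMBINE, 1}};
    Task* d_tasks; int* d_head;
    cudaMalloc(&d_tasks, sizeof(h_tasks));
    cudaMalloc(&d_head, sizeof(int));
    cudaMemcpy(d_tasks, h_tasks, sizeof(h_tasks), cudaMemcpyHostToDevice);
    cudaMemset(d_head, 0, sizeof(int));
    persistent_moe_kernel<<<4, 128>>>(d_tasks, d_head, n_tasks);  // single launch
    cudaDeviceSynchronize();
    cudaFree(d_tasks); cudaFree(d_head);
    return 0;
}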

System Overview

Fused kernel architecture.
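
For the communication side, the following sketch illustrates device-initiated, one-sided data movement with NVSHMEM in the spirit of FlashMoE's dispatch and combine: a thread block pushes a tile to a peer GPU's symmetric buffer from inside the kernel, with no host round-trip or bulk-synchronous collective. The buffer size, peer choice, and kernel name are illustrative assumptions, not FlashMoE's actual routing or payload layout.

// Illustrative sketch only: device-initiated, one-sided inter-GPU transfer
// with NVSHMEM; heavily simplified (one symmetric buffer, one fixed peer,
// no real token routing).
#include <nvshmem.h>
#include <nvshmemx.h>
#include <cuda_runtime.h>

__global__ void dispatch_tile(float* sym_buf, int nelems, int peer) {
    // The whole thread block cooperatively pushes its tile into the peer
    // GPU's symmetric buffer, initiated entirely from the device.
    nvshmemx_float_put_nbi_block(sym_buf, sym_buf, nelems, peer);
    __syncthreads();
    if (threadIdx.x == 0) nvshmem_quiet();  // wait for the put to complete
}

int main() {
    nvshmem_init();
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    const int nelems = 1024;
    float* sym_buf = (float*) nvshmem_malloc(nelems * sizeof(float));
    cudaMemset(sym_buf, 0, nelems * sizeof(float));  // stand-in for real token data

    int peer = (mype + 1) % npes;            // send to the "next" GPU
    dispatch_tile<<<1, 256>>>(sym_buf, nelems, peer);
    cudaDeviceSynchronize();

    nvshmem_barrier_all();
    nvshmem_free(sym_buf);
    nvshmem_finalize();
    return 0;
}

Such a program would be compiled with nvcc, linked against NVSHMEM, and launched with NVSHMEM's launcher or an MPI bootstrap, one process per GPU.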

Results

Plots: GPU SM utilization; scaling GPUs; scaling tokens (4 and 8 GPUs); scaling experts (4 and 8 GPUs).

BibTeX

@article{aimuyo2025FlashMoE,
  title={FlashMoE: Fast Distributed MoE in a Single Kernel},
  author={Aimuyo, Osayamen Jonathan and Oh, Byungsoo and Singh, Rachee},
  journal={Advances in Neural Information Processing Systems},
  year={2025},
  url={https://neurips.cc/virtual/2025/poster/119124}
}