Abstract
The computational sparsity of Mixture-of-Experts (MoE) models enables sub-linear growth in compute cost as model size increases, offering a scalable path to training massive neural networks. However, existing implementations suffer from low GPU utilization, significant latency overhead, and a fundamental inability to leverage task locality, primarily due to CPU-managed scheduling, host-initiated communication, and frequent kernel launches. To overcome these limitations, we develop FlashMoE, a fully GPU-resident MoE operator that fuses expert computation and inter-GPU communication into a single persistent GPU kernel. FlashMoE enables fine-grained pipelining of the dispatch, compute, and combine phases, eliminating launch overheads and reducing idle gaps. Unlike existing work, FlashMoE replaces bulk-synchronous collectives with one-sided, device-initiated, inter-GPU (R)DMA transfers, unlocking payload efficiency by eliminating the bloated or redundant network payloads that sparsely activated layers otherwise incur. When evaluated on an 8-H100 GPU node with MoE models of up to 128 experts and 16K-token sequences, FlashMoE achieves up to 9× higher GPU utilization, 6× lower latency, 5.7× higher throughput, and 4× better overlap efficiency than state-of-the-art baselines, despite computing in FP32 while the baselines use FP16. FlashMoE shows that principled GPU kernel–hardware co-design is key to unlocking the performance ceiling of large-scale distributed ML. We provide code at https://github.com/osayamenja/FlashMoE.
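To make the phrase "one-sided, device-initiated (R)DMA" concrete, here is a minimal sketch built on NVSHMEM's device API rather than on FlashMoE's actual transfer code; the kernel name dispatch_tile, the ring send pattern, and the buffer sizes are assumptions for illustration only.

    // Minimal sketch, assuming NVSHMEM as the transport: each GPU ships a packed
    // token tile directly to the GPU holding the target expert and raises a flag
    // there, with no host-driven all-to-all collective. Not FlashMoE's actual code.
    #include <cuda_runtime.h>
    #include <cstdint>
    #include <nvshmem.h>
    #include <nvshmemx.h>

    __global__ void dispatch_tile(const float* tokens, float* remote_buf,
                                  uint64_t* flag, size_t tile_elems, int dest_pe) {
        if (threadIdx.x == 0 && blockIdx.x == 0) {
            // One-sided put into symmetric memory on dest_pe, ordered before the signal.
            nvshmem_putmem(remote_buf, tokens, tile_elems * sizeof(float), dest_pe);
            nvshmem_fence();
            nvshmemx_signal_op(flag, 1, NVSHMEM_SIGNAL_ADD, dest_pe);
            // In this ring example every PE also receives one tile: poll the local flag
            // in-kernel, after which an expert-compute stage could consume remote_buf.
            nvshmem_signal_wait_until(flag, NVSHMEM_CMP_GE, 1);
        }
    }

    int main() {
        nvshmem_init();
        const size_t tile_elems = 1024;
        int mype = nvshmem_my_pe();
        int dest = (mype + 1) % nvshmem_n_pes();          // send to the next PE in a ring
        // Symmetric allocations: the same object exists at every PE.
        float*    tokens = (float*)    nvshmem_malloc(tile_elems * sizeof(float));
        float*    rbuf   = (float*)    nvshmem_malloc(tile_elems * sizeof(float));
        uint64_t* flag   = (uint64_t*) nvshmem_malloc(sizeof(uint64_t));
        cudaMemset(flag, 0, sizeof(uint64_t));
        cudaDeviceSynchronize();
        nvshmem_barrier_all();                            // all flags zeroed before any signal
        dispatch_tile<<<1, 32>>>(tokens, rbuf, flag, tile_elems, dest);
        cudaDeviceSynchronize();
        nvshmem_free(tokens); nvshmem_free(rbuf); nvshmem_free(flag);
        nvshmem_finalize();
        return 0;
    }

Because the put and the signal are issued from inside the kernel, dispatch can overlap with expert compute on the same GPU without a host round-trip; a program like this is compiled with relocatable device code (nvcc -rdc=true), linked against NVSHMEM, and run with one process per GPU.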
System Overview
Figure: Fused kernel architecture.
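The core idea of the fused design, one persistent kernel whose blocks pull dispatch, expert-compute, and combine work from a device-resident queue, can be sketched as follows. This is a simplified illustration under assumed data structures (WorkItem, g_queue, and empty stage bodies), not the scheduler FlashMoE actually ships.

    // Minimal sketch of a persistent fused MoE kernel: thread blocks repeatedly claim
    // work items from a device-resident queue and run dispatch -> expert GEMM -> combine
    // stages without ever returning control to the host. Names are illustrative only.
    #include <cuda_runtime.h>
    #include <cstdio>

    struct WorkItem { int stage; int tile; };   // stage: 0 = dispatch, 1 = expert GEMM, 2 = combine

    __device__ int g_head = 0;                  // next unclaimed queue slot
    __device__ int g_num_items = 0;             // total items in the schedule
    __device__ WorkItem g_queue[1024];          // device-resident task queue

    __global__ void persistent_moe_kernel() {
        __shared__ int idx;
        for (;;) {
            if (threadIdx.x == 0) idx = atomicAdd(&g_head, 1);  // the block claims one task
            __syncthreads();
            if (idx >= g_num_items) return;                     // queue drained: block retires
            WorkItem w = g_queue[idx];
            switch (w.stage) {
                case 0: /* pack tokens of tile w.tile and issue a device-initiated put */ break;
                case 1: /* run the expert GEMM for tile w.tile from staged inputs      */ break;
                case 2: /* scale by gate weights and accumulate into the output        */ break;
            }
            __syncthreads();                                    // reuse shared idx next iteration
        }
    }

    int main() {
        // Host builds a tiny three-stage schedule and copies it into the device queue once.
        WorkItem items[] = {{0, 0}, {0, 1}, {1, 0}, {1, 1}, {2, 0}, {2, 1}};
        int n = sizeof(items) / sizeof(items[0]);
        cudaMemcpyToSymbol(g_queue, items, sizeof(items));
        cudaMemcpyToSymbol(g_num_items, &n, sizeof(int));
        // One launch covers the whole MoE layer; blocks stay resident until the queue drains.
        persistent_moe_kernel<<<4, 128>>>();
        cudaDeviceSynchronize();
        printf("%s\n", cudaGetErrorString(cudaGetLastError()));
        return 0;
    }

Because the blocks never exit, the entire MoE layer costs a single kernel launch; the per-stage launches and host-side synchronization that conventional pipelines incur are avoided, which is where the reduction in launch overhead and idle gaps comes from.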
Results
Figures: GPU SM Utilization; Scaling GPUs; Scaling Tokens (4 GPUs); Scaling Tokens (8 GPUs); Scaling Experts (4 GPUs); Scaling Experts (8 GPUs).
BibTeX
@article{aimuyo2025FlashMoE,
  title={FlashMoE: Fast Distributed MoE in a Single Kernel},
  author={Aimuyo, Osayamen Jonathan and Oh, Byungsoo and Singh, Rachee},
  journal={Advances in Neural Information Processing Systems},
  year={2025},
  url={https://neurips.cc/virtual/2025/poster/119124}
}