Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts

Zhang, Shulai; Zheng, Ningxin; Lin, Haibin; Jiang, Ziheng; Bao, Wenlei; Jiang, Chengquan; Hou, Qi; Cui, Weihao; Zheng, Size; Chang, Li-Wen; Chen, Quan; Liu, Xin

arXiv:2502.19811 (cs)

[Submitted on 27 Feb 2025 (v1), last revised 4 Mar 2025 (this version, v3)]

Abstract: Mixture-of-experts (MoE) has been extensively employed to scale large language models to trillion-plus parameters while maintaining a fixed computational cost. Developing large MoE models in a distributed setting, however, incurs substantial communication overhead: with popular models and frameworks, the inter-device communication of an MoE layer can occupy 47% of the entire model execution time. Existing methods therefore pipeline the communication in an MoE layer with the computation so that the two overlap. However, these coarse-grained overlapping schemes notably impair computational efficiency, and their latency hiding is sub-optimal.
To this end, we present COMET, an optimized MoE system with fine-grained communication-computation overlapping. Leveraging data dependency analysis and task rescheduling, COMET achieves precise fine-grained overlapping of communication and computation. Through adaptive workload assignment, COMET effectively eliminates fine-grained communication bottlenecks and enhances its adaptability across various scenarios. Our evaluation shows that COMET accelerates the execution of a single MoE layer by $1.96\times$ and delivers a $1.71\times$ end-to-end speedup on average. COMET has been adopted in production on clusters at the ten-thousand-GPU scale, achieving savings of millions of GPU hours.
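
To make the motivation concrete, the following is a minimal toy model of why finer-grained overlapping hides more communication latency. It is a sketch under stated assumptions, not COMET's scheduling algorithm: the function names are hypothetical, the 47/53 split simply reuses the abstract's "can occupy 47%" figure and treats everything else as computation, and the model ignores the loss of computational efficiency that the abstract attributes to coarse-grained chunking.

```python
def serial_time(comm: float, comp: float) -> float:
    """No overlap: communication finishes before computation starts."""
    return comm + comp


def pipelined_time(comm: float, comp: float, chunks: int) -> float:
    """Two-stage pipeline over equally sized chunks (toy assumption).

    Chunk i can only be computed after chunk i has been communicated,
    so the first transfer and the last compute step are never hidden;
    every other step costs as much as the slower of the two stages.
    """
    if chunks < 1:
        raise ValueError("chunks must be >= 1")
    comm_step = comm / chunks
    comp_step = comp / chunks
    return comm_step + comp_step + (chunks - 1) * max(comm_step, comp_step)


if __name__ == "__main__":
    # Assumed split: 47 time units of communication, 53 of computation.
    comm, comp = 47.0, 53.0
    baseline = serial_time(comm, comp)
    for chunks in (1, 2, 8, 64):
        t = pipelined_time(comm, comp, chunks)
        print(f"chunks={chunks:3d}  time={t:7.3f}  speedup={baseline / t:.3f}x")
```

In this idealized model the total time approaches `max(comm, comp)` as the chunk count grows, so the attainable speedup is bounded by `(comm + comp) / max(comm, comp)`. Real systems fall short of that bound because splitting work into smaller pieces has its own costs, which is the trade-off the abstract describes.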

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2502.19811 [cs.DC]
	(or arXiv:2502.19811v3 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2502.19811

Submission history

From: Shulai Zhang
[v1] Thu, 27 Feb 2025 06:36:45 UTC (2,242 KB)
[v2] Sat, 1 Mar 2025 14:21:52 UTC (2,250 KB)
[v3] Tue, 4 Mar 2025 09:54:37 UTC (2,251 KB)
