Efficient routing in sparse mixture-of-experts

Masoumeh Zareapoor, Pourya Shamsolmoali*, Fateme Vesaghati

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Sparse Mixture-of-Experts (MoE) architectures provide the distinct benefit of substantially expanding the model’s parameter space without proportionally increasing the computational load on individual input tokens or samples. However, the efficacy of these models heavily depends on the routing strategy used to assign tokens to experts. Poor routing can lead to under-trained or overly specialized experts, diminishing overall model performance. Previous approaches have relied on the Top-k router, where each token is assigned to a subset of experts. In this paper, we propose a routing mechanism that replaces the Top-k router with regularized optimal transport, leveraging the Sinkhorn algorithm to optimize token-expert matching. We conducted a comprehensive evaluation of our model’s pre-training efficiency, using computational resources equivalent to those employed by the GShard and Switch Transformers gating mechanisms. The results demonstrate that our model expedites training convergence, achieving a speedup of over 2× compared to these baseline models. Moreover, under the same computational constraints, our model exhibits superior performance across eleven tasks from the GLUE and SuperGLUE benchmarks. We show that our model contributes to the optimization of token-expert matching in sparsely activated MoE models, offering substantial gains in both training efficiency and task performance.
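
The abstract only names the technique, so the following is a minimal, hypothetical sketch of Sinkhorn-based token-expert balancing, not the authors' implementation. The function and parameter names (sinkhorn_route, gate_logits, n_iters, epsilon) and the uniform marginals are illustrative assumptions; the sketch shows how entropic-regularized optimal transport can turn router scores into a load-balanced soft assignment, in contrast to independent Top-k selection per token.

import numpy as np
from scipy.special import logsumexp

def sinkhorn_route(gate_logits, n_iters=10, epsilon=0.05):
    """Illustrative Sinkhorn balancing of a tokens-by-experts score matrix.

    gate_logits: array of shape (num_tokens, num_experts) with router scores.
    Returns a transport plan whose rows sum to 1/num_tokens (each token's
    mass) and whose columns are pushed toward 1/num_experts (equal load).
    """
    num_tokens, num_experts = gate_logits.shape
    # Entropic regularization: scaled scores act as the log Gibbs kernel.
    log_k = gate_logits / epsilon
    log_u = np.zeros(num_tokens)    # row scaling, log domain
    log_v = np.zeros(num_experts)   # column scaling, log domain
    log_row_marginal = np.log(np.full(num_tokens, 1.0 / num_tokens))
    log_col_marginal = np.log(np.full(num_experts, 1.0 / num_experts))
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the target marginals.
        log_u = log_row_marginal - logsumexp(log_k + log_v[None, :], axis=1)
        log_v = log_col_marginal - logsumexp(log_k + log_u[:, None], axis=0)
    return np.exp(log_u[:, None] + log_k + log_v[None, :])

# Example usage (hypothetical numbers): route 8 tokens to 4 experts, then
# take each token's highest-mass expert as one way to discretize the plan.
rng = np.random.default_rng(0)
plan = sinkhorn_route(rng.normal(size=(8, 4)))
expert_choice = plan.argmax(axis=1)

A practical router would typically run such iterations per batch on the gate scores and combine the resulting assignment with the usual expert capacity limits; the sketch omits those details.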
Original language: English
Title of host publication: 2024 International Joint Conference on Neural Networks (IJCNN): Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Number of pages: 8
ISBN (Electronic): 9798350359312
ISBN (Print): 9798350359329
DOIs
Publication status: Published - 09 Sept 2024
Event: 2024 International Joint Conference on Neural Networks (IJCNN) - Yokohama, Japan
Duration: 30 Jun 2024 – 05 Jul 2024

Publication series

Name: International Joint Conference on Neural Networks (IJCNN): Proceedings
ISSN (Print): 2161-4393
ISSN (Electronic): 2161-4407

Conference

Conference: 2024 International Joint Conference on Neural Networks (IJCNN)
Country/Territory: Japan
City: Yokohama
Period: 30/06/2024 – 05/07/2024

Keywords

  • Sparse Mixture-of-Experts
  • computational load
  • MoE
  • routing
