TY - GEN
T1 - Efficient routing in sparse mixture-of-experts
AU - Zareapoor, Masoumeh
AU - Shamsolmoali, Pourya
AU - Vesaghati, Fateme
PY - 2024/9/9
Y1 - 2024/9/9
AB - Sparse Mixture-of-Experts (MoE) architectures offer the distinct benefit of substantially expanding the model’s parameter space without proportionally increasing the computational load on individual input tokens or samples. However, the efficacy of these models depends heavily on the routing strategy used to assign tokens to experts. Poor routing can leave experts under-trained or overly specialized, diminishing overall model performance. Previous approaches have relied on the Top-k router, which assigns each token to a small subset of experts. In this paper, we propose a routing mechanism that replaces the Top-k router with regularized optimal transport, leveraging the Sinkhorn algorithm to optimize token-expert matching. We conducted a comprehensive evaluation comparing the pre-training efficiency of our model against the GShard and Switch Transformers gating mechanisms under equivalent computational resources. The results demonstrate that our model converges faster during training, achieving a speedup of more than 2× over these baselines. Moreover, under the same computational constraints, our model achieves superior performance across eleven tasks from the GLUE and SuperGLUE benchmarks. We show that our approach improves token-expert matching in sparsely activated MoE models, offering substantial gains in both training efficiency and task performance.
KW - Sparse Mixture-of-Experts
KW - computational load
KW - MoE
KW - routing
U2 - 10.1109/IJCNN60899.2024.10650737
DO - 10.1109/IJCNN60899.2024.10650737
M3 - Conference contribution
SN - 9798350359329
T3 - International Joint Conference on Neural Networks (IJCNN): Proceedings
BT - 2024 International Joint Conference on Neural Networks (IJCNN): Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2024 International Joint Conference on Neural Networks (IJCNN)
Y2 - 30 June 2024 through 5 July 2024
ER -
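
Editor's note: the abstract's core idea, replacing a Top-k router with Sinkhorn-regularized optimal transport over token-expert affinities, can be sketched as below. This is a minimal illustrative sketch and not the authors' implementation; the function name, temperature, and iteration count are assumptions made for the example.

# Illustrative sketch only: Sinkhorn-normalized token-expert assignment
# for a sparse MoE router, assuming gating logits of shape [tokens, experts].
import torch

def sinkhorn_routing(logits: torch.Tensor, n_iters: int = 5, temperature: float = 0.05):
    """Balance token-expert assignments with Sinkhorn iterations in log space.

    logits: [tokens, experts] raw gating scores.
    Returns routing probabilities whose rows sum to 1 and whose column sums
    are pushed toward tokens/experts, plus a hard top-1 expert index per token.
    """
    tokens, experts = logits.shape
    log_p = logits / temperature  # entropic-OT kernel in log space
    log_col_target = torch.log(torch.tensor(tokens / experts))  # balanced expert load
    for _ in range(n_iters):
        # scale columns toward the uniform expert load tokens / experts
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True) + log_col_target
        # scale rows so each token's routing weights sum to 1
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)
    probs = log_p.exp()
    return probs, probs.argmax(dim=1)

# usage: route a batch of 8 tokens across 4 hypothetical experts
gate_scores = torch.randn(8, 4)
routing_probs, dispatch = sinkhorn_routing(gate_scores)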