May 1, 2023
Exploiting scale to improve information absorption has recently become central to the success of deep learning, and transformers have become the de facto choice, achieving numerous breakthroughs on real-world applications. Despite their enormous success, gigantic transformers suffer not only from exorbitant computational and memory footprints during training but also from severe collapse, as evidenced by a high degree of parameter redundancy. Recently proposed Sparsely-activated Mixture-of-Experts (SMoE) models have shown promise in mitigating the issue of training efficiency, yet they have critical limitations. In particular, SMoE models are prone to redundant experts due to representational collapse, and they scale poorly during inference and downstream fine-tuning, primarily because the learned routing policy overfits to the number of experts activated during training. While recent research efforts have predominantly focused on improving routing policies to encourage expert specialization, our work targets the overlooked scalability bottleneck of SMoEs in order to effectively scale large transformers. To this end, we propose a new plug-and-play training framework, SMoE-Dropout, which enables scaling transformers to better accuracy in the full-capacity setting without collapse. Specifically, SMoE-Dropout consists of a randomly initialized and fixed router network that activates experts and gradually increases their number as training progresses. SMoE-Dropout naturally provides a "self-slimmable" property: transformers trained with it deliver consistently better performance as more experts are activated during inference and downstream fine-tuning, subject to resource availability. Our extensive experiments across diverse transformer architectures and a variety of tasks validate superior performance and substantial computation savings compared to densely trained baselines with equivalent parameter counts. More precisely, our trained BERT outperforms its densely trained counterpart with consistent improvements of 1.03%, 0.78%, and 1.09% on the challenging reasoning tasks ASDiv-A, MAWPS, and SVAMP, respectively. Code and models will be publicly released.
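Below is a minimal sketch of the routing idea described in the abstract: a randomly initialized, frozen router selects the top-k experts per token, while a schedule gradually increases k over training. This is an illustrative reconstruction under standard top-k gating assumptions; the module names, the expert architecture, and the `scheduled_k` schedule are hypothetical and not the authors' released implementation.

```python
# Sketch of an SMoE-Dropout-style layer (hypothetical, not the official code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SMoEDropoutLayer(nn.Module):
    def __init__(self, d_model: int, num_experts: int, d_hidden: int):
        super().__init__()
        # Router is randomly initialized and kept fixed (no gradient updates).
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.router.weight.requires_grad_(False)
        # Experts are small feed-forward networks; they (and the rest of the
        # transformer) are the only trained components.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )
        self.num_experts = num_experts

    def forward(self, x: torch.Tensor, k: int) -> torch.Tensor:
        # x: (tokens, d_model); k: number of experts activated at this step.
        scores = F.softmax(self.router(x), dim=-1)       # frozen routing scores
        topk_scores, topk_idx = scores.topk(k, dim=-1)   # k best experts per token
        out = torch.zeros_like(x)
        for slot in range(k):
            idx = topk_idx[:, slot]
            weight = topk_scores[:, slot].unsqueeze(-1)
            for e in range(self.num_experts):
                mask = idx == e
                if mask.any():
                    out[mask] += weight[mask] * self.experts[e](x[mask])
        return out

def scheduled_k(step: int, total_steps: int, num_experts: int, k_min: int = 2) -> int:
    """Gradually increase the number of activated experts as training progresses."""
    frac = min(step / max(total_steps, 1), 1.0)
    return min(num_experts, k_min + int(frac * (num_experts - k_min)))
```

At inference or fine-tuning time, the same layer can be run with any k up to `num_experts`, which is one way to read the "self-slimmable" property: smaller k trades accuracy for compute, while the full-capacity setting activates all experts.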