May 1, 2023
Exploiting scale to improve information absorption has recently become central to the success of deep learning, and transformers have become the de facto choice, achieving numerous breakthroughs on real-world applications. Despite their enormous success, gigantic transformers suffer not only from exorbitant computational and memory footprints during training but also from severe collapse, as evidenced by a high degree of parameter redundancy. Recently proposed Sparsely-activated Mixture-of-Experts (SMoE) models have shown promise in mitigating the issue of training efficiency, yet they have critical limitations. In particular, SMoE models are prone to redundant experts due to representational collapse, and they scale poorly during inference and downstream fine-tuning, primarily because the learned routing policy overfits to the number of experts activated during training. While recent research efforts have predominantly focused on improving routing policies to encourage expert specialization, our work targets the overlooked scalability bottleneck of SMoEs in order to effectively scale large transformers. To this end, we propose a new plug-and-play training framework, SMoE-Dropout, which enables scaling transformers to better accuracy in the full-capacity setting without collapse. Specifically, SMoE-Dropout consists of a randomly initialized and fixed router network that activates experts and gradually increases their number as training progresses. SMoE-Dropout naturally provides a "self-slimmable" property: transformers trained with it deliver consistently better performance as more experts are activated during inference and downstream fine-tuning, subject to resource availability. Our extensive experiments across diverse transformer architectures and a variety of tasks validate superior performance and substantial computation savings compared to densely trained baselines with equivalent parameter counts. More precisely, our trained BERT outperforms its densely trained counterpart with consistent improvements of 1.03%, 0.78%, and 1.09% on the challenging reasoning tasks ASDiv-A, MAWPS, and SVAMP, respectively. Code and models will be publicly released.
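Below is a minimal sketch of the routing idea described in the abstract: a randomly initialized, frozen router selects the top-k experts per token, while a schedule gradually increases k over training. This is an illustrative reconstruction under standard top-k gating assumptions; the module names, the expert architecture, and the `scheduled_k` schedule are hypothetical and not the authors' released implementation.

```python
# Sketch of an SMoE-Dropout-style layer (hypothetical, not the official code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SMoEDropoutLayer(nn.Module):
    def __init__(self, d_model: int, num_experts: int, d_hidden: int):
        super().__init__()
        # Router is randomly initialized and kept fixed (no gradient updates).
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.router.weight.requires_grad_(False)
        # Experts are small feed-forward networks; they (and the rest of the
        # transformer) are the only trained components.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )
        self.num_experts = num_experts

    def forward(self, x: torch.Tensor, k: int) -> torch.Tensor:
        # x: (tokens, d_model); k: number of experts activated at this step.
        scores = F.softmax(self.router(x), dim=-1)       # frozen routing scores
        topk_scores, topk_idx = scores.topk(k, dim=-1)   # k best experts per token
        out = torch.zeros_like(x)
        for slot in range(k):
            idx = topk_idx[:, slot]
            weight = topk_scores[:, slot].unsqueeze(-1)
            for e in range(self.num_experts):
                mask = idx == e
                if mask.any():
                    out[mask] += weight[mask] * self.experts[e](x[mask])
        return out

def scheduled_k(step: int, total_steps: int, num_experts: int, k_min: int = 2) -> int:
    """Gradually increase the number of activated experts as training progresses."""
    frac = min(step / max(total_steps, 1), 1.0)
    return min(num_experts, k_min + int(frac * (num_experts - k_min)))
```

At inference or fine-tuning time, the same layer can be run with any k up to `num_experts`, which is one way to read the "self-slimmable" property: smaller k trades accuracy for compute, while the full-capacity setting activates all experts.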