
120. Liam Fedus and Barrett Zoph - AI scaling with mixture of expert models

Towards Data Science

04/20/22

40m

About

AI scaling has really taken off. Ever since GPT-3 came out, it’s become clear that moving beyond narrow AI toward more generally intelligent systems will mean massively scaling up the size of our models, the amount of processing power they consume and the amount of data they’re trained on, all at the same time.

That’s led to a huge wave of highly scaled models that are incredibly expensive to train, largely because of their enormous compute budgets. But what if there was a more flexible way to scale AI — one that allowed us to decouple model size from compute budgets, so that we could chart a more compute-efficient course to scale?

That’s the promise of so-called mixture of experts models, or MoEs. Unlike more traditional transformers, MoEs don’t update all of their parameters on every training pass. Instead, they route each input intelligently to sub-models called experts, which can each specialize in different tasks. On a given training pass, only the experts that an input was routed to have their parameters updated. The result is a sparse model, a more compute-efficient training process, and a new potential path to scale.
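To make the routing idea concrete, here is a minimal sketch of top-1 ("switch"-style) expert routing in plain Python/NumPy. It is not the guests’ actual implementation; the shapes, the softmax router and the probability scaling are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n_tokens, d_model, n_experts = 8, 16, 4

# Each "expert" is an independent feed-forward weight matrix.
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

# The router is a small learned projection that scores every expert for each token.
router_w = rng.normal(size=(d_model, n_experts))

tokens = rng.normal(size=(n_tokens, d_model))

# Router logits -> softmax probabilities -> pick the single best expert per token.
logits = tokens @ router_w
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)
chosen = probs.argmax(axis=-1)  # top-1 routing decision, shape (n_tokens,)

# Only the chosen expert processes each token, so in training only that
# expert's parameters would receive gradients for that token.
output = np.zeros_like(tokens)
for e in range(n_experts):
    mask = chosen == e
    if mask.any():
        # Scaling by the router probability is what keeps the routing
        # differentiable in a real implementation.
        output[mask] = (tokens[mask] @ experts[e]) * probs[mask, e][:, None]

print(chosen)  # which expert handled each token
```

Adding more experts grows the parameter count, but each token still touches only one expert, which is the sense in which model size gets decoupled from per-token compute.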

Google has been pushing the frontier of research on MoEs, and my two guests today in particular have been involved in pioneering work on that strategy (among many others!). Liam Fedus and Barrett Zoph are research scientists at Google Brain, and they joined me to talk about AI scaling, sparsity and the present and future of MoE models on this episode of the TDS podcast.

***

Intro music:

Artist: Ron Gelinas

Track Title: Daybreak Chill Blend (original mix)

Link to Track: https://youtu.be/d8Y2sKIgFWc

***

Chapters:
  • 2:15 Guests’ backgrounds
  • 8:00 Understanding specialization
  • 13:45 Speculations for the future
  • 21:45 Switch transformer versus dense net
  • 27:30 More interpretable models
  • 33:30 Assumptions and biology
  • 39:15 Wrap-up

Previous Episode

There’s an idea in machine learning that most of the progress we see in AI doesn’t come from new algorithms or model architectures. Instead, some argue, progress almost entirely comes from scaling up compute power, datasets and model sizes — and besides those three ingredients, nothing else really matters.

Through that lens, the history of AI becomes the history of processing power and compute budgets. And if that turns out to be true, then we might be able to do a decent job of predicting AI progress by studying trends in compute power and their impact on AI development.

And that’s why I wanted to talk to Jaime Sevilla, an independent researcher and AI forecaster, and affiliate researcher at Cambridge University’s Centre for the Study of Existential Risk, where he works on technological forecasting and understanding trends in AI in particular. His work’s been cited in a lot of cool places, including Our World In Data, who used his team’s data to put together an exposé on trends in compute. Jaime joined me to talk about compute trends and AI forecasting on this episode of the TDS podcast.

***

Intro music:

Artist: Ron Gelinas

Track Title: Daybreak Chill Blend (original mix)

Link to Track: https://youtu.be/d8Y2sKIgFWc

***

Chapters:
  • 2:00 Trends in compute
  • 4:30 Transformative AI
  • 13:00 Industrial applications
  • 19:00 GPT-3 and scaling
  • 25:00 The two papers
  • 33:00 Biological anchors
  • 39:00 Timing of projects
  • 43:00 The trade-off
  • 47:45 Wrap-up

Next Episode

If the name data2vec sounds familiar, that’s probably because it made quite a splash on social and even traditional media when it came out, about two months ago. It’s an important entry in what is now a growing list of strategies that are focused on creating individual machine learning architectures that handle many different data types, like text, image and speech.

Most self-supervised learning techniques involve taking some input data (say, an image or a piece of text), masking out certain components of it (say, by blacking out pixels or words), and training the model to predict those masked-out components.

That “filling in the blanks” task is hard enough to force AIs to learn facts about their data that generalize well, but it also means training models to perform tasks that are very different depending on the input data type. Filling in blacked out pixels is quite different from filling in blanks in a sentence, for example.

So what if there was a way to come up with one task that we could use to train machine learning models on any kind of data? That’s where data2vec comes in.
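The episode digs into the details, but the core idea hinted at in the chapter list is a student-teacher setup: a student network sees a masked view of the input and is trained to predict the latent representations a teacher network produces from the full input, so the training task looks the same for text, images or speech. The sketch below is a heavily simplified, modality-agnostic illustration of that idea; the linear encoder, the loss and the EMA update are assumptions for clarity, not the exact data2vec recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, d_in, d_latent = 16, 32, 8  # "items" could be image patches, text tokens or audio frames

def encode(x, w):
    """A stand-in encoder: one linear layer with a tanh nonlinearity."""
    return np.tanh(x @ w)

student_w = rng.normal(scale=0.1, size=(d_in, d_latent))
teacher_w = student_w.copy()  # the teacher starts as a copy of the student

x = rng.normal(size=(n_items, d_in))  # a batch of inputs from any modality

# Mask a random subset of items; the student only sees the masked view.
mask = rng.random(n_items) < 0.4
x_masked = x.copy()
x_masked[mask] = 0.0

# Targets are the teacher's latent representations of the *unmasked* input,
# so the "fill in the blanks" task no longer depends on the data type.
targets = encode(x, teacher_w)
preds = encode(x_masked, student_w)

# Regress the student's predictions onto the teacher's latents at masked positions.
loss = np.mean((preds[mask] - targets[mask]) ** 2)

# In training, the student would be updated by gradient descent on `loss`,
# while the teacher slowly tracks the student via an exponential moving average.
ema_decay = 0.999
teacher_w = ema_decay * teacher_w + (1 - ema_decay) * student_w

print(f"masked items: {mask.sum()}, loss: {loss:.4f}")
```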

For this episode of the podcast, I’m joined by Alexei Baevski, a researcher at Meta AI and one of the creators of data2vec. In addition to data2vec, Alexei has been involved in quite a bit of pioneering work on text and speech models, including wav2vec, Facebook’s widely publicized unsupervised speech model. Alexei joined me to talk about how data2vec works and what’s next for that research direction, as well as the future of multi-modal learning.

***

Intro music:

Artist: Ron Gelinas

Track Title: Daybreak Chill Blend (original mix)

Link to Track: https://youtu.be/d8Y2sKIgFWc

***

Chapters:
  • 2:00 Alexei’s background
  • 10:00 Software engineering knowledge
  • 14:10 Role of data2vec in progression
  • 30:00 Delta between student and teacher
  • 38:30 Losing interpreting ability
  • 41:45 Influence of greater abilities
  • 49:15 Wrap-up
