Trend after trend in Artificial Intelligence (AI) has become the order of the day, and natural language processing has been the Holy Grail. New models keep arriving on the market while older ones improve. What has not been challenged is the basic premise that a model's size defines its utility, other factors notwithstanding. Now there is a new AI trend, and DeepMind seems to have found the secret sauce: a way to scale large language models cheaply. The company's latest paper dismantles the tried and tested practice of building ever larger models to improve performance.
The democratization of AI, or its horizontal extension, whichever dimension you focus on, cannot happen with the current size of models and the astronomical costs involved. Scale has been believed to be the only way to get worthwhile results. All the leaders are successful practitioners of this game, but its limitation is conspicuous and is holding back the growth of the technology. The well-known pack currently consists of OpenAI, Google, Microsoft, Nvidia, Facebook, and even DeepMind itself. The current paper takes a totally different line of thought. The company has found a key aspect of scaling large language models that no one has applied before.
DeepMind contends that the big tech companies committed to creating powerful language models are doing it wrong. Its new model, Chinchilla, shows that making models larger is neither the best nor the most efficient approach. The relationship between increasing model size and improving performance was established in 2020 by Kaplan and others. The logical corollary was that, as larger budgets became available to train models, most of the budget should go to making them bigger. DeepMind now takes the view that models built on this principle missed another key factor: data. There is no doubt that DeepMind’s findings will shape the language models of the future.
“DeepMind has found scaling the number of training tokens (that is, the amount of text data the model is fed) is as important as scaling model size.” Because the compute budget is fixed, similar proportions of it must go to increasing model size and the number of training tokens in order to reach the compute-optimal model (measured by minimal training loss). So the crux of the matter is: “For every doubling of model size the number of training tokens should also be doubled.” It means that a smaller model can vastly outperform a larger one if it is trained on a significantly higher number of tokens. Chinchilla, the new star, is a 70B-parameter model, 4 times smaller than the previous leader in language AI, Gopher, but trained on 4 times more data. Chinchilla “uniformly and significantly” outperforms Gopher, GPT-3, Jurassic-1, and Megatron-Turing NLG across a large set of language benchmarks.
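The trade-off described above can be made concrete with a little arithmetic. The sketch below is only an illustration, not DeepMind's method: it assumes the widely used rule of thumb that training compute is roughly 6 × parameters × tokens, and it uses an assumed, illustrative token count for Gopher (neither figure appears in this article). The function and constant names are hypothetical. Under those assumptions, a model 4 times smaller trained on 4 times more data costs about the same compute, and quadrupling the budget while following the doubling rule doubles both the model and the data.

```python
import math


def training_flops(params: float, tokens: float) -> float:
    """Rough training compute, assuming the rule of thumb C ~ 6 * N * D."""
    return 6.0 * params * tokens


def scale_equally(params: float, tokens: float, compute_multiplier: float) -> tuple[float, float]:
    """Spend extra compute evenly: parameters and tokens each grow by sqrt(multiplier)."""
    factor = math.sqrt(compute_multiplier)
    return params * factor, tokens * factor


# Chinchilla's size comes from the article; Gopher is stated to be 4 times larger.
CHINCHILLA_PARAMS = 70e9
GOPHER_PARAMS = 4 * CHINCHILLA_PARAMS

# Assumption for illustration only: a baseline token count for Gopher.
GOPHER_TOKENS = 300e9
# The article states Chinchilla saw 4 times more data than Gopher.
CHINCHILLA_TOKENS = 4 * GOPHER_TOKENS

if __name__ == "__main__":
    print("Gopher compute    :", training_flops(GOPHER_PARAMS, GOPHER_TOKENS))
    print("Chinchilla compute:", training_flops(CHINCHILLA_PARAMS, CHINCHILLA_TOKENS))

    # With 4 times the budget, the doubling rule doubles both model size and data.
    bigger_params, bigger_tokens = scale_equally(CHINCHILLA_PARAMS, CHINCHILLA_TOKENS, 4.0)
    print("Scaled parameters :", bigger_params)
    print("Scaled tokens     :", bigger_tokens)
```

The point of the square root in `scale_equally` is that compute is a product of the two quantities, so growing each by the same factor is what keeps model size and token count in step as the budget rises.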
THE WORLD AWAITS A MAJOR BREAKTHROUGH IN ARTIFICIAL INTELLIGENCE NLP FOR ITS DEMOCRATIZATION.