DailyPost 3025
AI’S PEAK DATA PROBLEM
The average professional has found a panacea for all his problems in ChatGPT, or in the other LLMs openly available. The intricacies of backend AI research are none of his concern, yet that research needs to keep happening for the exponential progress of AI, which is already being taken for granted. Just before AI's current accomplishments became public and its large-scale use began, it was believed that the information available in the world was so huge it would be next to impossible to handle, let alone put to any purposeful use.
In addition, knowledge and information, by way of documentation, was doubling every few years. Yet in just two years of AI taking centre stage, a problem exactly the opposite of what we had imagined has come to afflict the future development of AI. OpenAI cofounder Ilya Sutskever has recently called it the "peak data" problem. The announcement that the AI industry was hit by such a problem sent tremors through it. In simple language, it means that all the useful data on the internet has already been used to train the AI models. This process, known as pre-training, produced the generative AI gains we see today, including ChatGPT.
Sutskever said that the improvements have now slowed, and predicted that this era "will unquestionably end." The sentiment on which trillions of dollars are riding in the stock market is that AI models will continue to get better. Paradoxically, the AI experts are not worried, as they have found a way to get around the data wall. It is a relatively new technique that "helps AI models 'think' through the challenging tasks for longer." This revolutionary approach to the peak data problem is called test-time or inference-time compute. What is the process? Queries are sliced into smaller tasks. Each task is turned into a new prompt that the model tackles.
Each of these steps requires running a new request; this is the inference stage in AI. Each part has to be right for the model to move on to the next stage, and hence the final response is better. This is inference-based compute. Benchmark-based testing of these new models has shown that they generate better output, more so in maths and similar tasks with clear final answers. "What if these higher-quality outputs were to be used for new training data?" This opens immense possibilities. They would become a stockpile of new information, ready to be used as high-quality input for training other AI models to produce even better results. A new approach to synthetic data is thus being worked upon. For now, the AI juggernaut is pushing ahead with full might.
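The process described above can be sketched in a few lines of code. This is a minimal illustration, not any lab's actual system: `ask_model` is a hypothetical stand-in for a real LLM inference call, and the arithmetic query is an assumed example. The point it shows is the shape of test-time compute: one query sliced into sub-tasks, one fresh inference request per sub-task, and a check on each step before the chain moves on.

```python
def ask_model(prompt: str) -> str:
    """Hypothetical stand-in for one LLM inference request.

    In a real system this would be an API call; here it just evaluates
    the arithmetic expression embedded in the prompt.
    """
    expr = prompt.removeprefix("Compute: ")
    return str(eval(expr, {"__builtins__": {}}, {}))

def solve_with_test_time_compute(subtasks: list[str]) -> list[str]:
    """Run one new inference request per sub-task, verifying each step."""
    answers = []
    for expr in subtasks:
        # Each sub-task becomes a new prompt, i.e. a new request
        # at the inference stage.
        answer = ask_model(f"Compute: {expr}")
        # Each part has to be right before the model moves on.
        assert answer.lstrip("-").isdigit(), f"step failed: {expr}"
        answers.append(answer)
    return answers

# The query "what is (2+3)*4 minus 7?" sliced into smaller tasks:
steps = ["2+3", "(2+3)*4", "(2+3)*4-7"]
print(solve_with_test_time_compute(steps))  # → ['5', '20', '13']
```

The extra compute is spent at answer time rather than training time: three model calls instead of one, traded for a chain of individually verified steps.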
AI WILL HAVE TO KEEP FINDING USEFUL WAYS TO GET AROUND ITS UNENDING HUNGER FOR DATA TO KEEP DELIVERING ASTONISHING RESULTS.
Sanjay Sahay