DailyPost 2699

The invention of the Generative Pre-trained Transformer, or GPT, was by far the biggest tech breakthrough in our long tryst with Artificial Intelligence, which has gone through ups and downs for decades together. The team that created GPT in 2018 at OpenAI, an artificial intelligence research laboratory, was led by Alec Radford. Since then, GPT has become synonymous with AI research and development. The journey continued from 2018 through GPT-3, and its then-current version, GPT-3.5, with its conversational value, took the world by storm and became a world-renowned product in no time. This global digital utility is ChatGPT.

Since then, GPT-4 has come to the market, and there is now talk that work on GPT-5 is already underway; its digital or AI contours have already become the talk of the town. Without getting into the technical capabilities or the nature of the exponential transformation that GPT-5 and beyond would bring, suffice it to say that data will be critical to these models as we move forward. The smooth ride so far is not going to continue, going by the authors' protests some time back. As the details of data-guzzling GPTs become more public, there can be backlashes of different types: contravention of laws, rules and procedures, or even skewed outcomes.

More important are issues related to the quantum of data and the nature of its availability; many a time, given the level GPT has already reached, the quality of input data can turn out to be a major concern. Altman said that GPT-5, as expected, would need more data, coming from both public sources and proprietary private datasets. OpenAI has started working in the direction of collaborating on private datasets. It is also doing nascent-level work to acquire valuable content from major publishers like the Associated Press and News Corp. While it is looking for partners on text, images, and video, it is mainly interested in “long-form writings and conversations rather than disconnected snippets” that express “human intention.” Not surprisingly, OpenAI seems to be forced to tap higher-quality sources not available publicly.

AI’s extreme data needs continue to be a sticking point. It is possible that even more data of higher quality can yield greater near-term results. Recent research suggests that smaller models fed large amounts of data perform as well as or better than larger models fed less. Data is not in infinite supply, and not any old text will do. “LLMs trained on books are much better writers than those trained on huge batches of social-media posts.” GPT-4’s scraping of the internet has already exhausted the low-hanging fruit. The doomsday scenario, estimated last year, is that the supply of publicly accessible, high-quality online data will run out by 2026. That is certainly not a long way off. One way around today’s data scarcity, at least in the near term, is to “make deals with the owners of private information hoards.”

Sanjay Sahay

Have a nice evening.
