SYNTHETIC DATA – THE MANUFACTURING OF DATA
By 2030, synthetic data will completely overshadow real data in AI models. That the hunger for data would reach such levels that real data would not be sufficient for our needs seems, on the face of it, to have no rationale. But that is the reality. The 2030 prediction of wide usage of synthetic data comes from Gartner and, going by their track record and the substantiation provided, it seems to be the trajectory of data as we move forward in the AI age. If the march of AI is to remain unstoppable, then synthetic data will play a critical role, as it seems today.
On the one hand, we talk of a huge quantum of data being generated that we are not able to manage; on the other, we find even that quantum falling well short of our needs. As we know, real data comes from direct measurements, transactions and the like, and is constrained by cost, logistics and privacy. Even some output from ChatGPT has raised privacy-related issues. Though the data is real, there are many times when human biases in the data itself lead to the output of the AI model not being acceptable. Synthetic data, to be precise, is artificially created data.
As it stands now, synthetic data is being used in conjunction with real data. The cost of real data is huge; with synthetic data, you pay only for the compute. DeepMind's AlphaGo had a component of synthetic data, and the subsequent model, AlphaGo Zero, was trained fully on synthetic data generated through self-play, and it created AI history, as we all know. Self-play with no historical data at times turns out to be the better option. But it is important that the feedback loop is correct. There is a real likelihood that AI models are forced to limit their learning because of human models.
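The self-play idea can be illustrated with a toy sketch. This is not how AlphaGo Zero works (that system pairs a neural network with tree search); it only assumes a game of tic-tac-toe in which both sides move at random, and it shows how labelled training data can be produced with compute alone and no historical games. The function names and the (position, result) record format are hypothetical, chosen for illustration.

```python
import random

# All eight three-in-a-row lines on a 3x3 board indexed 0..8.
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


def self_play_game(rng):
    """Play one game with random moves for both sides.

    Returns (positions, result), where positions is the list of board
    strings seen after each move and result is 'X', 'O' or 'draw'.
    """
    board = [" "] * 9
    positions = []
    player = "X"
    while True:
        empty = [i for i, v in enumerate(board) if v == " "]
        if not empty:
            return positions, "draw"
        board[rng.choice(empty)] = player
        positions.append("".join(board))
        won = winner(board)
        if won is not None:
            return positions, won
        player = "O" if player == "X" else "X"


def generate_dataset(n_games, seed=0):
    """Label every position from n_games self-play games with the final result."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(n_games):
        positions, result = self_play_game(rng)
        dataset.extend((p, result) for p in positions)
    return dataset


if __name__ == "__main__":
    data = generate_dataset(1000)
    print(len(data), "labelled positions generated without any human games")
```

The feedback loop here is the game's own win/draw rule, which is why it has to be correct: every label in the dataset comes from it.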
We can safely presume that human models can handle only limited complexity; AI models, conversely, can manage much more. In this venture, synthetic data has a huge role to play. It is projected that one trillion tokens will come from synthetic data. Large datasets can be created from a few human demonstrations; MimicGen has created infinite streams of data in this manner. Think of synthetic textbooks for machine learning. In some particular cases, tailored synthetic data has achieved performance levels that rival or surpass those of large language models.
SYNTHETIC DATA WOULD BE THE PROPELLANT IF ARTIFICIAL GENERAL INTELLIGENCE (AGI) IS TO BECOME A REALITY.
Have a nice evening.