With model after model coming out of the LLM fold, some overlap between them is to be expected, given the pace at which this technology has developed in recent years. The internet and its repositories remain a common source for LLM pre-training. A newer entrant such as DeepSeek will try to find an easier way to solve the whole problem, given that it is expected to manage with far less compute. Cost also remains a limiting factor, given the nascent state of AI funding in a large number of countries. Newer models will take advantage of earlier models in the best ways they can.
DeepSeek seems to be on an LLM release spree. It recently put out an updated version of its R1 reasoning AI model. The earlier model, released in January this year, shook the world, forcing a paradigm shift in the creation, costing and compute of AI models. The updated model performs very well on a number of math and coding benchmarks, but the source of the data used to train it has not been revealed. Given the revolving-door nature of the IT industry, it is natural that borrowing from one another has taken root in AI tools as well.
AI researchers speculate that some part of the training data could have come from Google's Gemini family of AI models. Sam Paech, a Melbourne-based developer, leans towards this view: DeepSeek's R1-0528 model prefers words and expressions similar to those Google's Gemini favours. Another researcher observes that the reasoning traces the model generates point to the same conclusion; they "read like Gemini traces." Earlier, DeepSeek's V3 often identified itself as ChatGPT, a clear suggestion that it may have been trained on ChatGPT chat logs. The Financial Times has found evidence linking DeepSeek to the use of distillation.
To be fair to DeepSeek, distillation isn't an uncommon practice. The bulk of training data comes from the open web, so contamination happens, and it is not at all easy to filter AI-generated outputs out of the training data thoroughly. To prevent distillation, AI companies are ramping up security measures. OpenAI now requires ID verification to access certain advanced models; the ID must be from one of the countries supported by OpenAI's API, and China is missing from that list. Google recently began "summarizing" the traces generated by models on its AI developer platform, which makes training performant rival models on those traces more challenging. Anthropic is also getting in on this summarizing act to protect its "competitive advantage."
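The distillation being discussed here is, at its core, the teacher-student recipe: a smaller "student" model is trained to match the softened output distribution of a larger "teacher" model. A minimal sketch in plain Python of the standard distillation loss (the logit values and temperature below are purely illustrative, not drawn from any real model):

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw logits into a probability distribution, softened by temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence between the softened teacher and student distributions.

    A higher temperature exposes the teacher's "dark knowledge" (the
    relative probabilities of wrong answers); the temperature**2 factor
    rescales gradients as in the classic distillation recipe.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return temperature ** 2 * sum(
        pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0
    )

# Illustrative logits for one token over a 3-way vocabulary.
teacher = [3.0, 1.0, 0.2]
student = [2.5, 1.2, 0.3]
print(f"distillation loss: {distillation_loss(teacher, student):.4f}")
```

In practice the student minimizes this loss (usually mixed with an ordinary cross-entropy term) over the teacher's outputs; if a rival's API exposes full logits or detailed reasoning traces, those outputs can serve as the teacher signal, which is exactly what the summarizing measures above are meant to deny.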
KEEPING LLM MODELS SECURE FROM DISTILLATION HAS BECOME A CHALLENGING TASK.
Sanjay Sahay
Have a nice evening.