DARKBERT – DARK WEB LLM
The GPT story grows thicker by the day; at times it seems set to conquer the world. Technology has always been a double-edged sword, and GPT will be no different. What it churns out for bad actors depends on their ingenuity, creativity, and never-say-die approach to achieving their goals. But while that drama unfolds unseen, positive ChatGPT use cases are also emerging on the cybersecurity front, and in the real battlefield of it all: the dark web. DarkBERT is a language model that has been trained on the fringes of the dark web.
It has barely been six months since the release of ChatGPT, so we are still early in the snowball effect it has unleashed. What the final impact of large multi-modal language models released into the wild will be is unknown. What can be predicted with certainty is that they will change, exponentially, the way the world lives. Paired with other open-source GPT models, the number of applications employing AI is exploding. It is already known that ChatGPT itself can be used to create highly advanced malware.
Applied LLMs will become the order of the day, each specializing in its own area and trained on carefully curated data for a specific purpose. One such application is making waves, and this one has been trained on data from the dark web itself. Created by South Korean researchers, DarkBERT has just arrived; its release paper also gives an overall introduction to the dark web. DarkBERT is based on the RoBERTa architecture, which was developed back in 2019. Its renaissance, so to say: RoBERTa had more performance to give than was extracted from it, because it was severely undertrained before release. Expectedly, the results back then were not promising.
Now there was a chance to make it worthwhile. To train it, "researchers crawled the Dark Web through the anonymizing firewall of the TOR network." The data then needed to be filtered, which was done by applying techniques such as deduplication, category balancing, and data pre-processing. DarkBERT is the outcome of the database used to feed it; that is how large language models get trained, and DarkBERT is no different. It has the capability to analyze a new piece of dark web content, written in the dark web's own dialects and often in the nature of heavily coded messages, and to extract useful, workable information from it. That is the unique success of this model. This is just the beginning; it has to evolve into an effective tool adding depth and expertise to our anti-dark-web operations.
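The paper names two of those filtering steps, deduplication and category balancing, without detailing them. As a minimal sketch of what such steps might look like on a toy corpus (the helper names, dictionary layout, and category labels are illustrative assumptions, not DarkBERT's actual code):

```python
import hashlib
import random

def deduplicate(pages):
    # Drop pages whose text hashes to an already-seen digest
    # (exact-duplicate removal; near-duplicate detection would need more).
    seen, unique = set(), []
    for page in pages:
        digest = hashlib.sha256(page["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(page)
    return unique

def balance_categories(pages, seed=0):
    # Downsample every category to the size of the smallest one,
    # so no single kind of page dominates the training mix.
    by_cat = {}
    for page in pages:
        by_cat.setdefault(page["category"], []).append(page)
    floor = min(len(group) for group in by_cat.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_cat.values():
        balanced.extend(rng.sample(group, floor))
    return balanced

# Hypothetical crawled pages, just to exercise the two steps.
corpus = [
    {"text": "forum post A", "category": "forum"},
    {"text": "forum post A", "category": "forum"},   # exact duplicate
    {"text": "forum post B", "category": "forum"},
    {"text": "market listing", "category": "market"},
]
clean = balance_categories(deduplicate(corpus))
```

On this toy input, deduplication removes the repeated forum post and balancing trims the forum category down to match the single market page. Real pre-processing of crawled Tor pages would add much more, such as language filtering and masking of sensitive identifiers.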
SPECIALISED LLMS ARE OUR FUTURE.