Navigating the Shadows: An Overview of DarkBERT's Role in Dark Web Exploration
Getting to Know the Language Model for the Dark Side of the Internet
We often hear about the Dark Web, but few of us ever try to access it. If we try to define it, the Dark Web is the part of the Internet that is not indexed by web search engines (Google, Bing, etc.) and cannot be reached through a standard web browser.
Because it hides who its users are and what they do, the Dark Web attracts illegal activity, such as drug sales and the trading of stolen information. This is also why many Internet researchers are interested in learning what actually happens there.
One research effort in this direction created a language model pre-trained on a Dark Web dataset, called DarkBERT. The model was proposed by Jin et al. (2023) to support work on cyber threats on the Dark Web, as typical pre-trained language models are not well suited to this environment.
How does DarkBERT function, and how can it assist us in navigating the murky waters of the Dark Web? Let's dive into the fascinating mechanics of this innovative tool.
DarkBERT Construction
As mentioned above, DarkBERT is a language model trained on a Dark Web corpus for tasks related to underground activity. It uses the existing RoBERTa architecture as its base model because RoBERTa drops the Next Sentence Prediction (NSP) task during pre-training and relies on masked language modeling alone, which is more beneficial for a domain-specific corpus.
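To make that concrete, here is a minimal sketch of querying a RoBERTa-style masked language model with the Hugging Face transformers library. Note that `roberta-base` stands in for DarkBERT's actual weights, which are not freely downloadable, and the example sentence is mine, not from the paper.

```python
# A minimal sketch of querying a RoBERTa-style masked language model.
# "roberta-base" stands in for DarkBERT's weights, which are not
# publicly downloadable; the example sentence is illustrative.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="roberta-base")

# RoBERTa is pre-trained with masked language modeling only (no NSP),
# so its core skill is predicting the token hidden behind <mask>.
for prediction in fill_mask("Stolen credit card data is sold on <mask> markets."):
    print(f"{prediction['token_str']:>15}  score={prediction['score']:.3f}")
```

The point of DarkBERT is that after pre-training on onion pages, these mask predictions reflect Dark Web vocabulary and context rather than general web English.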
Regarding the corpus, DarkBERT is trained on a dataset crawled from onion domains collected via the hidden-service search engine Ahmia and from public repositories of onion addresses. In total, the data comprises around 6.1 million pages (the paper's Table 1 reports detailed page and token statistics). However, due to the sensitive nature of the work, the training dataset has not been released to the public.
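The crawler itself is not public, but the basic mechanics of fetching a hidden service are standard: route HTTP requests through a local Tor SOCKS proxy. Below is a minimal sketch, assuming a Tor daemon listening on its default port 9050 and requests installed with SOCKS support (`pip install "requests[socks]"`); the .onion address is a placeholder, not one of the paper's seed domains.

```python
# A minimal sketch of fetching a hidden-service page through Tor.
# Assumes a local Tor daemon on its default SOCKS port 9050 and
# `pip install requests[socks]`. The .onion address below is a
# placeholder, not an actual seed domain from the paper.
import requests

TOR_PROXY = "socks5h://127.0.0.1:9050"  # socks5h: resolve .onion names inside Tor

def fetch_onion_page(url: str) -> str:
    response = requests.get(
        url,
        proxies={"http": TOR_PROXY, "https": TOR_PROXY},
        timeout=60,  # hidden services are often slow or unreachable
    )
    response.raise_for_status()
    return response.text

html = fetch_onion_page("http://exampleonionaddressxxxxxxxxxxxx.onion/")
print(html[:200])
```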
Ultimately, the model is trained on two versions of the dataset, the raw text and a preprocessed one, and pre-training on both takes around 15 days. To summarize DarkBERT's construction: crawl onion pages, filter and preprocess the text, then pre-train a RoBERTa-based model on the resulting corpus.
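According to the paper, the preprocessed version removes duplicate and low-information pages and masks sensitive identifiers such as email addresses, URLs, and IP addresses. Here is a simplified sketch of that kind of masking step; the regexes and mask tokens are illustrative assumptions, not the paper's exact rules.

```python
# A simplified sketch of identifier masking for the preprocessed corpus.
# The regexes and mask tokens are illustrative assumptions, not the
# paper's exact preprocessing rules.
import re

MASKS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[EMAIL]"),
    (re.compile(r"https?://\S+"), "[URL]"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[IP]"),
]

def mask_identifiers(text: str) -> str:
    # Apply each pattern in turn, replacing matches with a mask token
    # so the model never memorizes real addresses or contacts.
    for pattern, token in MASKS:
        text = pattern.sub(token, text)
    return text

print(mask_identifiers("Contact seller@example.com at http://abc.onion from 10.0.0.1"))
# -> Contact [EMAIL] at [URL] from [IP]
```

Masking like this keeps the linguistic signal of a page while stripping out the personally identifying or operationally sensitive details.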