Navigating the Shadows: An Overview of DarkBERT's Role in Dark Web Exploration

Learn the Language Model for the Dark Side of the Internet

Cornellius Yudha Wijaya
May 19, 2023
Photo by benjamin lehman on Unsplash

We often hear about the Dark Web, but few of us ever try to access it. Put simply, the Dark Web is the part of the Internet that is not indexed by web search engines (Google, Bing, etc.) and cannot be reached through a standard web browser.

The Dark Web is a hidden part of the Internet where people can act without being easily identified. That anonymity attracts illegal activity, such as drug sales and the trading of stolen information, which is why many researchers are interested in studying what happens there.

One line of research on the Dark Web involved creating a language model pre-trained on a Dark Web dataset, known as DarkBERT. The model was proposed by Jin et al. (2023) to handle cyber threats on the Dark Web, since typical pre-trained language models are not well suited to this environment.

How does DarkBERT function, and how can it assist us in navigating the murky waters of the Dark Web? Let's dive into the fascinating mechanics of this innovative tool.


DarkBERT Construction

As mentioned above, DarkBERT is a language model trained on a Dark Web corpus for tasks related to underground activity. The model builds on the existing RoBERTa architecture, which drops the Next Sentence Prediction task during pre-training and relies only on masked language modeling; that objective is a better fit for a domain-specific corpus like this one.
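As a rough illustration, here is a minimal sketch of that kind of starting point using the Hugging Face transformers library: initializing a masked-language-model head from the public roberta-base checkpoint. The checkpoint name and the example text are assumptions for illustration, not the authors' exact setup.

```python
# Minimal sketch: start from the public RoBERTa checkpoint and use only the
# masked-LM objective (no Next Sentence Prediction), as RoBERTa-style
# pre-training does. Checkpoint name and text are illustrative assumptions.
from transformers import RobertaForMaskedLM, RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

# A single sequence per example is enough; there is no sentence-pair
# (NSP) input to construct for this objective.
batch = tokenizer(
    "Example page text crawled from an onion service.",
    return_tensors="pt",
    truncation=True,
    max_length=512,
)
outputs = model(**batch)      # masked-LM forward pass
print(outputs.logits.shape)   # (1, sequence_length, vocab_size)
```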

Regarding the corpus, DarkBERT uses a dataset crawled from hidden services discovered through the hidden-service search engine Ahmia and from onion domains listed in public repositories. In total, the collected data comprises around 6.1 million pages; the page and token counts are summarized in Table 1. However, due to the sensitive nature of the work, the training dataset has not been released to the public.

Table 1: Dark Web data collection statistics (Jin et al., 2023)
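For intuition, here is a hedged sketch of how pages on onion domains could be fetched through a local Tor daemon's SOCKS proxy with Python's requests library (installed with the socks extra). The seed address, proxy port, and error handling are illustrative assumptions; the authors' actual crawler is not described here.

```python
# Hedged sketch: fetch an onion page through a local Tor SOCKS proxy.
# Requires `pip install requests[socks]` and a Tor daemon listening on port 9050.
# The seed URL below is a hypothetical placeholder, not a real address.
import requests

TOR_PROXY = {
    # "socks5h" makes the .onion name resolve inside the Tor network
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

def fetch_onion_page(url: str, timeout: int = 60) -> str | None:
    """Return the raw HTML of an onion page, or None if the request fails."""
    try:
        response = requests.get(url, proxies=TOR_PROXY, timeout=timeout)
        response.raise_for_status()
        return response.text
    except requests.RequestException:
        return None

# Real seed lists would come from Ahmia and public onion-domain repositories.
html = fetch_onion_page("http://exampleonionaddressplaceholder.onion/")
print("fetched" if html else "failed")
```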

Ultimately, the model is trained on two versions of the dataset, the raw one and a preprocessed one, and training on both takes around 15 days. To summarize the construction of DarkBERT, we can refer to the image below.
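As a code-level companion to that construction overview, here is a minimal, hedged sketch of the masked-LM pre-training step, assuming the crawled pages are already available as plain text. The tiny in-memory dataset, checkpoint name, and hyperparameters are all placeholders; the real corpus is not public and, as noted above, full training took around 15 days.

```python
# Hedged sketch of RoBERTa-style masked-LM pre-training on crawled page text.
# Everything here (texts, batch size, epochs, output dir) is a placeholder.
from datasets import Dataset
from transformers import (DataCollatorForLanguageModeling, RobertaForMaskedLM,
                          RobertaTokenizerFast, Trainer, TrainingArguments)

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

# Stand-in for the (raw or preprocessed) Dark Web corpus.
corpus = Dataset.from_dict({"text": [
    "placeholder page text one",
    "placeholder page text two",
]})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking of 15% of tokens, matching the RoBERTa-style objective.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="darkbert-sketch",
    per_device_train_batch_size=2,
    num_train_epochs=1,
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=tokenized,
)
trainer.train()
```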
