Enhancing Web Crawling Capabilities with LLMs and an Open-Source Python Library
Leverage AI technology in your workflow
🔥Reading Time: More than 3 Minutes🔥
🔥Benefit Time: A lot of Time🔥
Web crawling is the process of systematically browsing the web to collect data, usually with automated programs called web crawlers.
We won't go deep into the technical background here, but it's worth noting that web crawling is one of the most in-demand data collection projects. Many companies pay a lot of money for data that gives them a competitive advantage, and web crawling is one way to acquire it.
Well-known Python web-crawling libraries such as Scrapy and Beautiful Soup have existed for years, yet technological advances keep adding new enhancements to the crawling process.
Here, we will discuss using Large Language Models (LLMs) to enhance our web crawling process.
Let’s get into it.
Web Crawling with LLMs
Large Language Models (LLMs) have been used for many text generation tasks, and recently they have made their way into web crawling as well.
Several companies have built paid LLM-powered crawling solutions, but we will use an open-source library, as I find it works well.
In this case, we will use Crawl4AI, an open-source library for crawling the web with LLMs.
First, we install the Python library.
pip install "crawl4ai @ git+https://github.com/unclecode/crawl4ai.git"
Once the library is installed, we can initialize the WebCrawler object with the following code.
from crawl4ai import WebCrawler

def create_crawler():
    crawler = WebCrawler(verbose=True)
    crawler.warmup()
    return crawler

crawler = create_crawler()
Classic Web Crawling
Next, we will use Crawl4AI for data extraction. Out of the box, it can already be used as-is for data collection. In this tutorial, we will scrape the Yahoo Finance Tech news page. To do that, we can use the code below.
result = crawler.run(url="https://finance.yahoo.com/topic/tech/")
It might look simple, but the crawling process is already quite powerful. Crawl4AI provides structured extraction, so we do not need to preprocess the data upfront. The data may still require further cleaning, but it is ready for the next step.
You can see every field you can access on the result object with the following call.
print(result.model_dump().keys())
dict_keys(['url', 'html', 'success', 'cleaned_html', 'media', 'links', 'screenshot', 'markdown', 'extracted_content', 'metadata', 'error_message'])
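For a quick sanity check, you can read a few of these fields directly. Here is a minimal sketch using only the keys listed above:

# Minimal sketch: confirm the crawl succeeded, then peek at the results.
if result.success:
    print(result.metadata)           # page metadata dictionary
    print(result.markdown[:500])     # first 500 characters of the Markdown render
else:
    print(result.error_message)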
For example, you can access the extracted content in its default structure.
import json

res = json.loads(result.extracted_content)
print(json.dumps(res, indent=4))
# Example output
{
    "index": 5,
    "tags": [],
    "content": " * Health\n * COVID-19 \n * Fall allergies \n * Health news \n * Mental health \n * Relax \n * Sexual health \n * Studies \n * The Unwind \n * Parenting\n * Family health \n * So mini ways \n * Style and beauty\n * It Figures \n * Unapologetically \n * Horoscopes\n * Shopping\n * Buying guides \n * Food\n * Travel\n * Autos\n * Gift ideas\n * Buying guides"
},
It’s also possible to list the links the crawler found on the page.
result.links['internal'][:3]
[{'href': 'https://finance.yahoo.com/', 'text': 'Finance'},
{'href': 'https://finance.yahoo.com/portfolios', 'text': 'My Portfolio'},
{'href': 'https://finance.yahoo.com/news/', 'text': 'News'}]
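Crawl4AI groups links into internal and external sets, so separating on-site URLs from off-site ones is straightforward. A small sketch, assuming the 'external' key as the counterpart of the 'internal' key shown above:

# Collect the URLs from both link groups.
internal_urls = [link['href'] for link in result.links['internal']]
external_urls = [link['href'] for link in result.links.get('external', [])]
print(len(internal_urls), "internal,", len(external_urls), "external")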
What's great about Crawl4AI is that it can extract all the media from the website, and the crawler can even generate a description when no alt text is available.
result.media['images'][:1]
[{'src': 'https://s.yimg.com/uu/api/res/1.2/lPxYqS9VeL_inenJk6zCkA--~B/Zmk9c3RyaW07aD01NztxPTgwO3c9NzY7YXBwaWQ9eXRhY2h5b24-/https://s.yimg.com/os/creatr-uploaded-images/2023-10/7b002f10-63c9-11ee-bf9e-9e7255be870b.cf.webp',
'alt': 'Retailers are pointing to cracks in the consumer, and it may be spooking investors.',
'desc': "Retail industry shows warnings of 'cash-strapped' US consumers Retailers are pointing to cracks in the consumer, and it may be spooking investors. Yahoo Finance • 23 hours ago LULU +0.18% ULTA -4.01%",
'score': 3,
'type': 'image'}]
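Since every image carries a relevance score, you can filter out the noise before storing anything. A minimal sketch using the fields shown above; the threshold of 2 is an arbitrary choice, not a library default:

# Keep only images the crawler scored as reasonably relevant.
relevant_images = [
    img for img in result.media['images']
    if img.get('score', 0) >= 2 and img.get('alt')
]
print(len(relevant_images), "relevant images found")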
Various Classic Strategies for Web Crawling
Crawl4AI also provides several strategies we can employ to sharpen the extraction results.
First, it's possible to use a chunking strategy that splits the data on regex patterns.
from crawl4ai.chunking_strategy import RegexChunking

result = crawler.run(
    url="https://finance.yahoo.com/",
    chunking_strategy=RegexChunking(patterns=["\n\n"])
)
Or we can use an NLP technique to split the data.
from crawl4ai.chunking_strategy import NlpSentenceChunking

result = crawler.run(
    url="https://finance.yahoo.com/",
    chunking_strategy=NlpSentenceChunking()
)
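Whichever chunking strategy you choose, the chunks land in extracted_content as JSON, assuming the same index/tags/content shape we saw in the earlier output. A quick sketch for inspecting them:

import json

# Parse the chunks and peek at the first one.
chunks = json.loads(result.extracted_content)
print(len(chunks), "chunks produced")
print(chunks[0]["content"][:200])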
We can get much more precise extraction with cosine similarity and a few controlled parameters. Roughly: word_count_threshold filters out short text blocks, max_dist bounds how dissimilar members of a cluster can be, linkage_method sets the hierarchical-clustering linkage, and top_k caps how many clusters are returned.
from crawl4ai.extraction_strategy import CosineStrategy

result = crawler.run(
    url="https://finance.yahoo.com/",
    extraction_strategy=CosineStrategy(
        word_count_threshold=10,
        max_dist=0.2,
        linkage_method="ward",
        top_k=3
    )
)
We can also pass a semantic_filter parameter to the CosineStrategy object to extract only content related to a specific topic.
result = crawler.run(
    url="https://finance.yahoo.com/",
    extraction_strategy=CosineStrategy(
        semantic_filter="Stock Price"
    )
)
Web Scraping with LLMs
Let’s get into the main part of this piece: using an LLM for a better web scraping methodology.
Crawl4AI uses litellm in the background to connect the web crawler to the LLM provider, so every model available through litellm is also available in Crawl4AI.
I will use Gemini 1.5 Flash as the main LLM for this tutorial. If you don’t have a Gemini API key yet, you can get one for free from Google’s website. At the time of writing, Gemini 1.5 Flash provides a free-to-use API with a powerful, longer context window.
Let’s start setting up the crawler with the LLM.
from crawl4ai.extraction_strategy import LLMExtractionStrategy
import os

result = crawler.run(
    url="https://finance.yahoo.com/",
    extraction_strategy=LLMExtractionStrategy(
        provider="gemini/gemini-1.5-flash",
        api_token=os.getenv("GEMINI_API_KEY")  # read your key from an environment variable
    )
)
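Before adding a schema, it's worth sanity-checking the run. A minimal sketch that previews whatever the LLM extracted:

# Preview the raw LLM extraction before imposing any structure.
if result.success:
    print(result.extracted_content[:300])
else:
    print(result.error_message)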
By default, the extraction will be similar to our previous crawling process. However, the process above still does not fully utilize the LLM's power. Let's add more detailed instructions and a schema for how we want the extraction to be structured.
from pydantic import BaseModel, Field

class PageSummary(BaseModel):
    title: str = Field(..., description="Title of the page.")
    summary: str = Field(..., description="Summary of the page.")
    brief_summary: str = Field(..., description="Brief summary of the page.")
    keywords: list = Field(..., description="Keywords assigned to the page.")

url = "https://finance.yahoo.com/"

result = crawler.run(
    url=url,
    word_count_threshold=1,
    extraction_strategy=LLMExtractionStrategy(
        provider="gemini/gemini-1.5-flash",
        api_token=os.getenv("GEMINI_API_KEY"),
        schema=PageSummary.model_json_schema(),
        extraction_type="schema",
        apply_chunking=False,
        instruction="From the crawled content, extract the following details: "
                    "1. Title of the page "
                    "2. Summary of the page, which is a detailed summary "
                    "3. Brief summary of the page, which is a paragraph text "
                    "4. Keywords assigned to the page, which is a list of keywords. "
                    'The extracted JSON format should look like this: '
                    '{ "title": "Page Title", "summary": "Detailed summary of the page.", '
                    '"brief_summary": "Brief summary in a paragraph.", '
                    '"keywords": ["keyword1", "keyword2", "keyword3"] }'
    ),
    bypass_cache=True,
)

page_summary = json.loads(result.extracted_content)
print(json.dumps(page_summary, indent=4))
The example output we have is shown below.
{
    "title": "Yahoo Finance",
    "summary": "Yahoo Finance is a comprehensive financial website that provides news, data, and tools for investors and traders. It offers a wide range of features, including real-time stock quotes, market analysis, investment research, personal finance advice, and sports news. The website is organized into several sections, including News, Life, Entertainment, Finance, and Sports. The News section covers a variety of topics, including US, Politics, World, Tech, Climate change, Health, Science, and the 2024 election. The Life section features content on Health, Parenting, Style and beauty, Horoscopes, Shopping, Food, Travel, Autos, and Gift ideas. The Entertainment section includes Celebrity, TV, Movies, Music, How to Watch, Interviews, Videos, and Shopping. The Finance section provides access to My Portfolio, News, Markets, Research, Personal Finance, and Videos. The Sports section covers Fantasy, Daily fantasy, NFL, MLB, NBA, NHL, Soccer, College football, and more. The website also features a Trending Tickers section that displays the most popular stocks and cryptocurrencies. Yahoo Finance is a valuable resource for anyone interested in finance, investing, or staying up-to-date on the latest news.",
    "brief_summary": "Yahoo Finance is a comprehensive financial website offering news, data, and tools for investors and traders. It covers a wide range of topics, including finance, investing, news, sports, and lifestyle, providing real-time stock quotes, market analysis, investment research, personal finance advice, and more.",
    "keywords": [
        "finance",
        "investing",
        "stock market",
        "news",
        "markets",
        "trading",
        "crypto",
        "personal finance",
        "sports",
        "lifestyle"
    ],
    "error": false
},
The crawling output now follows the instructions we set. Rather than crude raw data, we already have data that is ready for further analysis. By combining classic crawling with an LLM, we get accurate data collection for our project.
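To make that combination concrete, here is a minimal sketch of one way to wire the two steps together: a plain crawl collects the article links, and the schema-based LLM extraction then summarizes each page. The link slice and the instruction string are my own illustrative choices, not part of the Crawl4AI API:

# Hypothetical two-step pipeline: classic crawl for links, LLM extraction per page.
listing = crawler.run(url="https://finance.yahoo.com/topic/tech/")
article_urls = [link["href"] for link in listing.links["internal"][:5]]  # in practice, filter to article URLs

summaries = []
for url in article_urls:
    page = crawler.run(
        url=url,
        extraction_strategy=LLMExtractionStrategy(
            provider="gemini/gemini-1.5-flash",
            api_token=os.getenv("GEMINI_API_KEY"),
            schema=PageSummary.model_json_schema(),
            extraction_type="schema",
            instruction="Summarize the page following the PageSummary schema.",
        ),
        bypass_cache=True,
    )
    if page.success:
        summaries.append(json.loads(page.extracted_content))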
That’s all for how you can use a web crawler with an LLM🔥!
What do you think? Is there anything here you want me to dig into even deeper? Share and discuss it in the comments below.👇
Articles to Read
Here are some of my latest articles you might have missed this week, along with a few other pieces I think are worth your time.
Everything You Need to Know About the Hugging Face Model Hub and Community in Machine Learning Mastery
5 Tips for Getting Started with Deep Learning in Machine Learning Mastery
A Beginner’s Guide to PyTorch in KDnuggets
Personal Notes
This week has actually been one of the hardest of my personal life, as my companion of 11 years passed away. I am still trying to cope with losing one of the best feline friends, who kept me strong all these years.
It’s a little heavy for a newsletter, but I wanted to share it with you. Life is short, so enjoy every moment to the fullest!
Nevertheless, please look forward to my next post!