Scrapfly integration to scrape any web page as HTML, Text, or Markdown for training LLMs

Mazen · May 31, 2024, 7:58am

Scrapfly

Current LLM capabilities open the door to many use cases. One of them is using scraped data for training and for building RAG systems that make models context-aware.

Scrapfly is a web scraping API that extracts the data of any web page as Markdown or Text, formats that LLMs can consume directly. It also provides additional scraping utilities, such as proxies, anti-bot bypass, and headless browser execution. You can learn more in the official Scrapfly documentation.

I would like to add a Scrapfly integration that enables the following actions:

Scrape a web page as HTML, Markdown, or Text
Crawl a full website as Markdown or Text
Take customized screenshots of a given web page

The above actions will include the available Scrapfly API parameters. Please feel free to upvote or suggest new features!
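
To make the first action more concrete, here is a rough sketch of how the integration might build a scrape request. It is only an illustration under assumptions, not Scrapfly's confirmed API: the endpoint URL, the `key`, `url`, and `format` parameter names, and the idea that HTML is returned when no format is given should all be checked against the official Scrapfly documentation. `build_scrape_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# ASSUMPTION: endpoint and parameter names are illustrative; verify them
# against the official Scrapfly documentation before relying on them.
SCRAPFLY_SCRAPE_ENDPOINT = "https://api.scrapfly.io/scrape"

# Output formats named in this proposal.
SUPPORTED_FORMATS = ("html", "markdown", "text")


def build_scrape_url(api_key, target_url, output_format="markdown", **extra_params):
    """Build the GET URL for a single-page scrape (hypothetical helper).

    extra_params stands in for any other Scrapfly API parameters the
    action would pass through unchanged.
    """
    if output_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {output_format!r}")

    params = {"key": api_key, "url": target_url}
    # ASSUMPTION: the API returns the page HTML when no format is requested,
    # so the parameter is only sent for Markdown and Text.
    if output_format != "html":
        params["format"] = output_format
    params.update(extra_params)
    return f"{SCRAPFLY_SCRAPE_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    # Prints the request URL; sending it requires a real API key.
    print(build_scrape_url("YOUR_API_KEY", "https://example.com", "markdown"))
```

The integration would then send this URL with any HTTP client and hand the returned Markdown or Text to the downstream LLM step.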