Scrape anything with AI... and Python
FetchFox can extract any data from any website with just a prompt.

pip install fetchfox-sdk
Quick start
Start scraping in under 1 minute with these examples

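A minimal first scrape might look something like this (a sketch using the same extract call shown in the examples further down; the URL and prompt are just placeholders):

from fetchfox_sdk import FetchFox

# Initialize the client with your API key.
fox = FetchFox(api_key="YOUR_API_KEY_HERE")

# Describe what you want in plain English and let the AI do the extraction.
items = fox.extract(
    "https://news.ycombinator.com",
    {"title": "Find me all the titles of the posts."},
    limit=5)

print(items[0])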
βš”οΈ
Battle Tested in Production

The FetchFox AI scraper is battle tested: it powers our production-facing application at fetchfox.ai every day, running thousands of user scrapes. The Python SDK is fully supported by our monitoring and SRE team.

We've worked out the weird edge cases that make scraping hard: figuring out proxies, handling dynamic pages, caching, retries, timeouts, and more. That means you can scrape in minutes using just a prompt and Python.

Web scraping with HTML + CSS + Python = Parsing nightmare πŸ˜“
Web scraping with OpenAI directly = Proxy and captcha nightmare 😭
Web scraping with FetchFox = Quick and Easy πŸ˜ƒ
Ethan Brooks
Software Engineer, Pinterest
πŸ’β€β™€οΈ

Extract with English

No more CSS, XPath, or coding to extract data. Just lay out your schema in plain English, and the AI figures it out.
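For instance, a schema is just field names mapped to plain-English descriptions (a sketch; the site and field prompts are illustrative):

from fetchfox_sdk import FetchFox

fox = FetchFox(api_key="YOUR_API_KEY_HERE")

# No selectors: each key is a field name, each value is a description of it.
books = fox.extract(
    "https://books.toscrape.com/",
    {
        "title": "The title of the book",
        "price": "The listed price of the book"
    },
    limit=5)

print(books[0])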
πŸ—ΊοΈ

Crawl Entire Sites

FetchFox can follow URLs that it finds, so you can navigate flexibly.
πŸ₯§

Pagination? It's NBD

We figure out how to navigate multiple pages of results, including infinite scroll.
🏎️

Concurrent and Fast

Run large scraping workflows, and we'll take care of concurrency to provide your results quickly.
🀠

Never Get Blocked Again

We handle proxying, captchas and bot blocks. If it's on the public web, we scrape it.
πŸͺ™

Pay As You Go

Only pay for your usage, with no monthly commitment. Our pricing is clear and simple. Pricing Details Β»

Two Level Scrapes

Put in a starting URL, and then FetchFox can crawl one, two, three, or more levels deep.
from fetchfox_sdk import FetchFox
fox = FetchFox(api_key="YOUR_API_KEY_HERE")

items = fox \
    .extract(
        "https://pokemondb.net/pokedex/national",
        {
            "name": "Pokemon name",
            "url": "Pokemon URL"
        }) \
    .extract(
        {
            "hp": "Pokemon HP"
        })

# FetchFox can follow the URLs it finds. In this example,
# the "url" found in the first step is followed and opened
# in the second step.

# The results are combined, like this:

# {
#   'name': 'Charmander',
#   'url': 'https://pokemondb.net/pokedex/charmander',
#   'hp': '39'
# }
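The combined items can then be used like any other workflow results, for example (a sketch, assuming the indexing and export calls shown elsewhere on this page):

# Index into the combined results, or export them straight to a file.
print(items[0])
items.export("pokemon.csv")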

Filter and Export

We provide AI powered filtering, and easy data export to JSON and CSV.
from fetchfox_sdk import FetchFox
fox = FetchFox(api_key="YOUR_API_KEY_HERE")

fox \
    .extract(
        "https://www.bicyclebluebook.com/marketplace/buy-now",
        {"url": "Find me the links to each detail page for a bicycle for sale."},
        limit=10) \
    .extract({
        "full_description": "Find the seller's entire textual description.",
        "frame_size": "Find the size of the bicyle frame.",
        "price": "What price is the seller asking for the bike?",
        "MSRP": "What is the original MSRP price of the bike, as given here?"}) \
    .filter(
        "Exclude any items where the price is firm.") \
    .export("bikes.jsonl") \
    .export("bikes.csv")

Concurrent Execution

For individual workflows, enjoy automatic concurrent execution on our backend. If you want to run multiple workflows concurrently, that's easy too:
from fetchfox_sdk import FetchFox
fox = FetchFox(api_key="YOUR_API_KEY_HERE")

top_posts_on_hn = \
    fox.extract(
        "https://news.ycombinator.com",
        {"title": "Find me all the titles of the posts."},
        limit=10)

top_posts_on_slashdot = \
    fox.extract(
        "https://news.slashdot.org/",
        {"title": "Find me all the titles of the posts."},
        limit=10)

top_posts_on_hn_future = top_posts_on_hn.results_future()
top_posts_on_slashdot_future = top_posts_on_slashdot.results_future()
# These variables hold standard Python concurrent.futures.Futures

hn_posts_results = top_posts_on_hn_future.result()
slashdot_posts_results = top_posts_on_slashdot_future.result()
# This will block until they all finish.

print(hn_posts_results[0])
print(top_posts_on_slashdot[0]) # results attach to the workflow too
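
Because these are ordinary concurrent.futures.Future objects, the standard library's waiting helpers work on them too (a sketch; only the results_future() calls above are FetchFox-specific):

from concurrent.futures import wait

# Block until both workflows finish, in whichever order they complete.
done, not_done = wait([top_posts_on_hn_future, top_posts_on_slashdot_future])
for future in done:
    print(len(future.result()), "posts scraped")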
Check us out on GitHub!