Open Source · Written in Go

Scrape the web
like a pro.

ScrapeGoat is a powerful toolkit that lets you crawl websites, pull out the data you need, and even use AI to understand it, all from your terminal.

$ scrapegoat crawl https://quotes.toscrape.com --depth 2
INFO starting crawl seeds=[https://quotes.toscrape.com] depth=2
INFO fetching url=https://quotes.toscrape.com status=200
INFO discovered links=50 items=10

✅ Crawl complete in 3.2s
   Requests: 52 sent, 0 failed
   Items: 47 scraped, 0 dropped
   Output: ./output
Plain English

Wait, what does this thing actually do?

Let's break it down without the tech jargon.

Think of it like a super-smart browser.

You know how you visit a website, click links, and read pages? ScrapeGoat does that automatically, but really fast and at scale. It can visit thousands of pages, grab the information you care about, and save it neatly in files you can actually use.

Whether you're a developer building a data pipeline, a researcher collecting articles, or someone who just wants to grab product prices, ScrapeGoat has you covered.

🔗

Crawl

Follow links across pages, just like browsing, but automatic.

📋

Extract

Pick out titles, prices, text: whatever data matters to you.

🔍

Search

Build a searchable index of any website's content.

🤖

AI Power

Summarize pages and find names, places, and sentiment.

What's Inside

Packed with powerful features

Everything you need for serious web scraping, out of the box.

⚡

Blazing Fast Crawling

Run up to hundreds of workers at once. ScrapeGoat handles concurrency, throttling, and retries so your crawls finish fast without crashing target sites.

🎯

Smart Data Extraction

Use CSS selectors, XPath, or regex patterns to grab exactly the data you need. Supports JSON-LD, OpenGraph, and Twitter Cards out of the box.
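Of the three strategies, regex is the bluntest (CSS selectors and XPath cope better with markup changes) but it's the easiest to sketch with nothing beyond the standard library. A toy example of the idea, not ScrapeGoat's parser:

```go
package main

import (
	"fmt"
	"regexp"
)

// extractAll returns the first capture group of every match of pattern in html.
func extractAll(html, pattern string) []string {
	re := regexp.MustCompile(pattern)
	var out []string
	for _, m := range re.FindAllStringSubmatch(html, -1) {
		out = append(out, m[1])
	}
	return out
}

func main() {
	page := `<span class="text">Quote one</span><span class="text">Quote two</span>`
	quotes := extractAll(page, `<span class="text">([^<]+)</span>`)
	fmt.Println(quotes) // [Quote one Quote two]
}
```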

🔍

Search Engine Mode

Index any website with full-text content, headings hierarchy, metadata, and link graphs. Build your own mini search engine.
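Search-engine mode boils down to an inverted index: a map from each token to the set of pages containing it. A toy stdlib version to show the shape (ScrapeGoat's real index also tracks headings, metadata, and link graphs):

```go
package main

import (
	"fmt"
	"strings"
)

// Index maps a lowercased token to the set of page URLs containing it.
type Index map[string]map[string]bool

// Add tokenizes text and records that url contains each token.
func (ix Index) Add(url, text string) {
	for _, tok := range strings.Fields(strings.ToLower(text)) {
		if ix[tok] == nil {
			ix[tok] = make(map[string]bool)
		}
		ix[tok][url] = true
	}
}

// Search returns the URLs whose indexed text contained the term.
func (ix Index) Search(term string) []string {
	var urls []string
	for u := range ix[strings.ToLower(term)] {
		urls = append(urls, u)
	}
	return urls
}

func main() {
	ix := Index{}
	ix.Add("https://go.dev", "The Go programming language")
	ix.Add("https://go.dev/doc", "Go documentation and tutorials")
	fmt.Println(len(ix.Search("go"))) // 2
}
```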

🧠

AI-Powered Analysis

Automatically summarize pages, extract named entities (people, places, organizations), and detect sentiment, using Ollama or OpenAI.

💾

Multiple Output Formats

Export your data as JSON, JSONL (streaming), or CSV. Files are written atomically so nothing gets corrupted mid-crawl.

🔄

Proxy Rotation

Route requests through rotating proxies with round-robin or random selection. Auto-rotates on failure and includes health checking.
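Round-robin selection itself is tiny: an atomic counter modulo the proxy list. A minimal concurrency-safe sketch (names here are hypothetical; ScrapeGoat additionally health-checks proxies and rotates away from failures):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Rotator hands out proxy URLs round-robin; safe for concurrent use.
type Rotator struct {
	proxies []string
	next    atomic.Uint64
}

// Next returns the next proxy in sequence, wrapping around the list.
func (r *Rotator) Next() string {
	n := r.next.Add(1) - 1
	return r.proxies[n%uint64(len(r.proxies))]
}

func main() {
	r := &Rotator{proxies: []string{"http://p1:8080", "http://p2:8080"}}
	for i := 0; i < 3; i++ {
		fmt.Println(r.Next()) // p1, p2, then wraps back to p1
	}
}
```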

⏸️

Pause & Resume

Checkpoint your crawl progress automatically. If your process stops, pick up right where you left off with no duplicate work.

📊

Prometheus Metrics

Built-in /metrics and /health endpoints. Monitor requests, errors, bytes downloaded, and active workers in real-time.

Let's Go

Up and running in 3 minutes

You need Go 1.21+ and a terminal. That's it.

1

Clone & Build

Grab the code from GitHub and compile the binary.

bash
git clone https://github.com/IshaanNene/ScrapeGoat
cd ScrapeGoat
make build
2

Run Your First Crawl

Point ScrapeGoat at any website and watch it go.

bash
./bin/scrapegoat crawl https://quotes.toscrape.com --depth 2
3

Check Your Results

Your scraped data is waiting in the output folder, neatly formatted as JSON.

bash
cat output/crawl_*.json | head -50
4

Try Search or AI Mode

Go beyond basic crawling: index for search or analyze with AI.

bash
# Build a searchable index of a website
./bin/scrapegoat search https://go.dev --depth 2

# Or use AI to summarize + analyze content
./bin/scrapegoat ai-crawl https://news.ycombinator.com
Reference

CLI Commands

Everything you can do from your terminal, explained simply.

scrapegoat crawl

What it does: Visits a website, follows links, and downloads pages. If you've set up parse rules (in a YAML config), it also extracts specific data from each page.

bash
# Simple crawl
./bin/scrapegoat crawl https://quotes.toscrape.com --depth 2

# Stay within one domain, limit pages
./bin/scrapegoat crawl https://en.wikipedia.org/wiki/Web_scraping \
  --depth 1 --max-requests 30 --allowed-domains en.wikipedia.org

# High performance mode
./bin/scrapegoat crawl https://example.com \
  --concurrency 20 --delay 200ms --format jsonl
Flag                Short  Default   What it does
--depth             -d     3         How many links deep to follow
--concurrency       -n     10        Number of pages to fetch in parallel
--delay                    1s        Wait time between requests to the same site
--format            -f     json      Output format: json, jsonl, or csv
--output            -o     ./output  Where to save the results
--max-requests      -m     0         Stop after this many requests (0 = no limit)
--max-retries              3         How many times to retry a failed page
--allowed-domains          all       Only crawl these domains (comma-separated)
--user-agent               built-in  Custom browser identity string
--verbose           -v     off       Show detailed logs for every request

scrapegoat ai-crawl

What it does: Crawls a website and then sends each page's content to an AI model for analysis. You get back: a summary (~200 words), named entities (people, organizations, locations), and sentiment analysis (positive/negative/neutral).

bash
# With Ollama (local, free, no API key needed)
ollama serve &
ollama pull llama3.2
./bin/scrapegoat ai-crawl https://news.ycombinator.com

# With OpenAI
OPENAI_API_KEY=sk-... ./bin/scrapegoat ai-crawl https://techcrunch.com \
  --llm openai --model gpt-4o-mini

# Any OpenAI-compatible API
./bin/scrapegoat ai-crawl https://example.com \
  --llm custom --llm-endpoint http://localhost:8080 --model mistral
Flag             Default  What it does
--depth          2        Link depth to follow
--concurrency    5        Parallel workers (lower for AI to avoid overload)
--delay          500ms    Wait time between requests
--max-pages      50       Maximum pages to analyze
--llm            ollama   AI provider: ollama, openai, or custom
--model          —        Which AI model to use (e.g., llama3.2, gpt-4o-mini)
--llm-endpoint   —        Custom API endpoint URL

Utility Commands

Quick helpers that don't require a URL.

bash
# Show the version
./bin/scrapegoat version

# Show current configuration
./bin/scrapegoat config

# Use a custom config file
./bin/scrapegoat crawl https://example.com --config configs/default.yaml
For Developers

Use it as a Go Library

Embed ScrapeGoat directly into your Go application; no CLI needed.

Go
package main

import (
    "fmt"
    "time"
    scrapegoat "github.com/IshaanNene/ScrapeGoat/pkg/scrapegoat"
)

func main() {
    crawler := scrapegoat.NewCrawler(
        scrapegoat.WithConcurrency(5),
        scrapegoat.WithMaxDepth(2),
        scrapegoat.WithDelay(500 * time.Millisecond),
        scrapegoat.WithOutput("json", "./output"),
    )

    // Extract quotes from each page
    crawler.OnHTML(".quote", func(e *scrapegoat.Element) {
        e.Item.Set("quote", e.Selection.Find(".text").Text())
        e.Item.Set("author", e.Selection.Find(".author").Text())
    })

    crawler.Start("https://quotes.toscrape.com")
    crawler.Wait()
    fmt.Println("Done!", crawler.Stats())
}
WithConcurrency(n)

How many pages to fetch in parallel

WithMaxDepth(d)

Maximum link depth to crawl

WithDelay(d)

Wait time between requests to be polite

WithOutput(fmt, path)

Output format (json/jsonl/csv) and directory

WithAllowedDomains(...)

Only crawl pages on these domains

WithProxy(...)

Route through these proxy servers

WithRobotsRespect(true)

Follow robots.txt rules (on by default)

WithMaxRequests(n)

Stop after N total requests

Customize

Configure Everything

Use a YAML file to set defaults, or override anything with CLI flags.

yaml
# configs/default.yaml
engine:
  concurrency: 10
  max_depth: 5
  request_timeout: 30s
  politeness_delay: 1s
  respect_robots_txt: true
  max_retries: 3

storage:
  type: json
  output_path: ./output

proxy:
  enabled: false
  rotation: round_robin

metrics:
  enabled: false
  port: 9090

βš™οΈ Engine

Workers, depth, timeouts, retries, user-agents, domain filters

🌐 Fetcher

HTTP or headless browser mode, redirects, body size limits

💾 Storage

Output format (JSON/JSONL/CSV), path, batch size

🔄 Proxy

Enable rotation, add proxy URLs, health checks

📊 Metrics

Prometheus endpoint for real-time monitoring

Under the Hood

How ScrapeGoat Works

A modular pipeline where each piece does one job well.

Input Layer: CLI Commands · Go SDK · YAML Config
⬇
Core Engine: Scheduler · Frontier · Dedup · Robots · Checkpoint
⬇
Fetch Layer: HTTP Fetcher · Browser Fetcher · Proxy Rotation · Stealth Mode
⬇
Parse & Process: CSS Parser · XPath Parser · Regex Parser · Pipeline · AI (LLM)
⬇
Output Layer: JSON · JSONL · CSV · Prometheus
Ready to Run

Example Scrapers

Pre-built examples you can run instantly, with no configuration needed.

📰

Hacker News

Scrape top stories with rank, title, URL, points, author, and comments count.

bash
go run ./examples/hackernews/
🛒

E-Commerce

Extract product titles, prices, ratings, and stock status from books.toscrape.com.

bash
go run ./examples/ecommerce/
πŸ™

GitHub Trending

Get trending repos with name, description, language, stars, and forks.

bash
go run ./examples/github/
📚

Wikipedia

Deep crawl Wikipedia articles: titles, summaries, categories, references.

bash
go run ./examples/wikipedia/
Questions?

Frequently Asked Questions

Do I need to know Go to use ScrapeGoat?

Nope! The CLI tool works from any terminal. Just run commands like scrapegoat crawl <url>, no Go code required. The Go SDK is only needed if you want to embed scraping into your own Go applications.

Is web scraping legal?

Web scraping is generally legal for publicly available data, but you should always check a website's Terms of Service and robots.txt file. ScrapeGoat respects robots.txt by default to be a good citizen of the web.

Do the AI features cost money?

Not if you use Ollama! Ollama runs AI models locally on your machine: completely free, no API keys, no cloud. If you want to use OpenAI models, then yes, you'll need an API key.

What happens if my crawl is interrupted?

ScrapeGoat saves checkpoints of its progress automatically (every 60 seconds by default). If your crawl stops, you can resume from the last checkpoint instead of starting over. It also handles Ctrl+C gracefully, saving state before shutting down.

How does ScrapeGoat avoid overloading websites?

ScrapeGoat has several built-in safeguards: per-domain politeness delays (default 1 second between requests), User-Agent rotation, proxy support, and robots.txt compliance. You can also configure custom delay values and use the --allowed-domains flag to limit your crawl scope.

Can it scrape JavaScript-heavy sites?

Yes! ScrapeGoat includes a headless browser fetcher powered by go-rod. This can render JavaScript, handle dynamic content, and even includes stealth mode to avoid detection. Configure it by setting fetcher.type: browser in your YAML config.

What output formats are supported?

Three formats: JSON (pretty-printed, great for readability), JSONL (one record per line, ideal for streaming and large datasets), and CSV (spreadsheet-friendly). Use the --format flag to choose.