Open Source · Written in Go

Scrape the web
like a pro.

ScrapeGoat is a powerful toolkit that lets you crawl websites, pull out the data you need, and even use AI to understand it, all from your terminal.

$ scrapegoat crawl https://quotes.toscrape.com --depth 2
INFO starting crawl seeds=[https://quotes.toscrape.com] depth=2
INFO fetching url=https://quotes.toscrape.com status=200
INFO discovered links=50 items=10

✅ Crawl complete in 3.2s
   Requests: 52 sent, 0 failed
   Items: 47 scraped, 0 dropped
   Output: ./output
Plain English

Wait, what does this thing actually do?

Let's break it down without the tech jargon.

Think of it like a super-smart browser.

You know how you visit a website, click links, and read pages? ScrapeGoat does that automatically, but really fast and at scale. It can visit thousands of pages, grab the information you care about, and save it neatly in files you can actually use.

Whether you're a developer building a data pipeline, a researcher collecting articles, or someone who just wants to grab product prices, ScrapeGoat has you covered.

🔗

Crawl

Follow links across pages, just like browsing, but automatic.

📋

Extract

Pick out titles, prices, text: whatever data matters to you.

🔍

Search

Build a searchable index of any website's content.

🤖

AI Power

Summarize pages and find names, places, and sentiment.

What's Inside

Packed with powerful features

Everything you need for serious web scraping, out of the box.

⚡

Blazing Fast Crawling

Run up to hundreds of workers at once. ScrapeGoat handles concurrency, throttling, and retries so your crawls finish fast without crashing target sites.

🎯

Smart Data Extraction

Use CSS selectors, XPath, or regex patterns to grab exactly the data you need. Supports JSON-LD, OpenGraph, and Twitter Cards out of the box.
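Of the three strategies, regex is the bluntest (CSS selectors and XPath cope better with markup changes) but it's the easiest to sketch with nothing beyond the standard library. A toy example of the idea, not ScrapeGoat's parser:

```go
package main

import (
	"fmt"
	"regexp"
)

// extractAll returns the first capture group of every match of pattern in html.
func extractAll(html, pattern string) []string {
	re := regexp.MustCompile(pattern)
	var out []string
	for _, m := range re.FindAllStringSubmatch(html, -1) {
		out = append(out, m[1])
	}
	return out
}

func main() {
	page := `<span class="text">Quote one</span><span class="text">Quote two</span>`
	quotes := extractAll(page, `<span class="text">([^<]+)</span>`)
	fmt.Println(quotes) // [Quote one Quote two]
}
```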

🔍

Search Engine Mode

Index any website with full-text content, headings hierarchy, metadata, and link graphs. Build your own mini search engine.
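Search-engine mode boils down to an inverted index: a map from each token to the set of pages containing it. A toy stdlib version to show the shape (ScrapeGoat's real index also tracks headings, metadata, and link graphs):

```go
package main

import (
	"fmt"
	"strings"
)

// Index maps a lowercased token to the set of page URLs containing it.
type Index map[string]map[string]bool

// Add tokenizes text and records that url contains each token.
func (ix Index) Add(url, text string) {
	for _, tok := range strings.Fields(strings.ToLower(text)) {
		if ix[tok] == nil {
			ix[tok] = make(map[string]bool)
		}
		ix[tok][url] = true
	}
}

// Search returns the URLs whose indexed text contained the term.
func (ix Index) Search(term string) []string {
	var urls []string
	for u := range ix[strings.ToLower(term)] {
		urls = append(urls, u)
	}
	return urls
}

func main() {
	ix := Index{}
	ix.Add("https://go.dev", "The Go programming language")
	ix.Add("https://go.dev/doc", "Go documentation and tutorials")
	fmt.Println(len(ix.Search("go"))) // 2
}
```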

🧠

AI-Powered Analysis

Automatically summarize pages, extract named entities (people, places, organizations), and detect sentiment, using Ollama or OpenAI.

💾

Multiple Output Formats

Export your data as JSON, JSONL (streaming), or CSV. Files are written atomically so nothing gets corrupted mid-crawl.

🔄

Proxy Rotation

Route requests through rotating proxies with round-robin or random selection. Auto-rotates on failure and includes health checking.
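Round-robin selection itself is tiny: an atomic counter modulo the proxy list. A minimal concurrency-safe sketch (names here are hypothetical; ScrapeGoat additionally health-checks proxies and rotates away from failures):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Rotator hands out proxy URLs round-robin; safe for concurrent use.
type Rotator struct {
	proxies []string
	next    atomic.Uint64
}

// Next returns the next proxy in sequence, wrapping around the list.
func (r *Rotator) Next() string {
	n := r.next.Add(1) - 1
	return r.proxies[n%uint64(len(r.proxies))]
}

func main() {
	r := &Rotator{proxies: []string{"http://p1:8080", "http://p2:8080"}}
	for i := 0; i < 3; i++ {
		fmt.Println(r.Next()) // p1, p2, then wraps back to p1
	}
}
```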

⏸️

Pause & Resume

Checkpoint your crawl progress automatically. If your process stops, pick up right where you left off with no duplicate work.

📊

Prometheus Metrics

Built-in /metrics and /health endpoints. Monitor requests, errors, bytes downloaded, and active workers in real-time.

Let's Go

Up and running in 3 minutes

You need Go 1.21+ and a terminal. That's it.

1

Clone & Build

Grab the code from GitHub and compile the binary.

bash
git clone https://github.com/IshaanNene/ScrapeGoat
cd ScrapeGoat
make build
2

Run Your First Crawl

Point ScrapeGoat at any website and watch it go.

bash
./bin/scrapegoat crawl https://quotes.toscrape.com --depth 2
3

Check Your Results

Your scraped data is waiting in the output folder, neatly formatted as JSON.

bash
cat output/crawl_*.json | head -50
4

Try Search or AI Mode

Go beyond basic crawling: index for search or analyze with AI.

bash
# Build a searchable index of a website
./bin/scrapegoat search https://go.dev --depth 2

# Or use AI to summarize + analyze content
./bin/scrapegoat ai-crawl https://news.ycombinator.com
Reference

CLI Commands

Everything you can do from your terminal, explained simply.

scrapegoat crawl

What it does: Visits a website, follows links, and downloads pages. If you've set up parse rules (in a YAML config), it also extracts specific data from each page.

bash
# Simple crawl
./bin/scrapegoat crawl https://quotes.toscrape.com --depth 2

# Stay within one domain, limit pages
./bin/scrapegoat crawl https://en.wikipedia.org/wiki/Web_scraping \
  --depth 1 --max-requests 30 --allowed-domains en.wikipedia.org

# High performance mode
./bin/scrapegoat crawl https://example.com \
  --concurrency 20 --delay 200ms --format jsonl
Flag                Short  Default   What it does
--depth             -d     3         How many links deep to follow
--concurrency       -n     10        Number of pages to fetch in parallel
--delay                    1s        Wait time between requests to the same site
--format            -f     json      Output format: json, jsonl, or csv
--output            -o     ./output  Where to save the results
--max-requests      -m     0         Stop after this many requests (0 = no limit)
--max-retries              3         How many times to retry a failed page
--allowed-domains          all       Only crawl these domains (comma-separated)
--user-agent               built-in  Custom browser identity string
--verbose           -v     off       Show detailed logs for every request

scrapegoat ai-crawl

What it does: Crawls a website and then sends each page's content to an AI model for analysis. You get back: a summary (~200 words), named entities (people, organizations, locations), and sentiment analysis (positive/negative/neutral).

bash
# With Ollama (local, free, no API key needed)
ollama serve &
ollama pull llama3.2
./bin/scrapegoat ai-crawl https://news.ycombinator.com

# With OpenAI
OPENAI_API_KEY=sk-... ./bin/scrapegoat ai-crawl https://techcrunch.com \
  --llm openai --model gpt-4o-mini

# Any OpenAI-compatible API
./bin/scrapegoat ai-crawl https://example.com \
  --llm custom --llm-endpoint http://localhost:8080 --model mistral
Flag             Default  What it does
--depth          2        Link depth to follow
--concurrency    5        Parallel workers (lower for AI to avoid overload)
--delay          500ms    Wait time between requests
--max-pages      50       Maximum pages to analyze
--llm            ollama   AI provider: ollama, openai, or custom
--model          —        Which AI model to use (e.g., llama3.2, gpt-4o-mini)
--llm-endpoint   —        Custom API endpoint URL

Utility Commands

Quick helpers that don't require a URL.

bash
# Show the version
./bin/scrapegoat version

# Show current configuration
./bin/scrapegoat config

# Use a custom config file
./bin/scrapegoat crawl https://example.com --config configs/default.yaml
For Developers

Use it as a Go Library

Embed ScrapeGoat directly into your Go application; no CLI needed.

Go
package main

import (
    "fmt"
    "time"
    scrapegoat "github.com/IshaanNene/ScrapeGoat/pkg/scrapegoat"
)

func main() {
    crawler := scrapegoat.NewCrawler(
        scrapegoat.WithConcurrency(5),
        scrapegoat.WithMaxDepth(2),
        scrapegoat.WithDelay(500 * time.Millisecond),
        scrapegoat.WithOutput("json", "./output"),
    )

    // Extract quotes from each page
    crawler.OnHTML(".quote", func(e *scrapegoat.Element) {
        e.Item.Set("quote", e.Selection.Find(".text").Text())
        e.Item.Set("author", e.Selection.Find(".author").Text())
    })

    crawler.Start("https://quotes.toscrape.com")
    crawler.Wait()
    fmt.Println("Done!", crawler.Stats())
}
WithConcurrency(n)

How many pages to fetch in parallel

WithMaxDepth(d)

Maximum link depth to crawl

WithDelay(d)

Wait time between requests to be polite

WithOutput(fmt, path)

Output format (json/jsonl/csv) and directory

WithAllowedDomains(...)

Only crawl pages on these domains

WithProxy(...)

Route through these proxy servers

WithRobotsRespect(true)

Follow robots.txt rules (on by default)

WithMaxRequests(n)

Stop after N total requests

Customize

Configure Everything

Use a YAML file to set defaults, or override anything with CLI flags.

yaml
# configs/default.yaml
engine:
  concurrency: 10
  max_depth: 5
  request_timeout: 30s
  politeness_delay: 1s
  respect_robots_txt: true
  max_retries: 3

storage:
  type: json
  output_path: ./output

proxy:
  enabled: false
  rotation: round_robin

metrics:
  enabled: false
  port: 9090

βš™οΈ Engine

Workers, depth, timeouts, retries, user-agents, domain filters

🌐 Fetcher

HTTP or headless browser mode, redirects, body size limits

💾 Storage

Output format (JSON/JSONL/CSV), path, batch size

🔄 Proxy

Enable rotation, add proxy URLs, health checks

📊 Metrics

Prometheus endpoint for real-time monitoring

Under the Hood

How ScrapeGoat Works

A modular pipeline where each piece does one job well.

Input Layer: CLI Commands · Go SDK · YAML Config
⬇
Core Engine: Scheduler · Frontier · Dedup · Robots · Checkpoint
⬇
Fetch Layer: HTTP Fetcher · Browser Fetcher · Proxy Rotation · Stealth Mode
⬇
Parse & Process: CSS Parser · XPath Parser · Regex Parser · Pipeline · AI (LLM)
⬇
Output Layer: JSON · JSONL · CSV · Prometheus
Ready to Run

Example Scrapers

Pre-built examples you can run instantly, with no configuration needed.

📰

Hacker News

Scrape top stories with rank, title, URL, points, author, and comments count.

bash
go run ./examples/hackernews/
🛒

E-Commerce

Extract product titles, prices, ratings, and stock status from books.toscrape.com.

bash
go run ./examples/ecommerce/
πŸ™

GitHub Trending

Get trending repos with name, description, language, stars, and forks.

bash
go run ./examples/github/
📚

Wikipedia

Deep crawl Wikipedia articles: titles, summaries, categories, references.

bash
go run ./examples/wikipedia/
Questions?

Frequently Asked Questions

Do I need to know Go to use ScrapeGoat?

Nope! The CLI tool works from any terminal. Just run commands like scrapegoat crawl <url>, no Go code required. The Go SDK is only needed if you want to embed scraping into your own Go applications.

Is web scraping legal?

Web scraping is generally legal for publicly available data, but you should always check a website's Terms of Service and robots.txt file. ScrapeGoat respects robots.txt by default to be a good citizen of the web.

Do the AI features cost money?

Not if you use Ollama! Ollama runs AI models locally on your machine: completely free, no API keys, no cloud. If you want to use OpenAI models, then yes, you'll need an API key.

What happens if my crawl is interrupted?

ScrapeGoat saves checkpoints of its progress automatically (every 60 seconds by default). If your crawl stops, you can resume from the last checkpoint instead of starting over. It also handles Ctrl+C gracefully, saving state before shutting down.

How does ScrapeGoat avoid overloading websites?

ScrapeGoat has several built-in safeguards: per-domain politeness delays (default 1 second between requests), User-Agent rotation, proxy support, and robots.txt compliance. You can also configure custom delay values and use the --allowed-domains flag to limit your crawl scope.

Can it scrape JavaScript-heavy sites?

Yes! ScrapeGoat includes a headless browser fetcher powered by go-rod. This can render JavaScript, handle dynamic content, and even includes stealth mode to avoid detection. Configure it by setting fetcher.type: browser in your YAML config.

What output formats are supported?

Three formats: JSON (pretty-printed, great for readability), JSONL (one record per line, ideal for streaming and large datasets), and CSV (spreadsheet-friendly). Use the --format flag to choose.