ScrapeGoat is a powerful toolkit that lets you crawl websites, pull out the data you need, and even use AI to understand it, all from your terminal.
Let's break it down without the tech jargon.
You know how you visit a website, click links, and read pages? ScrapeGoat does that automatically, but really fast and at scale. It can visit thousands of pages, grab the information you care about, and save it neatly in files you can actually use.
Whether you're a developer building a data pipeline, a researcher collecting articles, or someone who just wants to download product prices, ScrapeGoat has you covered.
Follow links across pages, just like browsing, only automated.
Pick out titles, prices, text: whatever data matters to you.
Build a searchable index of any website's content.
Summarize pages and find names, places, and sentiments.
Everything you need for serious web scraping, out of the box.
Run up to hundreds of workers at once. ScrapeGoat handles concurrency, throttling, and retries so your crawls finish fast without crashing target sites.
Use CSS selectors, XPath, or regex patterns to grab exactly the data you need. Supports JSON-LD, OpenGraph, and Twitter Cards out of the box.
Index any website with full-text content, headings hierarchy, metadata, and link graphs. Build your own mini search engine.
Automatically summarize pages, extract named entities (people, places, organizations), and detect sentiment, using Ollama or OpenAI.
Export your data as JSON, JSONL (streaming), or CSV. Files are written atomically so nothing gets corrupted mid-crawl.
Route requests through rotating proxies with round-robin or random selection. Auto-rotates on failure and includes health checking.
Checkpoint your crawl progress automatically. If your process stops, pick up right where you left off, with no duplicate work.
Built-in /metrics and /health endpoints. Monitor requests, errors, bytes downloaded, and active workers in real-time.
You need Go 1.21+ and a terminal. That's it.
Grab the code from GitHub and compile the binary.
git clone https://github.com/IshaanNene/ScrapeGoat
cd ScrapeGoat
make build
Point ScrapeGoat at any website and watch it go.
./bin/scrapegoat crawl https://quotes.toscrape.com --depth 2
Your scraped data is waiting in the output folder, neatly formatted as JSON.
cat output/crawl_*.json | head -50
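If you want to post-process the output programmatically rather than eyeball it with `cat`, you can decode the JSON generically without committing to a schema up front (the exact shape of ScrapeGoat's output depends on your parse rules, so the field names below are just an example). A minimal Go sketch:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// records decodes a JSON document that is either a single object or
// an array of objects into a uniform []map[string]any slice.
func records(data []byte) ([]map[string]any, error) {
	var arr []map[string]any
	if err := json.Unmarshal(data, &arr); err == nil {
		return arr, nil
	}
	var obj map[string]any
	if err := json.Unmarshal(data, &obj); err != nil {
		return nil, err
	}
	return []map[string]any{obj}, nil
}

func main() {
	// In practice you would read this from a file in ./output.
	sample := []byte(`[{"url":"https://example.com","title":"Example"}]`)
	recs, err := records(sample)
	if err != nil {
		panic(err)
	}
	for _, r := range recs {
		fmt.Println(r["url"], "->", r["title"])
	}
}
```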
Go beyond basic crawling: index for search or analyze with AI.
# Build a searchable index of a website
./bin/scrapegoat search https://go.dev --depth 2
# Or use AI to summarize + analyze content
./bin/scrapegoat ai-crawl https://news.ycombinator.com
Everything you can do from your terminal, explained simply.
What it does: Visits a website, follows links, and downloads pages. If you've set up parse rules (in a YAML config), it also extracts specific data from each page.
# Simple crawl
./bin/scrapegoat crawl https://quotes.toscrape.com --depth 2
# Stay within one domain, limit pages
./bin/scrapegoat crawl https://en.wikipedia.org/wiki/Web_scraping \
--depth 1 --max-requests 30 --allowed-domains en.wikipedia.org
# High performance mode
./bin/scrapegoat crawl https://example.com \
--concurrency 20 --delay 200ms --format jsonl
| Flag | Short | Default | What it does |
|---|---|---|---|
| --depth | -d | 3 | How many links deep to follow |
| --concurrency | -n | 10 | Number of pages to fetch in parallel |
| --delay | | 1s | Wait time between requests to the same site |
| --format | -f | json | Output format: json, jsonl, or csv |
| --output | -o | ./output | Where to save the results |
| --max-requests | -m | 0 | Stop after this many requests (0 = no limit) |
| --max-retries | | 3 | How many times to retry a failed page |
| --allowed-domains | | all | Only crawl these domains (comma-separated) |
| --user-agent | | built-in | Custom browser identity string |
| --verbose | -v | off | Show detailed logs for every request |
What it does: Crawls a website and creates a search-ready index. For each page, it extracts the title, headings, body text, meta tags, outbound links, images, and more. The output is a JSONL file (one document per line), perfect for feeding into a search engine.
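Because the index is plain JSONL, a few lines of Go are enough to query it. The sketch below assumes documents carry `url`, `title`, and `body` keys; the actual field names in ScrapeGoat's index may differ, so check a line of your own output first:

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// doc mirrors a few fields an index line might carry; the real
// JSON keys in ScrapeGoat's output may differ.
type doc struct {
	URL   string `json:"url"`
	Title string `json:"title"`
	Body  string `json:"body"`
}

// grep scans JSONL input line by line and returns the URLs of
// documents whose title or body contains the term (case-insensitive).
func grep(input, term string) []string {
	var hits []string
	t := strings.ToLower(term)
	sc := bufio.NewScanner(strings.NewReader(input))
	for sc.Scan() {
		var d doc
		if err := json.Unmarshal(sc.Bytes(), &d); err != nil {
			continue // skip malformed lines
		}
		if strings.Contains(strings.ToLower(d.Title), t) ||
			strings.Contains(strings.ToLower(d.Body), t) {
			hits = append(hits, d.URL)
		}
	}
	return hits
}

func main() {
	jsonl := `{"url":"https://go.dev/doc","title":"Documentation","body":"Learn Go"}
{"url":"https://go.dev/blog","title":"Blog","body":"News about Go"}`
	fmt.Println(grep(jsonl, "learn"))
}
```

For larger indexes, you would stream the file with `bufio.Scanner` instead of holding it in a string, but the per-line decode loop is the same.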
# Index a Go documentation site
./bin/scrapegoat search https://go.dev --depth 2 --max-pages 100
# Index Wikipedia articles
./bin/scrapegoat search https://en.wikipedia.org/wiki/Artificial_intelligence \
--depth 2 --max-pages 50 --output ./wiki_index
| Flag | Short | Default | What it does |
|---|---|---|---|
| --depth | -d | 3 | How deep to follow links |
| --concurrency | -n | 10 | Parallel workers |
| --delay | | 200ms | Politeness delay per domain |
| --max-pages | | 500 | Maximum pages to index |
| --allowed-domains | | all | Stay within these domains |
| --output | -o | ./output/search_index | Output directory |
What it does: Crawls a website and then sends each page's content to an AI model for analysis. You get back: a summary (~200 words), named entities (people, organizations, locations), and sentiment analysis (positive/negative/neutral).
# With Ollama (local, free, no API key needed)
ollama serve &
ollama pull llama3.2
./bin/scrapegoat ai-crawl https://news.ycombinator.com
# With OpenAI
OPENAI_API_KEY=sk-... ./bin/scrapegoat ai-crawl https://techcrunch.com \
--llm openai --model gpt-4o-mini
# Any OpenAI-compatible API
./bin/scrapegoat ai-crawl https://example.com \
--llm custom --llm-endpoint http://localhost:8080 --model mistral
| Flag | Default | What it does |
|---|---|---|
| --depth | 2 | Link depth to follow |
| --concurrency | 5 | Parallel workers (kept low to avoid overloading the AI backend) |
| --delay | 500ms | Wait time between requests |
| --max-pages | 50 | Maximum pages to analyze |
| --llm | ollama | AI provider: ollama, openai, or custom |
| --model | (none) | Which AI model to use (e.g., llama3.2, gpt-4o-mini) |
| --llm-endpoint | (none) | Custom API endpoint URL |
Quick helpers that don't require a URL.
# Show the version
./bin/scrapegoat version
# Show current configuration
./bin/scrapegoat config
# Use a custom config file
./bin/scrapegoat crawl https://example.com --config configs/default.yaml
Embed ScrapeGoat directly into your Go application, no CLI needed.
package main

import (
	"fmt"
	"time"

	scrapegoat "github.com/IshaanNene/ScrapeGoat/pkg/scrapegoat"
)

func main() {
	crawler := scrapegoat.NewCrawler(
		scrapegoat.WithConcurrency(5),
		scrapegoat.WithMaxDepth(2),
		scrapegoat.WithDelay(500*time.Millisecond),
		scrapegoat.WithOutput("json", "./output"),
	)

	// Extract quotes from each page
	crawler.OnHTML(".quote", func(e *scrapegoat.Element) {
		e.Item.Set("quote", e.Selection.Find(".text").Text())
		e.Item.Set("author", e.Selection.Find(".author").Text())
	})

	crawler.Start("https://quotes.toscrape.com")
	crawler.Wait()
	fmt.Println("Done!", crawler.Stats())
}
Use a YAML file to set defaults, or override anything with CLI flags.
# configs/default.yaml
engine:
  concurrency: 10
  max_depth: 5
  request_timeout: 30s
  politeness_delay: 1s
  respect_robots_txt: true
  max_retries: 3
storage:
  type: json
  output_path: ./output
proxy:
  enabled: false
  rotation: round_robin
metrics:
  enabled: false
  port: 9090
| Section | Controls |
|---|---|
| engine | Workers, depth, timeouts, retries, user agents, domain filters |
| fetcher | HTTP or headless browser mode, redirects, body size limits |
| storage | Output format (JSON/JSONL/CSV), path, batch size |
| proxy | Enable rotation, add proxy URLs, health checks |
| metrics | Prometheus endpoint for real-time monitoring |
A modular pipeline where each piece does one job well.
Pre-built examples you can run instantly, no configuration needed.
Scrape top stories with rank, title, URL, points, author, and comments count.
go run ./examples/hackernews/
Extract product titles, prices, ratings, and stock status from books.toscrape.com.
go run ./examples/ecommerce/
Get trending repos with name, description, language, stars, and forks.
go run ./examples/github/
Deep crawl Wikipedia articles: titles, summaries, categories, references.
go run ./examples/wikipedia/
For most jobs, the CLI is all you need: scrapegoat crawl <url>, with no Go code required. The Go SDK is only needed if you want to embed scraping into your own Go applications.
Use the --allowed-domains flag to limit your crawl scope.
Enable headless browser mode by setting fetcher.type: browser in your YAML config.
ScrapeGoat can export JSON, JSONL, or CSV; use the --format flag to choose.