ar-crawl · agent-first web crawler

A crawler for the agent holding the leash.

The AI belongs in your agent. The crawler's job is to return clean data — JSON, CSV, or a queryable SQLite file. ar-crawl runs over direct HTTP, local Playwright, or any of four managed services with automatic failover. No LLM calls in the tool. No vendor lock-in in your pipeline.

Direct · Playwright · FireCrawl · ScrapingBee · Browserless · ScraperAPI
JSON · CSV · Markdown · SQLite · XPath extraction
Racket · Dockerised · production-ready

Crawlers that bake in an LLM charge you twice, and lock you to one model.

The "AI-powered crawler" trend bundled extraction, summarisation, and a proprietary LLM into the crawling tool. You pay per LLM call per page, you're tied to that vendor's model, and you can't reuse the raw data for a second task without re-crawling.

What AI-in-the-crawler gives you

Two LLM bills — the crawler's embedded model and your agent's model — for every page, forever.

Vendor lock-in at the crawler layer. Switch LLMs and your extraction pipeline breaks.

One task per crawl — the extracted output is shaped to a specific prompt. Reuse means re-crawl.

What ar-crawl gives you

Clean data, no LLM charges — HTML, markdown, and text returned in the format your agent asks for.

Any agent, any model — Claude, GPT, Llama, local. The crawler doesn't care; its output is structured text.

Crawl once, use many — dump to SQLite, query it from every agent task you ever run against that data.

An agent crawls a site, queries the SQLite, and thinks.

ar-crawl does the fetching. Your agent does the thinking. Step through the workflow below.

01
Single-URL crawl
Direct HTTP, no API key, clean JSON.
02
JS-heavy site
Playwright auto-spawns for React / SPA pages.
03
Site-wide crawl → SQLite
URL filters, depth limits, queryable output.
04
Service failover
FireCrawl → ScrapingBee → Browserless — auto.
05
XPath extraction
Structured fields lifted from the crawl.
ar-crawl · agent session
$ ar-crawl crawl "https://example.com" --verbose

service:     direct (no API key required)
fetched:     200 OK  ·  12.4 KB  ·  214 ms
output:      { "url": "…", "title": "…", "text": "…", "html": "…" }

# Clean, structured, ready for the agent to read.
# No LLM call in sight. No token bill.
$ ar-crawl -s playwright crawl "https://react-app.com"

[playwright]  auto-spawning local browser service
[playwright]  page loaded  ·  DOM ready 1.8 s  ·  hydrated 3.1 s
service:     playwright (local)
rendered:    345 KB  ·  JS hydrated
recommend:   --wait-for networkidle --delay 500

# Page-load probing: ar-crawl measures JS timing and tells
# you what scraping parameters it thinks it needs.
$ ar-crawl crawl-site "https://news.example.com" \
    --output        news.db \
    --format        sqlite \
    --max-pages     200 \
    --url-pattern   ".*news\.example\.com/(2025|2026)/.*" \
    --crawl-delay   1000

crawled:  200 pages  ·  38 MB html  ·  4m 12s
output:   news.db   (SQLite · pages, links, meta, text)

$ # your agent then queries the DB — no re-crawl
$ sqlite3 news.db "SELECT title FROM pages WHERE text LIKE '%outage%'"
$ ar-crawl crawl "https://cloudflare-protected.com"

[firecrawl]    attempt 1 …  HTTP 403 · challenge page
[scrapingbee]  attempt 2 …  timeout after 30s
[browserless]  attempt 3 …  200 OK  ·  rendered

service used:  browserless   (after 2 failovers)
total time:    42.1 s

# Failover is built in. Configure any subset of:
# direct, playwright, firecrawl, scrapingbee, browserless, scraperapi.
$ ar-crawl extract news.db \
    --xpath  "//article//h1[@class='title']/text()"  --as title \
    --xpath  "//time/@datetime"                        --as published \
    --output articles.json

extracted:  200 rows  ·  2 fields
output:     articles.json

$ # the agent now sees structured rows instead of raw html

$ jq '.[0]' articles.json
{
  "url":        "https://news.example.com/2026/04/outage",
  "title":      "Region-wide outage resolved",
  "published":  "2026-04-09T14:22:00Z"
}
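
On the agent side, those rows drop straight into ordinary code. A minimal Python sketch, assuming the articles.json fields shown in the jq output above (url, title, published):

import json

# Rows produced by `ar-crawl extract` in step 05 above.
with open("articles.json") as f:
    articles = json.load(f)

# Plain list filtering: no HTML parsing, no LLM call, no crawler vendor in sight.
recent_outages = [
    a for a in articles
    if a["published"] >= "2026-01-01" and "outage" in a["title"].lower()
]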

Three properties most "agent crawlers" miss.

01 · Agent-first

A tool, not a co-pilot.

ar-crawl fetches and structures. Your agent reads the structured data and decides. No LLM is called from inside the crawler — so there's no API key, no model preference, and no per-page token charge baked into the tool.

  • Zero LLM calls in the crawl path
  • Outputs shaped for agent consumption
  • JSON, CSV, Markdown, SQLite
  • XPath extraction for structured fields
  • LLM-agnostic by construction
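
From the agent's side, that amounts to a subprocess call and a JSON parse. A minimal sketch, assuming `ar-crawl crawl <url>` writes the JSON object from the demo session ({url, title, text, html}) to stdout; the exact output behaviour may differ in your build.

import json
import subprocess

# Assumption: `ar-crawl crawl <url>` prints the JSON object shown in the demo
# session ({"url", "title", "text", "html"}) to stdout.
result = subprocess.run(
    ["ar-crawl", "crawl", "https://example.com"],
    capture_output=True, text=True, check=True,
)
page = json.loads(result.stdout)

# From here the agent's own model takes over; the crawler never calls one.
prompt = f"Summarise this page:\n\n{page['title']}\n\n{page['text']}"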
02 · Service-agnostic

Six fetch strategies, one CLI.

Direct HTTP, local Playwright, and four managed services behind one interface. Configure a priority list; ar-crawl fails over when a service errors, times out, or hits a challenge page.

  • Built-in direct HTTP — no key needed
  • Playwright auto-spawns for JS-heavy pages
  • FireCrawl · ScrapingBee · Browserless · ScraperAPI
  • Health checks and service probing
  • Automatic failover with attempt logs
03 · Production-grade

Built to run, not to demo.

Docker, monitoring, health endpoints, retry budgets, queue with dedup and depth control. Crawl a 200-page site at 2 AM and the logs will tell you exactly what happened.

  • Docker Compose for the full stack
  • SQLite output with full-text queries
  • Regex URL filters & domain restrictions
  • Depth control, dedup, crawl delay
  • Written in Racket, shipped as a CLI

A queue, a service plan, an output your agent can query.

ar-crawl maintains a crawl queue with dedup and depth tracking. Each URL is dispatched to a fetch service chosen by priority; on failure, the next service attempts. Clean output lands in your chosen format.

┌──────────────────┐      ┌──────────────────┐      ┌──────────────────┐
│ your LLM agent   │      │ ar-crawl CLI     │      │ fetch layer      │
│                  │      │                  │      │                  │
│ invokes CLI      │─────►│ queue            │─────►│ direct HTTP      │
│ parses JSON or   │      │ dedup            │      │ playwright       │
│ queries SQLite   │      │ depth control    │      │ firecrawl        │
└──────────────────┘      │ URL filters      │      │ scrapingbee      │
         ▲                │ failover policy  │      │ browserless      │
         │                └──────────────────┘      │ scraperapi       │
         │                                          └──────────────────┘
         │                                                    │
         │                ┌─────────────────────────────────┐ │
         │                │          output writer          │ │
         └────────────────┤ JSON · CSV · Markdown · SQLite  │◄┘
                          │        XPath extraction         │
                          └─────────────────────────────────┘
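
The dispatch loop is small enough to sketch. The Python below is illustrative only, not the Racket implementation: fetch_with, write_output, url_ok, and the links field are stand-ins for the real fetch layer, output writer, and URL filters.

from collections import deque

def crawl_site(seed_url, services, max_depth, url_ok, fetch_with, write_output):
    """Illustrative dispatch loop: a queue with dedup and depth tracking,
    services tried in priority order, clean records handed to the output writer."""
    queue, seen = deque([(seed_url, 0)]), {seed_url}
    while queue:
        url, depth = queue.popleft()
        page = None
        for service in services:                  # failover: configured priority order
            page = fetch_with(service, url)       # stand-in; returns None on error/timeout
            if page is not None:
                break
        if page is None:
            continue                              # all services failed; attempts are logged
        write_output(page)                        # JSON / CSV / Markdown / SQLite
        if depth < max_depth:
            for link in page["links"]:            # the "links" field name is an assumption
                if url_ok(link) and link not in seen:   # regex URL filter + dedup
                    seen.add(link)
                    queue.append((link, depth + 1))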

No keys to start

The direct and Playwright paths work out of the box. API keys are optional, added only when you want the managed services.

Playwright when it matters

JS-heavy sites auto-route to a local Playwright service. The tool probes page load timing and suggests parameters.

SQLite is the plan

Crawl once to .db, query it forever. Your agent runs SQL against a real database instead of re-crawling to re-read.
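
Reuse looks like this from the agent's side; a sketch assuming the pages table (title, text columns) shown in the crawl-site demo:

import sqlite3

# A later, unrelated agent task reuses the same crawl: no re-fetching.
# Assumes the `pages` table (title, text) from the crawl-site demo.
con = sqlite3.connect("news.db")
titles = [
    row[0] for row in
    con.execute("SELECT title FROM pages WHERE text LIKE ?", ("%maintenance%",))
]
con.close()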

XPath extracts cleanly

Lift structured fields at extraction time — titles, dates, authors — so agents never re-parse HTML with a regex.

Failover is declarative

Configure a service priority list. ar-crawl attempts in order, logs the reason for each skip, and reports which service succeeded.

Deploys in a container

Docker Compose starts the CLI and the Playwright service together. Health checks and monitoring are included.

Frequently asked questions.

How is this different from FireCrawl or ScrapingBee directly?
ar-crawl is a thin layer above them — plus a direct-HTTP path, Playwright, queueing, failover, and a choice of output formats. You can use ar-crawl without ever touching those services (direct + Playwright is plenty for most sites) and switch providers without changing your agent code.
Why "no AI in the tool"?
AI-integrated crawlers shape the output to a specific prompt and charge for every LLM call. Your agent already has an LLM. Shaping and paying for a second one inside the crawler is duplication at best, lock-in at worst. ar-crawl stays a pure crawling tool.
Why Racket?
Racket has excellent XML/HTML parsing, pattern matching, and a fast HTTP client out of the box. The codebase stays small, and the language's contract system catches errors early. The CLI wrapper means the agent never sees Racket; it just calls ar-crawl.
Can I use this from Claude Code / a Claude agent?
Yes — that's the primary use case. The agent invokes the CLI, reads the JSON or opens the SQLite, and processes the data with its own model. ar-crawl's only job is to return clean bytes.
How does failover decide?
Configured priority list, per-service timeout, and an error classifier. 4xx → try next service, 5xx → retry with backoff, challenge page → try a rendering service. Every attempt is logged so your agent can see what happened.
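
As an illustration only (not the Racket code), those rules reduce to a few lines of Python; the service names and the fetch callable below are stand-ins:

import time

RENDERING = {"playwright", "browserless"}        # assumption: services that render JS

def classify(status, body):
    if "challenge" in body.lower():              # e.g. a Cloudflare interstitial
        return "challenge"
    if status >= 500:
        return "server-error"
    if status >= 400:
        return "client-error"
    return "ok"

def fetch_with_failover(url, services, fetch, retries=2):
    pending = list(services)                     # configured priority order
    while pending:
        service = pending.pop(0)
        for attempt in range(retries + 1):
            status, body = fetch(service, url)   # stand-in for the real fetch layer
            kind = classify(status, body)
            if kind == "ok":
                return service, body
            if kind == "server-error":           # 5xx: retry this service with backoff
                time.sleep(2 ** attempt)
                continue
            if kind == "challenge":              # challenge page: keep only rendering services
                pending = [s for s in pending if s in RENDERING]
            break                                # 4xx or challenge: move to the next service
    return None, None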
Where's the source?
Open source. Written in Racket, shipped as a single CLI and a Playwright companion container. Docker Compose for the full stack.

Give your agent a crawler that does its job — and nothing else.

Clean data out. No LLM calls in. Six fetch strategies behind one CLI, with JSON or SQLite waiting for your agent to read.

install: curl -sSf ar-crawl.sh/install | sh