ar-crawl · agent-first web crawler

A crawler for the agent holding the leash.

The AI belongs in your agent. The crawler's job is to return clean data — JSON, CSV, or a queryable SQLite file. ar-crawl runs over direct HTTP, local Playwright, or any of four managed services with automatic failover. No LLM calls in the tool. No vendor lock-in in your pipeline.

Direct · Playwright · FireCrawl · ScrapingBee · Browserless · ScraperAPI
JSON · CSV · Markdown · SQLite · XPath extraction
Racket · Dockerised · production-ready

Crawlers that bake in an LLM charge you twice, and lock you to one model.

The "AI-powered crawler" trend bundled extraction, summarisation, and a proprietary LLM into the crawling tool. You pay per LLM call per page, you're tied to that vendor's model, and you can't reuse the raw data for a second task without re-crawling.

What AI-in-the-crawler gives you

Two LLM bills — the crawler's embedded model and your agent's model — for every page, forever.

Vendor lock-in at the crawler layer. Switch LLMs and your extraction pipeline breaks.

One task per crawl — the extracted output is shaped to a specific prompt. Reuse means re-crawl.

What ar-crawl gives you

Clean data, no LLM charges — HTML, markdown, and text returned in the format your agent asks for.

Any agent, any model — Claude, GPT, Llama, local. The crawler doesn't care; its output is structured text.

Crawl once, use many — dump to SQLite, query it from every agent task you ever run against that data.

An agent crawls a site, queries the SQLite, and thinks.

ar-crawl does the fetching. Your agent does the thinking. Step through the workflow below.

01
Single-URL crawl
Direct HTTP, no API key, clean JSON.
02
JS-heavy site
Playwright auto-spawns for React / SPA pages.
03
Site-wide crawl → SQLite
URL filters, depth limits, queryable output.
04
Service failover
FireCrawl → ScrapingBee → Browserless — auto.
05
XPath extraction
Structured fields lifted from the crawl.
ar-crawl · agent session
$ ar-crawl crawl "https://example.com" --verbose

service:     direct (no API key required)
fetched:     200 OK  ·  12.4 KB  ·  214 ms
output:      { "url": "…", "title": "…", "text": "…", "html": "…" }

# Clean, structured, ready for the agent to read.
# No LLM call in sight. No token bill.
$ ar-crawl -s playwright crawl "https://react-app.com"

[playwright]  auto-spawning local browser service
[playwright]  page loaded  ·  DOM ready 1.8 s  ·  hydrated 3.1 s
service:     playwright (local)
rendered:    345 KB  ·  JS hydrated
recommend:   --wait-for networkidle --delay 500

# Page-load probing: ar-crawl measures JS timing and tells
# you what scraping parameters it thinks it needs.
$ ar-crawl crawl-site "https://news.example.com" \
    --output        news.db \
    --format        sqlite \
    --max-pages     200 \
    --url-pattern   ".*news\.example\.com/(2025|2026)/.*" \
    --crawl-delay   1000

crawled:  200 pages  ·  38 MB html  ·  4m 12s
output:   news.db   (SQLite · pages, links, meta, text)

$ # your agent then queries the DB — no re-crawl
$ sqlite3 news.db "SELECT title FROM pages WHERE text LIKE '%outage%'"
$ ar-crawl crawl "https://cloudflare-protected.com"

[firecrawl]    attempt 1 …  HTTP 403 · challenge page
[scrapingbee]  attempt 2 …  timeout after 30s
[browserless]  attempt 3 …  200 OK  ·  rendered

service used:  browserless   (after 2 failovers)
total time:    42.1 s

# Failover is built in. Configure any subset of:
# direct, playwright, firecrawl, scrapingbee, browserless, scraperapi.
$ ar-crawl extract news.db \
    --xpath  "//article//h1[@class='title']/text()"  --as title \
    --xpath  "//time/@datetime"                        --as published \
    --output articles.json

extracted:  200 rows  ·  2 fields
output:     articles.json

$ # the agent now sees structured rows instead of raw html

$ jq '.[0]' articles.json
{
  "url":        "https://news.example.com/2026/04/outage",
  "title":      "Region-wide outage resolved",
  "published":  "2026-04-09T14:22:00Z"
}
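
On the agent side, those rows drop straight into ordinary code. A minimal Python sketch, assuming the articles.json fields shown in the jq output above (url, title, published):

import json

# Rows produced by `ar-crawl extract` in step 05 above.
with open("articles.json") as f:
    articles = json.load(f)

# Plain list filtering: no HTML parsing, no LLM call, no crawler vendor in sight.
recent_outages = [
    a for a in articles
    if a["published"] >= "2026-01-01" and "outage" in a["title"].lower()
]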

Three properties most "agent crawlers" miss.

01 · Agent-first

A tool, not a co-pilot.

ar-crawl fetches and structures. Your agent reads the structured data and decides. No LLM is called from inside the crawler — so there's no API key, no model preference, and no per-page token charge baked into the tool.

  • Zero LLM calls in the crawl path
  • Outputs shaped for agent consumption
  • JSON, CSV, Markdown, SQLite
  • XPath extraction for structured fields
  • LLM-agnostic by construction
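
From the agent's side, that amounts to a subprocess call and a JSON parse. A minimal sketch, assuming `ar-crawl crawl <url>` writes the JSON object from the demo session ({url, title, text, html}) to stdout; the exact output behaviour may differ in your build.

import json
import subprocess

# Assumption: `ar-crawl crawl <url>` prints the JSON object shown in the demo
# session ({"url", "title", "text", "html"}) to stdout.
result = subprocess.run(
    ["ar-crawl", "crawl", "https://example.com"],
    capture_output=True, text=True, check=True,
)
page = json.loads(result.stdout)

# From here the agent's own model takes over; the crawler never calls one.
prompt = f"Summarise this page:\n\n{page['title']}\n\n{page['text']}"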
02 · Service-agnostic

Six fetch strategies, one CLI.

Direct HTTP, local Playwright, and four managed services behind one interface. Configure a priority list; ar-crawl fails over when a service errors, times out, or hits a challenge page.

  • Built-in direct HTTP — no key needed
  • Playwright auto-spawns for JS-heavy pages
  • FireCrawl · ScrapingBee · Browserless · ScraperAPI
  • Health checks and service probing
  • Automatic failover with attempt logs
03 · Production-grade

Built to run, not to demo.

Docker, monitoring, health endpoints, retry budgets, queue with dedup and depth control. Crawl a 200-page site at 2 AM and the logs will tell you exactly what happened.

  • Docker Compose for the full stack
  • SQLite output with full-text queries
  • Regex URL filters & domain restrictions
  • Depth control, dedup, crawl delay
  • Written in Racket, shipped as a CLI

A queue, a service plan, an output your agent can query.

ar-crawl maintains a crawl queue with dedup and depth tracking. Each URL is dispatched to a fetch service chosen by priority; on failure, the next service attempts. Clean output lands in your chosen format.

┌──────────────────┐      ┌──────────────────┐      ┌──────────────────┐
│ your LLM agent   │      │ ar-crawl CLI     │      │ fetch layer      │
│                  │      │                  │      │                  │
│ invokes CLI      │─────►│ queue            │─────►│ direct HTTP      │
│ parses JSON or   │      │ dedup            │      │ playwright       │
│ queries SQLite   │      │ depth control    │      │ firecrawl        │
└──────────────────┘      │ URL filters      │      │ scrapingbee      │
         ▲                │ failover policy  │      │ browserless      │
         │                └──────────────────┘      │ scraperapi       │
         │                                          └──────────────────┘
         │                                                    │
         │                ┌─────────────────────────────────┐ │
         │                │          output writer          │ │
         └────────────────┤ JSON · CSV · Markdown · SQLite  │◄┘
                          │        XPath extraction         │
                          └─────────────────────────────────┘
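
The dispatch loop is small enough to sketch. The Python below is illustrative only, not the Racket implementation: fetch_with, write_output, url_ok, and the links field are stand-ins for the real fetch layer, output writer, and URL filters.

from collections import deque

def crawl_site(seed_url, services, max_depth, url_ok, fetch_with, write_output):
    """Illustrative dispatch loop: a queue with dedup and depth tracking,
    services tried in priority order, clean records handed to the output writer."""
    queue, seen = deque([(seed_url, 0)]), {seed_url}
    while queue:
        url, depth = queue.popleft()
        page = None
        for service in services:                  # failover: configured priority order
            page = fetch_with(service, url)       # stand-in; returns None on error/timeout
            if page is not None:
                break
        if page is None:
            continue                              # all services failed; attempts are logged
        write_output(page)                        # JSON / CSV / Markdown / SQLite
        if depth < max_depth:
            for link in page["links"]:            # the "links" field name is an assumption
                if url_ok(link) and link not in seen:   # regex URL filter + dedup
                    seen.add(link)
                    queue.append((link, depth + 1))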

No keys to start

The direct and Playwright paths work out of the box. API keys are optional, added only when you want the managed services.

Playwright when it matters

JS-heavy sites auto-route to a local Playwright service. The tool probes page load timing and suggests parameters.

SQLite is the plan

Crawl once to .db, query it forever. Your agent runs SQL against a real database instead of re-crawling to re-read.
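
Reuse looks like this from the agent's side; a sketch assuming the pages table (title, text columns) shown in the crawl-site demo:

import sqlite3

# A later, unrelated agent task reuses the same crawl: no re-fetching.
# Assumes the `pages` table (title, text) from the crawl-site demo.
con = sqlite3.connect("news.db")
titles = [
    row[0] for row in
    con.execute("SELECT title FROM pages WHERE text LIKE ?", ("%maintenance%",))
]
con.close()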

XPath extracts cleanly

Lift structured fields at extraction time — titles, dates, authors — so agents never re-parse HTML with a regex.

Failover is declarative

Configure a service priority list. ar-crawl attempts in order, logs the reason for each skip, and reports which service succeeded.

Deploys in a container

Docker Compose starts the CLI and the Playwright service together. Health checks and monitoring are included.

Frequently asked questions.

How is this different from FireCrawl or ScrapingBee directly?
ar-crawl is a thin layer above them — plus a direct-HTTP path, Playwright, queueing, failover, and a choice of output formats. You can use ar-crawl without ever touching those services (direct + Playwright is plenty for most sites) and switch providers without changing your agent code.
Why "no AI in the tool"?
AI-integrated crawlers shape the output to a specific prompt and charge for every LLM call. Your agent already has an LLM. Shaping and paying for a second one inside the crawler is duplication at best, lock-in at worst. ar-crawl stays a pure crawling tool.
Why Racket?
Racket has excellent XML/HTML parsing, pattern matching, and a fast HTTP client out of the box. The codebase stays small, and the language's contract system catches errors early. The CLI wrapper means the agent never sees Racket; it just calls ar-crawl.
Can I use this from Claude Code / a Claude agent?
Yes — that's the primary use case. The agent invokes the CLI, reads the JSON or opens the SQLite, and processes the data with its own model. ar-crawl's only job is to return clean bytes.
How does failover decide?
Configured priority list, per-service timeout, and an error classifier. 4xx → try next service, 5xx → retry with backoff, challenge page → try a rendering service. Every attempt is logged so your agent can see what happened.
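
As an illustration only (not the Racket code), those rules reduce to a few lines of Python; the service names and the fetch callable below are stand-ins:

import time

RENDERING = {"playwright", "browserless"}        # assumption: services that render JS

def classify(status, body):
    if "challenge" in body.lower():              # e.g. a Cloudflare interstitial
        return "challenge"
    if status >= 500:
        return "server-error"
    if status >= 400:
        return "client-error"
    return "ok"

def fetch_with_failover(url, services, fetch, retries=2):
    pending = list(services)                     # configured priority order
    while pending:
        service = pending.pop(0)
        for attempt in range(retries + 1):
            status, body = fetch(service, url)   # stand-in for the real fetch layer
            kind = classify(status, body)
            if kind == "ok":
                return service, body
            if kind == "server-error":           # 5xx: retry this service with backoff
                time.sleep(2 ** attempt)
                continue
            if kind == "challenge":              # challenge page: keep only rendering services
                pending = [s for s in pending if s in RENDERING]
            break                                # 4xx or challenge: move to the next service
    return None, None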
Where's the source?
Open source. Written in Racket, shipped as a single CLI and a Playwright companion container. Docker Compose for the full stack.

Give your agent a crawler that does its job — and nothing else.

Clean data out. No LLM calls in. Six fetch strategies behind one CLI, with JSON or SQLite waiting for your agent to read.

install: curl -sSf ar-crawl.sh/install | sh