The AI belongs in your agent. The crawler's job is to return clean data — JSON, CSV, or a queryable SQLite file. ar-crawl runs over direct HTTP, local Playwright, or any of four managed services with automatic failover. No LLM calls in the tool. No vendor lock-in in your pipeline.
The "AI-powered crawler" trend bundled extraction, summarisation, and a proprietary LLM into the crawling tool. You pay per LLM call per page, you're tied to that vendor's model, and you can't reuse the raw data for a second task without re-crawling.
Two LLM bills — the crawler's embedded model and your agent's model — for every page, forever.
Vendor lock-in at the crawler layer. Switch LLMs and your extraction pipeline breaks.
One task per crawl — the extracted output is shaped to a specific prompt. Reuse means re-crawl.
Clean data, no LLM charges — HTML, markdown, and text returned in the format your agent asks for.
Any agent, any model — Claude, GPT, Llama, local. The crawler doesn't care; its output is structured text.
Crawl once, use many — dump to SQLite, query it from every agent task you ever run against that data.
ar-crawl does the fetching. Your agent does the thinking. Step through the workflow below.
$ ar-crawl crawl "https://example.com" --verbose service: direct (no API key required) fetched: 200 OK · 12.4 KB · 214 ms output: { "url": "…", "title": "…", "text": "…", "html": "…" } # Clean, structured, ready for the agent to read. # No LLM call in sight. No token bill.
$ ar-crawl -s playwright crawl "https://react-app.com"
[playwright] auto-spawning local browser service
[playwright] page loaded · DOM ready 1.8 s · hydrated 3.1 s
service:   playwright (local)
rendered:  345 KB · JS hydrated
recommend: --wait-for networkidle --delay 500
# Page-load probing: ar-crawl measures JS timing and tells
# you what scraping parameters it thinks it needs.
$ ar-crawl crawl-site "https://news.example.com" \
    --output news.db \
    --format sqlite \
    --max-pages 200 \
    --url-pattern ".*news\.example\.com/(2025|2026)/.*" \
    --crawl-delay 1000
crawled: 200 pages · 38 MB html · 4m 12s
output:  news.db (SQLite · pages, links, meta, text)
$ # your agent then queries the DB — no re-crawl
$ sqlite3 news.db "SELECT title FROM pages WHERE text LIKE '%outage%'"
$ ar-crawl crawl "https://cloudflare-protected.com" [firecrawl] attempt 1 … HTTP 403 · challenge page [scrapingbee] attempt 2 … timeout after 30s [browserless] attempt 3 … 200 OK · rendered service used: browserless (after 2 failovers) total time: 42.1 s # Failover is built in. Configure any subset of: # direct, playwright, firecrawl, scrapingbee, browserless, scraperapi.
$ ar-crawl extract news.db \
    --xpath "//article//h1[@class='title']/text()" --as title \
    --xpath "//time/@datetime" --as published \
    --output articles.json
extracted: 200 rows · 2 fields
output:    articles.json
$ # the agent now sees structured rows instead of raw html
$ jq '.[0]' articles.json
{
  "url": "https://news.example.com/2026/04/outage",
  "title": "Region-wide outage resolved",
  "published": "2026-04-09T14:22:00Z"
}
ar-crawl fetches and structures. Your agent reads the structured data and decides. No LLM is called from inside the crawler — so there's no API key, no model preference, and no per-page token charge baked into the tool.
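A minimal agent-side sketch in Python. It assumes ar-crawl crawl prints the JSON object shown in step 1 to stdout; the prompt hand-off at the end stands in for whichever model your agent already uses.

import json
import subprocess

def fetch_page(url: str) -> dict:
    # Run ar-crawl and parse the JSON it emits (keys as shown in step 1).
    result = subprocess.run(
        ["ar-crawl", "crawl", url],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)

page = fetch_page("https://example.com")
prompt = f"Summarise this page:\n\nTitle: {page['title']}\n\n{page['text']}"
# Hand `prompt` to whichever model the agent is configured with.
# The crawler never sees an API key or a token bill.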
Direct HTTP, local Playwright, and four managed services behind one interface. Configure a priority list; ar-crawl fails over when a service errors, times out, or hits a challenge page.
Docker, monitoring, health endpoints, retry budgets, queue with dedup and depth control. Crawl a 200-page site at 2 AM and the logs will tell you exactly what happened.
ar-crawl maintains a crawl queue with dedup and depth tracking. Each URL is dispatched to a fetch service chosen by priority; on failure, the next service in the list takes over. Clean output lands in your chosen format.
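A minimal sketch of that loop in Python, assuming each fetch service can be wrapped as a function that returns page HTML plus outgoing links. It illustrates the queue, dedup, depth, and failover pattern described above, not ar-crawl's actual internals.

from collections import deque

def crawl(start_url, fetchers, max_depth=2):
    # `fetchers` is an ordered list of (name, fetch_fn) pairs tried in
    # priority order; each fetch_fn returns (html, links) or raises.
    queue = deque([(start_url, 0)])
    seen = {start_url}
    pages = {}

    while queue:
        url, depth = queue.popleft()
        for name, fetch in fetchers:            # priority order, failover on error
            try:
                html, links = fetch(url)
                pages[url] = {"service": name, "html": html}
                break
            except Exception as err:
                print(f"[{name}] failed for {url}: {err}; trying next service")
        else:
            continue                            # every service failed; skip this URL

        if depth < max_depth:
            for link in links:
                if link not in seen:            # dedup before enqueueing
                    seen.add(link)
                    queue.append((link, depth + 1))
    return pages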
The direct and Playwright paths work out of the box. API keys are optional, added only when you want the managed services.
JS-heavy sites auto-route to a local Playwright service. The tool probes page load timing and suggests parameters.
Crawl once to .db, query it forever. Your agent runs SQL against a real database instead of re-crawling to re-read.
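For example, an agent task can read the step 3 database with nothing but Python's standard library. The table and column names (pages, url, title, text) follow the step 3 demo; adjust them to the schema ar-crawl actually writes.

import sqlite3

con = sqlite3.connect("news.db")
rows = con.execute(
    "SELECT url, title FROM pages WHERE text LIKE ?", ("%outage%",)
).fetchall()
for url, title in rows:
    print(title, url)
con.close()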
Lift structured fields at extraction time — titles, dates, authors — so agents never re-parse HTML with a regex.
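On the agent side that means loading the articles.json produced in step 5 and filtering plain fields, sketched here in Python.

import json

# Rows produced by ar-crawl extract (see step 5).
with open("articles.json") as f:
    articles = json.load(f)

# The agent works with fields, not markup: no regex over raw HTML.
recent = [a for a in articles if a["published"] >= "2026-01-01"]
for a in recent:
    print(a["published"], a["title"])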
Configure a service priority list. ar-crawl attempts in order, logs the reason for each skip, and reports which service succeeded.
Docker Compose starts the CLI and the Playwright service together. Health checks and monitoring are included.
ar-crawl.
Clean data out. No LLM calls in. Six fetch strategies behind one CLI, with JSON or SQLite waiting for your agent to read.
curl -sSf ar-crawl.sh/install | sh