Crawlee for Python: Unified Web Scraping and Browser Automation

web-scraping, python, playwright, crawler, browser-automation, data-extraction, ai-data, proxy-rotation
Originally from github.com

My notes

Summary

Crawlee is a production-grade Python library by Apify that unifies HTTP and headless-browser crawling under a single API. It handles proxy rotation, session management, retries, and persistent URL queues out of the box, making it viable for scraping JavaScript-heavy sites without custom boilerplate. It is designed to feed AI pipelines - LLMs, RAG systems, and vector stores - with structured, reliable data.

Key Insight

  • Two crawler modes in one library: BeautifulSoupCrawler for fast HTML parsing (no JS), PlaywrightCrawler for JS-heavy pages - swap between them without restructuring your code
  • Anti-bot by default: proxy rotation + session management built in; crawlers appear human-like with zero extra config
  • State persistence: if a crawl is interrupted, it resumes from where it left off - no wasted compute on large jobs
  • Asyncio-native: built on Python’s standard async library, plays well with modern async stacks (FastAPI, aiohttp, etc.)
  • uv-first: the official CLI (uvx 'crawlee[cli]' create my-crawler) uses uv for project scaffolding - uv is already the recommended Python toolchain
  • Modular extras: install only what you need (crawlee[beautifulsoup], crawlee[playwright], crawlee[all]) - keeps environments lean
  • Cloud-ready: designed to deploy on Apify platform, but runs standalone anywhere
  • Typed codebase: full type hints mean IDE autocompletion works properly and static analysis catches bugs early