Crawlee for Python: Unified Web Scraping and Browser Automation
Originally from github.com
Summary
Crawlee is a production-grade Python library by Apify that unifies HTTP and headless browser crawling under a single API. It handles proxy rotation, session management, retries, and persistent URL queues out of the box, making it viable for scraping JavaScript-heavy sites without custom boilerplate. It is designed for feeding AI pipelines - LLMs, RAG systems, and vector stores - with structured, reliable data.
Key Insights
- Two crawler modes in one lib: `BeautifulSoupCrawler` for fast HTML parsing (no JS), `PlaywrightCrawler` for JS-heavy pages - swap without restructuring your code
- Anti-bot by default: proxy rotation + session management built in; crawlers appear human-like with zero extra config
- State persistence: if a crawl is interrupted, it resumes from where it left off - no wasted compute on large jobs
- Asyncio-native: built on Python's standard async library, plays well with modern async stacks (FastAPI, aiohttp, etc.)
- uv-first: the official CLI (`uvx 'crawlee[cli]' create my-crawler`) uses uv for project scaffolding - already the recommended toolchain
- Modular extras: install only what you need (`crawlee[beautifulsoup]`, `crawlee[playwright]`, `crawlee[all]`) - keeps environments lean
- Cloud-ready: designed to deploy on the Apify platform, but runs standalone anywhere
- Typed codebase: full type hints mean IDE autocompletion works properly and static analysis catches bugs early