Crawlee for Python: Unified Web Scraping and Browser Automation

web-scraping, python, playwright, crawler, browser-automation, data-extraction, ai-data, proxy-rotation
Originally from github.com

My notes

Summary

Crawlee is a production-grade Python library by Apify that unifies HTTP and headless-browser crawling under a single API. It handles proxy rotation, session management, retries, and persistent URL queues out of the box, making it viable for scraping JavaScript-heavy sites without custom boilerplate. It is designed to feed AI pipelines - LLMs, RAG systems, and vector stores - with structured, reliable data.

Key Insight

  • Two crawler modes in one library: BeautifulSoupCrawler for fast HTML parsing (no JS), PlaywrightCrawler for JS-heavy pages - swap between them without restructuring your code
  • Anti-bot by default: proxy rotation + session management built in; crawlers appear human-like with zero extra config
  • State persistence: if a crawl is interrupted, it resumes from where it left off - no wasted compute on large jobs
  • Asyncio-native: built on Python’s standard async library, plays well with modern async stacks (FastAPI, aiohttp, etc.)
  • uv-first: the official CLI (uvx 'crawlee[cli]' create my-crawler) uses uv for project scaffolding - uv is already the recommended Python toolchain
  • Modular extras: install only what you need (crawlee[beautifulsoup], crawlee[playwright], crawlee[all]) - keeps environments lean
  • Cloud-ready: designed to deploy on Apify platform, but runs standalone anywhere
  • Typed codebase: full type hints mean IDE autocompletion works properly and static analysis catches bugs early