Crawlee

Crawlee is an open-source web scraping and browser automation library for Node.js, maintained by Apify and available on GitHub at apify/crawlee with over 22,000 stars. It provides a unified, batteries-included API for building production-grade web crawlers with automatic retries, proxy rotation, session management, request queuing, concurrency control, and anti-blocking countermeasures. Crawlee is the recommended way to build web scraping actors on the Apify platform, and most of the 19,000+ actors in the Apify Store are built with it.

Crawlee matters because web scraping at scale is riddled with engineering challenges that have nothing to do with data extraction: handling retries when pages fail to load, rotating proxies to avoid IP bans, managing cookies and sessions for stateful crawls, controlling concurrency to respect rate limits, and persisting crawl state so crashed jobs can resume. Crawlee solves all of these problems with sensible defaults while remaining fully configurable for advanced use cases. Without Crawlee, you would spend 80% of your development time on infrastructure code and 20% on actual data extraction; with Crawlee, that ratio flips.

Crawlee supports three crawler types, each optimized for different scenarios:

- CheerioCrawler downloads raw HTML via HTTP requests and parses it with Cheerio (a jQuery-like library), making it the fastest and cheapest option at 10-50 pages per second on 256-512 MB of memory.
- PlaywrightCrawler launches a real Chromium browser via Playwright for full JavaScript rendering, handling SPAs, dynamic content, and interactive pages at 1-5 pages per second on 1-4 GB of memory.
- PuppeteerCrawler is similar to PlaywrightCrawler but uses Google's Puppeteer library for Chrome automation; it is slightly older and less cross-browser than Playwright, but has a larger ecosystem of plugins.
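A practical way to choose between these crawler types is to check whether the content you want is present in the raw HTML at all. The helper below is a hypothetical sketch of that rule of thumb, not part of Crawlee's API: if the expected text is missing from the server response, the page is likely rendered client-side and needs a real browser.

```javascript
// Sketch: decide whether plain HTTP + Cheerio suffices, or whether a
// browser-based crawler (PlaywrightCrawler) is needed. `needsBrowser`
// is an illustrative helper, not a Crawlee function: if the content we
// expect is absent from the raw HTML, it was probably rendered by
// client-side JavaScript.
function needsBrowser(rawHtml, expectedText) {
  return !rawHtml.includes(expectedText);
}

// Server-rendered page: the heading is right in the HTML.
console.log(needsBrowser('<h1>Acme Widget</h1>', 'Acme Widget'));
// → false: CheerioCrawler is enough

// SPA shell: an empty root element, data arrives via JavaScript later.
console.log(needsBrowser('<div id="root"></div>', 'Acme Widget'));
// → true: use PlaywrightCrawler
```

In practice you would fetch one representative page with curl or a plain HTTP request and apply this check before committing to the heavier browser-based setup.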
To get started with Crawlee, install it with npm install crawlee, then create a crawler:

```javascript
import { CheerioCrawler } from 'crawlee';
import { Actor } from 'apify';

Actor.main(async () => {
    const crawler = new CheerioCrawler({
        requestHandler: async ({ $, request, enqueueLinks }) => {
            const title = $('h1').text();
            await Actor.pushData({ title, url: request.url });
            await enqueueLinks({ selector: 'a.pagination' });
        },
    });
    await crawler.addRequests([{ url: 'https://example.com' }]);
    await crawler.run();
});
```

This creates a crawler that extracts h1 titles from pages and follows pagination links automatically.

Compared to using Puppeteer or Playwright directly, Crawlee adds automatic request queuing with deduplication, configurable retry logic with exponential backoff, built-in proxy rotation via ProxyConfiguration, session pool management for maintaining cookies across requests, autoscaling that adjusts concurrency based on available memory and CPU, and integration with Apify storage (Datasets, Key-Value Stores, Request Queues). You could build all of this yourself, but Crawlee provides it out of the box with years of battle-testing across millions of production crawls.

Common mistakes with Crawlee include not using the right crawler type for the job: always try CheerioCrawler first, and only upgrade to PlaywrightCrawler if content is missing from the raw HTML. Another mistake is setting maxConcurrency too high, which causes memory spikes and OOM crashes; start with the default (autoscaled) and only increase it once you have confirmed the memory allocation can handle it. Not using enqueueLinks for link discovery is also common: manually parsing href attributes and adding requests takes more code and misses Crawlee's built-in URL normalization and filtering.

Crawlee also works outside Apify as a standalone library for local development. Install it, write your crawler, and run it with node src/main.js.
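The retry behavior mentioned above follows an exponential backoff pattern: each failed attempt waits roughly twice as long as the previous one, up to a cap. The sketch below illustrates the general idea; the base delay, cap, and exact formula here are illustrative assumptions, not Crawlee's internals.

```javascript
// Sketch of exponential backoff as used between retry attempts.
// The specific numbers (500 ms base, 30 s cap) are assumptions for
// illustration; Crawlee's actual retry timing is configurable and
// may differ.
function backoffMillis(retryCount, baseMillis = 500, maxMillis = 30_000) {
  // Delay doubles with each retry, capped at maxMillis.
  return Math.min(baseMillis * 2 ** retryCount, maxMillis);
}

console.log(backoffMillis(0));  // 500   — first retry after ~0.5 s
console.log(backoffMillis(3));  // 4000  — fourth retry after ~4 s
console.log(backoffMillis(10)); // 30000 — capped at 30 s
```

Capping the delay matters: without it, a request on its tenth retry would wait over eight minutes, stalling the whole crawl on a single flaky URL.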
When you are ready to deploy to the cloud, wrap it in Actor.main() and push to Apify. This local-first development workflow makes debugging faster since you can use standard Node.js debugging tools. Related concepts: Cheerio Crawler, Playwright Crawler, Request Queue, Proxy, Actor.

Related Terms