Cheerio Crawler

Cheerio Crawler (CheerioCrawler) is a Crawlee crawler type that downloads raw HTML using plain HTTP requests and parses it with Cheerio, a fast jQuery-like DOM manipulation library for Node.js. CheerioCrawler is the fastest, cheapest, and most efficient way to scrape websites on the Apify platform because it does not launch a browser: it simply fetches the HTML response and extracts data using familiar CSS selectors and DOM traversal methods.

CheerioCrawler matters because the majority of scraping targets do not require a full browser. Server-rendered pages, static websites, blog articles, news sites, product listings on traditional e-commerce platforms, government databases, directory listings, and API responses all deliver their content in the initial HTML response. For these sites, launching a browser is pure waste: it consumes 5-10x more memory, runs 10-50x slower, and costs 10-40x more in compute units.

A typical CheerioCrawler actor uses 256-512 MB of memory and processes 10-50 pages per second, compared to PlaywrightCrawler at 1-4 GB of memory and 1-5 pages per second. For a 10,000-page crawl, the cost difference can be $0.04 with Cheerio versus $1.70 with Playwright.
To build a CheerioCrawler actor:

```javascript
import { CheerioCrawler } from 'crawlee';
import { Actor } from 'apify';

Actor.main(async () => {
    const crawler = new CheerioCrawler({
        requestHandler: async ({ $, request, enqueueLinks }) => {
            // Extract fields with jQuery-style selectors.
            const title = $('h1').text().trim();
            const price = $('span.price').text().replace('$', '');
            const description = $('div.description p').text();

            // Push one item to the default dataset.
            await Actor.pushData({
                title,
                price: parseFloat(price),
                description,
                url: request.url,
                scrapedAt: new Date().toISOString(),
            });

            // Discover and enqueue pagination links.
            await enqueueLinks({ selector: 'a.next-page', label: 'LISTING' });
        },
        maxRequestRetries: 3,
        maxConcurrency: 20,
    });

    await crawler.addRequests([{ url: 'https://example.com/products' }]);
    await crawler.run();
});
```

The $ parameter is a Cheerio context that supports jQuery-style selectors such as $('h1'), $('div.class'), and $('table tr td:nth-child(2)'), plus attribute access with $('a').attr('href').

Common mistakes with CheerioCrawler include trying to scrape JavaScript-heavy single-page applications (SPAs) built with React, Vue, or Angular. These frameworks render content client-side after JavaScript execution, so the initial HTML response contains only an empty element like <div id='root'></div>. CheerioCrawler sees this empty HTML and returns no data. If your selectors are returning empty strings or null, log the raw HTML with console.log($.html()) to check whether the content is actually present. If it is not, switch to PlaywrightCrawler.

Another mistake is not handling encoding correctly. Some sites serve content in non-UTF-8 encodings (such as Shift_JIS for Japanese sites or Windows-1252 for older European sites). CheerioCrawler uses UTF-8 by default, which can produce garbled text. Set the additionalMimeTypes and forceResponseEncoding options if you encounter encoding issues.

Not using the enqueueLinks helper for pagination and link discovery is also a common inefficiency.
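The encoding fix above can be sketched as a crawler options fragment. The option names additionalMimeTypes and forceResponseEncoding are the ones mentioned above; the specific values here (Shift_JIS, text/plain) are illustrative assumptions for a hypothetical Japanese site, not defaults.

```javascript
// Sketch only: options for a site assumed to serve Shift_JIS content.
const crawlerOptions = {
    // Decode every response as Shift_JIS instead of assuming UTF-8.
    forceResponseEncoding: 'shift_jis',
    // Also parse responses served with this extra MIME type.
    additionalMimeTypes: ['text/plain'],
    maxRequestRetries: 3,
};
```

These options would be passed alongside your requestHandler when constructing the crawler, e.g. new CheerioCrawler({ ...crawlerOptions, requestHandler }).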
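To see what that helper buys you, here is a minimal sketch of the normalization enqueueLinks performs, using only Node's built-in URL class: resolving relative hrefs against the page URL, stripping fragments, and deduplicating. The normalizeLinks function is a simplified stand-in for illustration; Crawlee's real implementation also handles filtering and queue persistence.

```javascript
// Simplified stand-in for enqueueLinks' URL normalization.
function normalizeLinks(baseUrl, hrefs) {
    const seen = new Set();
    const result = [];
    for (const href of hrefs) {
        const url = new URL(href, baseUrl); // resolves relative URLs
        url.hash = '';                      // strips #fragments
        if (!seen.has(url.href)) {          // deduplicates
            seen.add(url.href);
            result.push(url.href);
        }
    }
    return result;
}

// Three raw hrefs collapse to two unique absolute URLs:
const unique = normalizeLinks('https://example.com/products', [
    '/products?page=2',
    'https://example.com/products?page=2#top',
    '/about',
]);
// unique → ['https://example.com/products?page=2', 'https://example.com/about']
```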
Manually parsing links, constructing URLs, and adding them to the queue is error-prone and misses Crawlee's built-in URL normalization (removing fragments, resolving relative URLs, deduplication). Use enqueueLinks with a globs or selector option to automatically discover and enqueue relevant URLs.

When choosing between CheerioCrawler and PlaywrightCrawler, follow this rule: always start with CheerioCrawler, and upgrade to Playwright only if the data you need is missing from the raw HTML. This approach saves compute costs by default and incurs browser overhead only when necessary. Some advanced actors use both: CheerioCrawler for listing pages (which are usually server-rendered) and PlaywrightCrawler only for detail pages that require JavaScript rendering. Crawlee supports this pattern with router.addHandler() for different request labels.

Related concepts: Playwright Crawler, Crawlee, Request Queue, Proxy, Compute Unit, Dataset.
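The listing/detail split rests on label-based dispatch: each enqueued request carries a label, and a per-label handler processes it. The miniature router below is a simplified stand-in written to illustrate that pattern, not Crawlee's actual router API (which also supports a default handler and a full crawling context).

```javascript
// Simplified stand-in for label-based request routing: handlers are
// registered per label, and each request is dispatched to the handler
// matching its label.
function createRouter() {
    const handlers = new Map();
    return {
        addHandler(label, handler) {
            handlers.set(label, handler);
        },
        async route(request) {
            const handler = handlers.get(request.label);
            if (!handler) throw new Error(`No handler for label: ${request.label}`);
            return handler(request);
        },
    };
}

// LISTING pages (server-rendered) would run cheap HTML-parsing logic,
// while DETAIL pages (JavaScript-rendered) would run browser logic.
const router = createRouter();
router.addHandler('LISTING', (req) => `cheerio:${req.url}`);
router.addHandler('DETAIL', (req) => `playwright:${req.url}`);
```

In a real hybrid actor, the LISTING handler would belong to a CheerioCrawler and the DETAIL handler to a PlaywrightCrawler, with labels assigned at enqueueLinks time as shown earlier.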

Related Terms