Dataset

An Apify Dataset is a structured, append-only storage system designed for tabular data produced by actor runs. Every actor run automatically creates a default dataset, and your code pushes results to it using Actor.pushData(). Datasets hold the primary output of your actor: scraped products, extracted contacts, crawled URLs, processed records, search results, or any array of structured JSON objects. You can export dataset contents as JSON, CSV, XML, Excel, HTML table, or RSS feed via the Apify API or Console, making it straightforward to feed actor output into downstream pipelines, spreadsheets, databases, or BI tools.

Datasets matter because they are what users actually receive and pay for when running your actor. Every record pushed to the dataset is a deliverable result, the tangible output that justifies the cost. For PPE actors, the number of dataset items often directly determines the charge (e.g., $0.05 per product scraped). For free actors, the dataset is the entire value proposition. A well-structured dataset with consistent fields, clean data, and no duplicates is the difference between a 5-star actor and one that is abandoned after first use.

To push data to a dataset from your actor code:

```javascript
await Actor.pushData({
    title: 'Product Name',
    price: 29.99,
    url: 'https://example.com/product',
    scrapedAt: new Date().toISOString(),
});
```

You can push single objects or arrays of objects. For large scrapes, push in batches of 100-1,000 items rather than one at a time (which multiplies API calls) or all at once (which risks memory overflow).

Access dataset contents via the API at GET /v2/datasets/{datasetId}/items, with query parameters for format (json, csv, xlsx), pagination (offset, limit), and field selection (fields=title,price). From the CLI, run the actor with apify call username/actor-name -i input.json, then download the results with apify dataset download --format csv.
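The batching advice above can be sketched as a small helper. This is a minimal, dependency-free illustration: the `toBatches` name and the 500-item batch size are illustrative choices, not part of the Apify SDK.

```javascript
// Split an array of results into fixed-size batches before pushing.
// Batch size of 500 sits inside the suggested 100-1,000 range.
function toBatches(items, batchSize = 500) {
    const batches = [];
    for (let i = 0; i < items.length; i += batchSize) {
        batches.push(items.slice(i, i + batchSize));
    }
    return batches;
}

// Usage inside an actor (assumes the Apify SDK's Actor is in scope):
// for (const batch of toBatches(results)) {
//     await Actor.pushData(batch);
// }
```

Each Actor.pushData() call then carries one batch, keeping API call counts low without holding the entire result set in a single request.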
Or use the Apify client library (note that listItems() resolves to a paginated object, so destructure its items property):

```javascript
const { ApifyClient } = require('apify-client');

const client = new ApifyClient({ token: 'YOUR_TOKEN' });
const { items } = await client.dataset('datasetId').listItems();
```

Common mistakes with datasets include pushing duplicate records: always deduplicate by URL or unique ID before pushing, using a Set or Map to track already-seen items within a single run. Another frequent error is pushing intermediate or debug data alongside real results, which confuses users who expect clean output; keep debug data in the Key-Value Store instead. Not defining a Dataset Schema is also problematic: without a schema there is no validation, and your actor may silently push malformed data (null fields, wrong types, missing required fields) that breaks user pipelines. Type mismatches between your code's output and the dataset schema are the number one cause of actors going under maintenance.

Dataset retention depends on your Apify plan: 7 days on the free plan, 14 days on Personal, and 30 days on Team and Enterprise plans. After the retention period, default (unnamed) datasets are automatically deleted, so users should export results promptly or use integrations (Google Sheets, Zapier, webhooks) to push data to permanent storage automatically. Named datasets (created with Actor.openDataset('my-name')) persist independently of runs and are not deleted when the retention period ends.

For large-scale operations, datasets can hold millions of items. The Apify API supports efficient pagination, so consumers can process results in chunks without loading everything into memory. The Console provides a visual table view of dataset contents with sorting, filtering, and column selection, making it easy to inspect results before downloading.

Related concepts: Dataset Schema, Key-Value Store, Actor Run, PPE, Actor.
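The Set-based deduplication described above can look like the following sketch. The `seenUrls` and `dedupe` names and the record shape are illustrative, not Apify APIs; the Set only guards against duplicates within a single run.

```javascript
// Track URLs already pushed during this run; skip records seen before.
const seenUrls = new Set();

function dedupe(records) {
    const fresh = [];
    for (const record of records) {
        if (!seenUrls.has(record.url)) {
            seenUrls.add(record.url);
            fresh.push(record);
        }
    }
    return fresh;
}

// Usage in an actor (assumes the Apify SDK's Actor is in scope):
// await Actor.pushData(dedupe(scrapedRecords));
```

Because the Set lives for the whole run, a record that reappears on a later page or crawl branch is filtered out before it ever reaches the dataset.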

Related Terms