Website Content to Markdown
Convert any website to clean Markdown for RAG pipelines, LLM training, and AI apps. Crawls pages, strips boilerplate, preserves headings, tables, and code blocks. GFM support.
Pricing
Pay Per Event model. You only pay for what you use.
| Event | Description | Price |
|---|---|---|
| page-converted | Charged per page converted to Markdown. Includes HTML-to-Markdown conversion, metadata extraction, and content filtering. | $0.02 |
Example: 100 events = $2.00 · 1,000 events = $20.00
Documentation
Website Content to Markdown crawls any website and converts its HTML pages into clean, structured Markdown — purpose-built for RAG pipelines, LLM fine-tuning, AI chatbots, and documentation archival. Give it a list of URLs and get back structured Markdown documents with navigation, ads, cookie banners, and boilerplate stripped away.
Feed the output directly into LangChain, LlamaIndex, Pinecone, Weaviate, or any vector database. The word count field on every page helps you estimate token usage before chunking. No browser, no JavaScript runtime, no setup — just clean Markdown at scale.
What data can you extract?
| Data Point | Source | Example |
|---|---|---|
| 📄 Markdown content | Converted page HTML | # Getting Started\n\nThis guide covers... |
| 🔗 Page URL | Request URL | https://docs.pinnacletech.io/guides/setup |
| 📌 Page title | OpenGraph, <title>, or <h1> | Getting Started — Pinnacle Docs |
| 📝 Meta description | OpenGraph or <meta name="description"> | Learn how to set up your Pinnacle account... |
| 🔢 Word count | Counted from Markdown output | 1,843 |
| 🌐 Language | <html lang> attribute | en |
| 🔽 Crawl depth | Hops from starting URL | 2 |
| 🕐 Crawled at | Apify runtime timestamp | 2026-03-15T09:12:44.000Z |
Why use Website Content to Markdown?
Large language models and RAG pipelines need clean text — not raw HTML packed with <nav> elements, cookie consent banners, sidebar widgets, and tracking scripts. Preparing web content for AI consumption by hand means copy-pasting from dozens of pages, reformatting manually, and re-doing the work every time the source changes. That process does not scale.
This actor automates the entire pipeline: it discovers pages through sitemap.xml and internal link following, extracts the main content using semantic HTML selectors, strips more than 30 categories of boilerplate, and converts the result to GitHub Flavored Markdown in a single run. Every page becomes a clean, consistently formatted document ready for downstream AI processing.
Beyond the conversion itself, the Apify platform gives you tools that matter at scale:
- Scheduling — run weekly or on a custom cron to keep your knowledge base snapshots current
- API access — trigger runs from Python, JavaScript, or any HTTP client and pipe results directly into your pipeline
- Proxy rotation — scrape at scale without IP blocks using Apify's built-in residential and datacenter proxy infrastructure
- Monitoring — get Slack or email alerts when runs fail or produce unexpected results
- Integrations — connect output to LangChain, LlamaIndex, Pinecone, Weaviate, Zapier, Make, or webhooks in minutes
Features
- Semantic main content extraction — tries 10 semantic selectors in priority order: `<main>`, `<article>`, `[role="main"]`, `#content`, `.content`, `.post-content`, `.entry-content`, `.article-body`, `.page-content`, `.main-content`. Uses the first match with 200+ characters of HTML to ensure the container actually holds content, not just a wrapper
- 30+ category boilerplate removal — strips navigation, site headers, page footers, sidebars, widget areas, ad units (including `.adsbygoogle`), cookie/GDPR banners, social share buttons, comment sections, modals and popups, breadcrumbs, ARIA-hidden elements, scripts, styles, iframes, and decorative SVG elements
- GitHub Flavored Markdown output — full GFM support via Turndown and turndown-plugin-gfm. ATX-style headings, fenced code blocks, inline links, tables, task lists, and strikethrough all convert correctly
- Sitemap discovery — automatically fetches and parses `/sitemap.xml` before crawling starts, including sitemap index files that reference child sitemaps. Combines sitemap-discovered URLs with the starting URL for maximum coverage
- Breadth-first link following — BFS link traversal up to 5 levels deep. The crawler resolves relative hrefs, filters to same-domain links only, skips fragment anchors (`#`), and excludes binary file extensions (jpg, png, gif, svg, css, js, pdf, zip, ico, mp4, mp3, woff, ttf, eot)
- Per-domain page limits — enforced at 1–100 pages per domain, tracked at runtime with exact URL deduplication (trailing slash normalized). The budget cap stops spending once your limit is reached
- Concurrent crawling — 10 concurrent workers at up to 120 requests/minute. A session pool with persistent cookies keeps connections stable across multi-page crawls
- Quality filter — pages that produce fewer than 50 characters of Markdown are silently skipped as near-empty. Skipped pages do not count toward the per-domain limit
- Image handling — preserves meaningful images that have alt text; strips data-URI inline images and tracking pixels. Empty anchor tags are removed
- Metadata extraction — title resolves from OpenGraph `og:title` first, then `<title>`, then the first `<h1>`. Description resolves from OpenGraph `og:description`, then `<meta name="description">`. Language code comes from the `<html lang>` attribute, lowercased and stripped of its region suffix
- Markdown cleanup — collapses three or more consecutive newlines to two, trims trailing whitespace from every line, and removes whitespace-only lines
- Proxy support — pass any Apify proxy configuration, including residential proxies for sites that block datacenter IPs
Use cases for converting websites to Markdown
RAG pipeline ingestion
AI engineers building retrieval-augmented generation systems need clean text to chunk and embed. This actor converts entire documentation sites into structured Markdown pages in a single run. The wordCount field on each record lets you estimate token cost before committing to chunking and embedding, avoiding expensive surprises downstream.
LLM fine-tuning dataset preparation
Teams preparing fine-tuning datasets for instruction-following or domain-specific models need high-quality, boilerplate-free text. This actor converts blog posts, knowledge bases, and technical documentation with all navigation and ad content removed — so training data reflects actual prose, not menu structures.
AI chatbot and knowledge base construction
Product teams building internal chatbots or customer-facing support tools need to ingest their documentation into a vector store. This actor converts company wikis, help centers, and product docs into Markdown that integrates directly with LangChain's UnstructuredMarkdownLoader and LlamaIndex's SimpleDirectoryReader.
Competitive content analysis
Marketing and strategy teams analyzing competitor websites can convert entire competitor blogs and resource libraries to Markdown, then run LLM-based content gap analysis, keyword extraction, and tone comparison — all from structured text rather than raw HTML.
Documentation archival and migration
Engineering teams migrating from legacy CMS platforms or creating offline documentation snapshots need a reliable way to extract content as portable Markdown. This actor crawls the full site and produces files ready for import into Hugo, Jekyll, Astro, or any Markdown-based documentation tool.
Content monitoring and freshness tracking
When combined with the Website Change Monitor, this actor enables a recurring pipeline: detect changed pages, re-convert them to Markdown, and update the corresponding records in your vector database or knowledge base.
How to convert a website to Markdown
- Enter your URLs — paste one or more website URLs into the "Website URLs" field. Bare domains like `pinnacletech.io` are auto-prefixed with `https://`. Use section-specific URLs like `https://docs.pinnacletech.io` to target only relevant content rather than an entire homepage.
- Set your depth and page limit — the defaults (10 pages, depth 2) work for most documentation sections. Set depth to 0 if you only need the specific pages you listed. Increase `maxPagesPerDomain` up to 100 for larger sites.
- Run the actor — click "Start" and wait. A 10-page run typically completes in under 30 seconds. A 100-page run takes 2–5 minutes depending on page size.
- Download your Markdown — open the Dataset tab and export as JSON. Each record contains the `markdown` field with clean, LLM-ready content. You can also stream results via the API as they arrive.
Input parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `urls` | string[] | Yes | — | Starting URLs to crawl. Bare domains are auto-prefixed with `https://`. One URL per line in the UI |
| `maxPagesPerDomain` | integer | No | 10 | Maximum pages to convert per domain (range: 1–100) |
| `maxCrawlDepth` | integer | No | 2 | Link-following depth from each starting URL. 0 = only the starting pages, 5 = maximum |
| `includeMetadata` | boolean | No | true | Include title, meta description, language code, and word count in each output record |
| `onlyMainContent` | boolean | No | true | Strip navigation, footers, sidebars, and ads. Extract only main article content |
| `proxyConfiguration` | object | No | Apify Proxy | Proxy settings for crawling. Defaults to Apify's datacenter proxy pool |
Input examples
Convert a documentation site (most common use case):
```json
{
  "urls": ["https://docs.pinnacletech.io"],
  "maxPagesPerDomain": 50,
  "maxCrawlDepth": 3,
  "includeMetadata": true,
  "onlyMainContent": true
}
```
Batch-convert multiple sites for a knowledge base:
```json
{
  "urls": [
    "https://docs.pinnacletech.io",
    "https://help.betaindustries.com",
    "https://support.acmecorp.com/articles"
  ],
  "maxPagesPerDomain": 25,
  "maxCrawlDepth": 2,
  "includeMetadata": true,
  "onlyMainContent": true
}
```
Single-page extraction (no link following):
```json
{
  "urls": [
    "https://blog.acmecorp.com/2026/03/product-launch-guide",
    "https://blog.acmecorp.com/2026/02/api-best-practices"
  ],
  "maxPagesPerDomain": 1,
  "maxCrawlDepth": 0,
  "includeMetadata": true,
  "onlyMainContent": true
}
```
Input tips
- Start with depth 0 for specific pages — if you already know which pages you need, list them explicitly and set `maxCrawlDepth: 0` to avoid crawling unrelated content.
- Use section-specific URLs — targeting `https://docs.acmecorp.com/api-reference` rather than `https://acmecorp.com` means the crawler starts in the right area and page limits apply to the relevant section.
- Keep "main content only" enabled for AI workflows — disabling it includes navigation and sidebar text in the Markdown, which degrades chunking quality and wastes token budget in downstream LLM calls.
- Process multiple sites in one run — the actor deduplicates by domain and tracks page limits per domain independently, so batching 10 sites in one run is more efficient than 10 separate runs.
- Check word counts before embedding — the `wordCount` field lets you filter out near-empty pages and estimate token costs before sending to an embedding API.
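The word-count filtering and token estimation can be sketched in a few lines of Python. The records below are made up for illustration; only the `wordCount` field name comes from the actor's output schema, and the 1.3 ratio is the rough word-to-token estimate used throughout this document.

```python
# Hypothetical dataset records; only the wordCount field name comes from
# the actor's output. Filter out near-empty pages, then estimate tokens.
pages = [
    {"url": "https://docs.example.com/intro", "wordCount": 843},
    {"url": "https://docs.example.com/stub", "wordCount": 12},
    {"url": "https://docs.example.com/api", "wordCount": 2105},
]

MIN_WORDS = 50         # drop near-empty pages before embedding
TOKENS_PER_WORD = 1.3  # rough word-to-token ratio suggested above

usable = [p for p in pages if p["wordCount"] >= MIN_WORDS]
total_tokens = round(sum(p["wordCount"] for p in usable) * TOKENS_PER_WORD)
print(len(usable), total_tokens)  # 2 usable pages, ~3832 tokens
```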
Output example
```json
{
  "url": "https://docs.pinnacletech.io/guides/getting-started",
  "title": "Getting Started — Pinnacle Docs",
  "description": "Everything you need to set up your first Pinnacle integration in under 10 minutes.",
  "markdown": "# Getting Started\n\nThis guide walks you through creating your first Pinnacle integration.\n\n## Prerequisites\n\nBefore you begin, make sure you have:\n\n- A Pinnacle account ([sign up free](https://pinnacletech.io/signup))\n- Node.js 18+ or Python 3.10+\n- Your API key from the [dashboard](https://dashboard.pinnacletech.io)\n\n## Step 1: Install the SDK\n\n```bash\nnpm install @pinnacle/sdk\n```\n\nOr with Python:\n\n```bash\npip install pinnacle-sdk\n```\n\n## Step 2: Initialize the client\n\n```javascript\nimport { PinnacleClient } from '@pinnacle/sdk';\n\nconst client = new PinnacleClient({\n  apiKey: process.env.PINNACLE_API_KEY\n});\n```\n\n## Next steps\n\n- [Authentication guide](/guides/auth)\n- [API reference](/api)\n- [Example projects](/examples)",
  "wordCount": 412,
  "language": "en",
  "crawlDepth": 0,
  "crawledAt": "2026-03-15T09:12:44.331Z"
}
```
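Because the `markdown` field uses ATX headings, a record like the output example above chunks naturally at heading boundaries before embedding. The sketch below is one simple approach, not part of the actor; `chunk_by_headings` is a hypothetical helper.

```python
import re

def chunk_by_headings(markdown: str) -> list[str]:
    """Split Markdown at '## ' headings so each chunk covers one section."""
    parts = re.split(r"\n(?=## )", markdown)
    return [p.strip() for p in parts if p.strip()]

# Abbreviated stand-in for the markdown field of a dataset record:
doc = ("# Getting Started\n\nIntro text.\n\n"
       "## Prerequisites\n\n- An account\n\n"
       "## Step 1\n\nInstall the SDK.")

for chunk in chunk_by_headings(doc):
    print(chunk.splitlines()[0])  # the first line of each chunk is its heading
```

Heading-boundary chunks keep each embedded vector focused on one topic, which usually retrieves better than fixed-size character windows.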
Output fields
| Field | Type | Description |
|---|---|---|
| `url` | string | Full URL of the converted page |
| `title` | string | Page title from OpenGraph `og:title`, then the `<title>` tag, then the first `<h1>`. Empty string if `includeMetadata` is false |
| `description` | string | Meta description from OpenGraph or `<meta name="description">`. Empty string if not present or `includeMetadata` is false |
| `markdown` | string | Full page content converted to GitHub Flavored Markdown. The primary output field |
| `wordCount` | integer | Word count of the Markdown text. Multiply by ~1.3 to estimate token usage for most LLMs |
| `language` | string or null | Language code from the `<html lang>` attribute, lowercased and trimmed of region suffix (e.g., "en-US" becomes "en"). Null if not set |
| `crawlDepth` | integer | Number of link hops from the starting URL. 0 means the starting page itself |
| `crawledAt` | string | ISO 8601 timestamp of when the page was crawled and converted |
How much does it cost to convert websites to Markdown?
Website Content to Markdown runs on Apify's compute-unit pricing — you pay for the compute time used, not per page. At 256 MB memory (the default), the actor is among the lowest-cost crawlers on the platform.
| Scenario | Pages converted | Estimated cost | Estimated run time |
|---|---|---|---|
| Quick test | 1 | < $0.01 | ~5 seconds |
| Small section | 10 | ~$0.01 | ~25 seconds |
| Medium documentation site | 50 | ~$0.03 | ~2 minutes |
| Large documentation site | 100 | ~$0.05–0.08 | ~4–6 minutes |
| Multi-site batch (500 pages total) | 500 | ~$0.20–0.35 | ~15–25 minutes |
You can set a maximum run time and memory limit in the run configuration to cap spending. The actor stops automatically when your page limit is reached.
Compare this to manual copy-paste preparation: a developer spending 2 minutes per page preparing 100 pages for a RAG pipeline takes over 3 hours. This actor does the same job in under 5 minutes for under $0.10.
Apify's free tier includes $5 of monthly compute credits, which covers approximately 5,000–10,000 page conversions.
Convert websites to Markdown using the API
Python
```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("ryanclinton/website-content-to-markdown").call(run_input={
    "urls": ["https://docs.pinnacletech.io"],
    "maxPagesPerDomain": 30,
    "maxCrawlDepth": 2,
    "includeMetadata": True,
    "onlyMainContent": True,
})

for page in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"{page['url']} — {page['wordCount']} words (~{int(page['wordCount'] * 1.3)} tokens)")

    # Save each page as a .md file for LangChain / LlamaIndex ingestion
    safe_name = page["url"].replace("https://", "").replace("/", "_")
    with open(f"{safe_name}.md", "w", encoding="utf-8") as f:
        f.write(page["markdown"])
```
JavaScript
```javascript
import { ApifyClient } from "apify-client";
import { writeFileSync } from "fs";

const client = new ApifyClient({ token: "YOUR_API_TOKEN" });

const run = await client.actor("ryanclinton/website-content-to-markdown").call({
    urls: ["https://docs.pinnacletech.io"],
    maxPagesPerDomain: 30,
    maxCrawlDepth: 2,
    includeMetadata: true,
    onlyMainContent: true,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();

for (const page of items) {
    console.log(`${page.url} — ${page.wordCount} words`);

    // Feed into LangChain UnstructuredMarkdownLoader or a vector database
    const safeName = page.url.replace("https://", "").replace(/\//g, "_");
    writeFileSync(`${safeName}.md`, page.markdown);
}
```
cURL
```bash
# Start the actor run
curl -X POST "https://api.apify.com/v2/acts/ryanclinton~website-content-to-markdown/runs?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://docs.pinnacletech.io"],
    "maxPagesPerDomain": 30,
    "maxCrawlDepth": 2,
    "includeMetadata": true,
    "onlyMainContent": true
  }'

# Fetch results (replace DATASET_ID with the value from the run response above)
curl "https://api.apify.com/v2/datasets/DATASET_ID/items?token=YOUR_API_TOKEN&format=json"
```
How Website Content to Markdown works
Phase 1: URL discovery and sitemap parsing
When the actor starts, it normalizes each input URL (adding https:// for bare domains, validating format) and deduplicates by hostname. For each unique starting URL, it fetches /sitemap.xml using a 10-second timeout with an ApifyBot/1.0 User-Agent. The sitemap parser handles both standard sitemaps (extracting <loc> tags) and sitemap index files (fetching the first child sitemap). URLs matching binary file extensions — jpg, png, gif, pdf, zip, mp4, xml — are excluded. The combined list of starting URLs and sitemap-discovered URLs forms the initial request queue.
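The standard-sitemap half of this phase can be approximated with Python's standard library. This is an illustrative sketch, not the actor's code (which runs in Node.js); the function name and test data are invented.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
BINARY_EXTS = (".jpg", ".png", ".gif", ".pdf", ".zip", ".mp4", ".xml")

def parse_sitemap(xml_text: str) -> list[str]:
    """Collect <loc> URLs from a standard sitemap, skipping binary files."""
    root = ET.fromstring(xml_text)
    urls = [loc.text.strip() for loc in root.iter(f"{NS}loc") if loc.text]
    return [u for u in urls if not u.lower().endswith(BINARY_EXTS)]

sitemap = """<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://docs.example.com/guide</loc></url>
  <url><loc>https://docs.example.com/logo.png</loc></url>
</urlset>"""
print(parse_sitemap(sitemap))  # ['https://docs.example.com/guide']
```

Note the namespaced `<loc>` lookup: sitemap elements live in the `sitemaps.org` XML namespace, so a bare `iter("loc")` would find nothing.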
Phase 2: Breadth-first crawling
A CheerioCrawler runs with 10 concurrent workers at up to 120 requests/minute, with session pooling and persistent cookies for stable multi-page crawls. Three retries are attempted on failure. The handler skips responses without an html Content-Type to avoid processing XML sitemaps or JSON feeds that sneak through. Per-domain page counts and visited URL sets are tracked in a shared Map<string, DomainState> that enforces both the page cap and URL deduplication (trailing slash normalized). For each successfully processed page, the handler enqueues same-domain internal links from <a href> elements, filtering out fragments, external domains, and binary file extensions, up to maxCrawlDepth levels deep using BFS with __crawlDepth userData propagation.
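The per-domain bookkeeping described above (page cap plus trailing-slash deduplication) can be sketched as follows. The names `should_crawl` and `MAX_PAGES` are invented for illustration; the actor itself implements this in Node.js with a `Map<string, DomainState>`.

```python
from urllib.parse import urlparse

MAX_PAGES = 3  # stand-in for maxPagesPerDomain

def normalize(url: str) -> str:
    # Trailing-slash normalization: /docs and /docs/ dedupe to one URL.
    return url.rstrip("/")

domain_state: dict[str, dict] = {}  # hostname -> {"seen": set, "count": int}

def should_crawl(url: str) -> bool:
    host = urlparse(url).hostname
    state = domain_state.setdefault(host, {"seen": set(), "count": 0})
    key = normalize(url)
    if key in state["seen"] or state["count"] >= MAX_PAGES:
        return False
    state["seen"].add(key)
    state["count"] += 1
    return True

urls = ["https://docs.example.com/a", "https://docs.example.com/a/",
        "https://docs.example.com/b", "https://docs.example.com/c",
        "https://docs.example.com/d"]
crawled = [u for u in urls if should_crawl(u)]
print(crawled)  # the /a/ duplicate and the over-budget /d are rejected
```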
Phase 3: Content extraction and Markdown conversion
The extraction pipeline runs in sequence for each page. First, extractContent() tries the 10 semantic selectors in order (<main>, <article>, [role="main"], #content, .content, .post-content, .entry-content, .article-body, .page-content, .main-content) and uses the first element with 200+ characters of inner HTML. A second Cheerio pass strips the 30+ non-content selectors from within the matched container. If no semantic container matches, the full <body> is used with the same stripping applied globally.
Next, htmlToMarkdown() passes the cleaned HTML to a pre-configured Turndown instance with ATX heading style, fenced code blocks (triple backtick), bullet markers, inline links, and preformattedCode: true to preserve code block whitespace. The turndown-plugin-gfm plugin adds table and strikethrough support. Two custom Turndown rules are applied: images without alt text or with data-URI sources are dropped entirely; anchor tags with empty text content are removed. The resulting Markdown is cleaned with cleanMarkdown() which collapses multiple blank lines, trims line-trailing whitespace, and strips whitespace-only lines.
Pages producing fewer than 50 characters of Markdown after the full pipeline are silently skipped — these are typically login redirect pages or thin landing pages. The final record includes the URL, title (OpenGraph > <title> > <h1>), description (OpenGraph > meta description), Markdown, word count, language, crawl depth, and ISO timestamp.
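The final cleanup step can be mimicked in a few lines of Python. This mirrors only the behavior described above; it is a re-statement for illustration, not the actor's JavaScript `cleanMarkdown()` implementation.

```python
import re

def clean_markdown(md: str) -> str:
    """Trim trailing whitespace on every line, then collapse runs of
    three or more newlines (i.e., 2+ blank lines) down to two."""
    joined = "\n".join(line.rstrip() for line in md.splitlines())
    return re.sub(r"\n{3,}", "\n\n", joined).strip()

raw = "# Title   \n\n\n\nSome text.  \n   \nMore text."
print(repr(clean_markdown(raw)))
```

Stripping each line first turns whitespace-only lines into empty ones, so the newline-collapsing regex can then remove them along with ordinary extra blank lines.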
Tips for best results
- Target section roots, not homepages. Pointing at `https://docs.acmecorp.com` rather than `https://acmecorp.com` ensures your page budget is spent on documentation rather than marketing pages. The crawler follows internal links from the starting point.
- Use depth 0 when you have a URL list. If you already know which pages to convert, list them all in `urls` and set `maxCrawlDepth: 0`. This is faster and more predictable than relying on link discovery.
- Estimate token budgets before embedding. Sum the `wordCount` values in your output and multiply by 1.3. A 100-page documentation site averaging 800 words per page produces roughly 104,000 tokens — helpful to know before choosing an embedding model.
- Disable metadata for bulk training data. If you are building a fine-tuning dataset and only need raw Markdown text, set `includeMetadata: false`. It has negligible cost impact but keeps output records leaner.
- Run on a schedule for living knowledge bases. Use Apify's scheduling to re-run this actor weekly against your source sites. Pair it with the Website Change Monitor to trigger re-conversion only when content actually changes.
- For SPAs, use the Pro version. If a site loads content through JavaScript (React, Vue, Angular apps), this actor will return the skeleton HTML, not the rendered content. See the Limitations section.
- Combine with Company Deep Research for enterprise content. Feed company website Markdown directly into the Company Deep Research actor for comprehensive intelligence reports that include the company's own published content.
- Set a proxy for rate-limited sites. Enable `proxyConfiguration` with Apify residential proxies if a site returns 429 or 403 errors during crawling. The session pool will rotate identities across requests.
Combine with other Apify actors
| Actor | How to combine |
|---|---|
| AI Training Data Curator | Convert websites to Markdown, then pass to the curator for deduplication, quality filtering, and fine-tuning dataset formatting |
| Website Change Monitor | Detect when source pages change, then trigger this actor to re-convert only the updated pages for incremental knowledge base updates |
| Company Deep Research | Convert a company's public website to Markdown and feed the content into deep research workflows for comprehensive intelligence reports |
| Website Contact Scraper | Run both actors on the same domain: one extracts contacts, the other extracts page content for enriched company profiles |
| Website Tech Stack Detector | Identify a site's technology stack first, then convert its content to Markdown — useful for contextualizing technical documentation |
| Competitor Analysis Report | Convert competitor sites to Markdown, then run competitive analysis on the structured text using LLMs |
| B2B Lead Gen Suite | Enrich lead profiles with content extracted from their company websites converted to Markdown |
Limitations
- No JavaScript rendering — the actor uses CheerioCrawler, which parses the server-delivered HTML response. Single-page applications (React, Vue, Angular, Next.js with client-side rendering) that load content via JavaScript will return an empty shell or loading spinner. For JS-rendered sites, a headless browser approach is required.
- No authenticated content — only publicly accessible pages are processed. Login walls, paywalls, and members-only content produce their gate page, not the protected content behind it.
- Same-domain crawling only — the crawler never follows links to external domains. If a site's documentation is split across multiple subdomains (e.g., `docs.acmecorp.com` and `api.acmecorp.com`), list both as separate starting URLs.
- 100-page maximum per domain — set by the input schema's `maximum` constraint. For very large sites, run multiple targeted crawls against specific sections.
- Sitemap-dependent discovery — pages that are not linked from any crawled page and not present in `sitemap.xml` will not be discovered. Orphaned pages require explicit URL input.
- No PDF or binary content — only HTML pages are converted. PDF documents, Word files, and embedded media are skipped.
- English-biased class name selectors — the semantic content selectors use English CSS class names (`.content`, `.post-content`, `.entry-content`). Sites using non-English or unusual class naming conventions may need `onlyMainContent: false` to capture all content, at the cost of including some boilerplate.
- No JavaScript execution in content — dynamically inserted content (lazy-loaded sections, infinite scroll, tab-hidden content) is not captured because it requires browser execution.
Integrations
- LangChain / LlamaIndex — use the Apify integration to load Markdown output directly into your RAG pipeline as document chunks
- Zapier — send converted Markdown pages to Notion, Google Docs, Confluence, or Slack on run completion
- Make — chain conversion runs with Airtable, HubSpot, or Slack steps in automated content workflows
- Google Sheets — export URL, title, word count, and language to a spreadsheet for content audits
- Apify API — trigger runs programmatically from CI/CD pipelines and retrieve Markdown via REST for embedding workflows
- Webhooks — receive a POST notification with the dataset URL when conversion finishes, enabling async pipeline triggers
- Vector databases (Pinecone, Weaviate, Qdrant, Chroma) — pipe the `markdown` field directly into your embedding and upsert pipeline after chunking
Troubleshooting
- Output Markdown is empty or very short — the source site likely uses JavaScript to render its content. CheerioCrawler only parses server-sent HTML. Check the page in your browser with JavaScript disabled; if you see a blank or loading page, this actor cannot process it. A headless browser alternative is required for SPAs.
- Unexpected navigation or sidebar content in the Markdown — some sites use non-standard markup without semantic HTML elements (`<main>`, `<article>`). The actor falls back to body-level stripping, which may miss some structural elements. Try disabling `onlyMainContent` and stripping the specific selectors yourself in post-processing, or provide a more specific section URL.
- Run stopped before reaching the page limit — the actor logs a warning when the per-domain `pageCount` cap is reached. Increase `maxPagesPerDomain` (up to 100) or run multiple crawls targeting different sections of the site.
- Some pages failing with 403 or 429 errors — the target site is blocking the crawler. Enable `proxyConfiguration` with `"useApifyProxy": true` and optionally set `proxyUrls` to residential proxies. The session pool will rotate IPs across requests.
- Sitemap URLs not being picked up — some sites serve their sitemap at a non-standard path (e.g., `/sitemap_index.xml` or `/sitemaps/pages.xml`). The actor only checks `/sitemap.xml`. For sites with non-standard sitemap locations, add the specific page URLs manually to the `urls` input.
Responsible use
- This actor only accesses publicly available web pages.
- Respect `robots.txt` directives and website terms of service regarding automated access.
- Do not use converted content in ways that violate the original site's copyright or content license.
- Comply with applicable data protection laws (GDPR, CCPA) when storing or processing scraped content.
- For guidance on web scraping legality, see Apify's guide.
FAQ
How do I convert a website to Markdown for a RAG pipeline?
Enter the documentation site URL, set maxPagesPerDomain to the number of pages you want, and set onlyMainContent: true. Each output record contains a markdown field ready for chunking and embedding. The wordCount field helps you estimate token counts before sending to your embedding API.
What types of websites does this actor convert best?
Text-heavy, server-rendered sites: documentation portals, developer guides, help centers, blogs, knowledge bases, and informational pages. Sites that rely on JavaScript to render their content (React SPAs, Angular apps) are not supported — use a headless browser approach for those.
Can I use the Markdown output directly with ChatGPT, Claude, or Gemini?
Yes. The Markdown format is natively understood by all major LLMs. Feed the markdown field directly into prompts, or use the word count to gauge how many pages fit within a context window (rough estimate: 1 word ≈ 1.3 tokens).
How many pages can I convert in one run?
Up to 100 pages per domain per run, across as many domains as you provide. For larger sites, run multiple targeted crawls against different sections and merge the datasets. There is no limit on the number of domains in a single run.
Does this actor follow links to other domains?
No. The crawler only follows internal links within the same domain (and subdomain) as each starting URL. If you need content from multiple domains, add each as a separate entry in the urls input.
How is this different from manually copy-pasting website content?
Manual copy-paste for 50 pages takes 2–4 hours and produces inconsistent formatting. This actor processes 50 pages in under 2 minutes, produces consistently formatted GitHub Flavored Markdown, strips all boilerplate automatically, and runs unattended on a schedule. The per-page word count and metadata fields are not available from manual copying.
How does the "main content only" mode work?
The actor tries 10 semantic HTML selectors in priority order — <main>, <article>, [role="main"], and 7 common content class names. The first matching element with 200+ characters of inner HTML is used as the content container. Non-content elements (nav, footer, sidebar, ads, etc.) are then stripped from within that container. If no semantic container is found, the full <body> is used with the same stripping applied.
Is it legal to convert website content to Markdown?
Accessing publicly available web pages is generally legal in most jurisdictions. However, you should review each target website's terms of service, respect robots.txt directives, and ensure your use of the converted content complies with copyright law. For commercial AI training use cases, some site terms explicitly restrict automated scraping. See Apify's guide on web scraping legality for a detailed overview.
Can I schedule this actor to run automatically?
Yes. Apify's scheduling feature lets you set recurring runs on a cron schedule (daily, weekly, or custom). This is ideal for keeping documentation snapshots current or monitoring competitor content.
What happens to pages that fail to load?
Failed requests are retried up to 3 times with exponential backoff. If still failing after retries, the page is logged as a warning and skipped. Skipped pages do not count toward the per-domain page limit, so your budget is not wasted on failures.
How is this different from Apify's Website Content Crawler?
Both convert web pages to text, but this actor is a lightweight, cost-efficient solution for straightforward HTML sites. It uses CheerioCrawler (no browser, ~256 MB memory) and outputs structured JSON with per-page metadata. Apify's Website Content Crawler uses a full browser and supports JavaScript rendering but runs at higher cost. Choose this actor for static and server-rendered sites; choose a browser-based solution for SPAs.
Can I use this actor's output with LangChain or LlamaIndex?
Yes. The markdown field integrates directly with LangChain's UnstructuredMarkdownLoader and LlamaIndex's SimpleDirectoryReader. Apify also provides a native LangChain integration that loads dataset items as LangChain documents without any custom code.
Help us improve
If you encounter issues, you can help us debug faster by enabling run sharing in your Apify account:
- Go to Account Settings > Privacy
- Enable Share runs with public Actor creators
This lets us see your run details when something goes wrong, so we can fix issues faster. Your data is only visible to the actor developer, not publicly.
Support
Found a bug or have a feature request? Open an issue in the Issues tab on this actor's page. For custom solutions or enterprise integrations, reach out through the Apify platform.