
Website Content to Markdown

Convert any website to clean Markdown for RAG pipelines, LLM training, and AI apps. Crawls pages, strips boilerplate, preserves headings, tables, and code blocks. GFM support.

Try on Apify Store · $0.02 per event · Actively maintained

Users (30d): 5
Runs (30d): 35

Maintenance Pulse

Score: 93/100
Last build: 1d ago
Last version: 4d ago
Builds (30d): 8
Issue response: 6h avg


Pricing

Pay Per Event model. You only pay for what you use.

| Event | Description | Price |
| --- | --- | --- |
| page-converted | Charged per page converted to Markdown. Includes HTML-to-Markdown conversion, metadata extraction, and content filtering. | $0.02 |

Example: 100 events = $2.00 · 1,000 events = $20.00
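
The flat per-event rate makes cost projection a one-liner. A minimal sketch (the $0.02 rate is copied from the pricing table above):

```python
PRICE_PER_EVENT = 0.02  # USD per page-converted event, from the pricing table

def estimated_cost(pages: int) -> float:
    """Projected cost of a run that converts `pages` pages."""
    return round(pages * PRICE_PER_EVENT, 2)
```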

Documentation

Website Content to Markdown crawls any website and converts its HTML pages into clean, structured Markdown — purpose-built for RAG pipelines, LLM fine-tuning, AI chatbots, and documentation archival. Give it a list of URLs and get back structured Markdown documents with navigation, ads, cookie banners, and boilerplate stripped away.

Feed the output directly into LangChain, LlamaIndex, Pinecone, Weaviate, or any vector database. The word count field on every page helps you estimate token usage before chunking. No browser, no JavaScript runtime, no setup — just clean Markdown at scale.

What data can you extract?

| Data Point | Source | Example |
| --- | --- | --- |
| 📄 Markdown content | Converted page HTML | `# Getting Started\n\nThis guide covers...` |
| 🔗 Page URL | Request URL | https://docs.pinnacletech.io/guides/setup |
| 📌 Page title | OpenGraph, `<title>`, or `<h1>` | Getting Started — Pinnacle Docs |
| 📝 Meta description | OpenGraph or `<meta name="description">` | Learn how to set up your Pinnacle account... |
| 🔢 Word count | Counted from Markdown output | 1,843 |
| 🌐 Language | `<html lang>` attribute | en |
| 🔽 Crawl depth | Hops from starting URL | 2 |
| 🕐 Crawled at | Apify runtime timestamp | 2026-03-15T09:12:44.000Z |

Why use Website Content to Markdown?

Large language models and RAG pipelines need clean text — not raw HTML packed with <nav> elements, cookie consent banners, sidebar widgets, and tracking scripts. Preparing web content for AI consumption by hand means copy-pasting from dozens of pages, reformatting manually, and re-doing the work every time the source changes. That process does not scale.

This actor automates the entire pipeline: it discovers pages through sitemap.xml and internal link following, extracts the main content using semantic HTML selectors, strips more than 30 categories of boilerplate, and converts the result to GitHub Flavored Markdown in a single run. Every page becomes a clean, consistently formatted document ready for downstream AI processing.

Beyond the conversion itself, the Apify platform gives you tools that matter at scale:

  • Scheduling — run weekly or on a custom cron to keep your knowledge base snapshots current
  • API access — trigger runs from Python, JavaScript, or any HTTP client and pipe results directly into your pipeline
  • Proxy rotation — scrape at scale without IP blocks using Apify's built-in residential and datacenter proxy infrastructure
  • Monitoring — get Slack or email alerts when runs fail or produce unexpected results
  • Integrations — connect output to LangChain, LlamaIndex, Pinecone, Weaviate, Zapier, Make, or webhooks in minutes

Features

  • Semantic main content extraction — tries 10 semantic selectors in priority order: <main>, <article>, [role="main"], #content, .content, .post-content, .entry-content, .article-body, .page-content, .main-content. Uses the first match with 200+ characters of HTML to ensure the container actually holds content, not just a wrapper
  • 30+ category boilerplate removal — strips navigation, site headers, page footers, sidebars, widget areas, ad units (including .adsbygoogle), cookie/GDPR banners, social share buttons, comment sections, modals and popups, breadcrumbs, ARIA-hidden elements, scripts, styles, iframes, and SVG decorative elements
  • GitHub Flavored Markdown output — full GFM support via Turndown and turndown-plugin-gfm. ATX-style headings, fenced code blocks, inline links, tables, task lists, and strikethrough all convert correctly
  • Sitemap discovery — automatically fetches and parses /sitemap.xml before crawling starts, including sitemap index files that reference child sitemaps. Combines sitemap-discovered URLs with the starting URL for maximum coverage
  • Breadth-first link following — BFS link traversal up to 5 levels deep. The crawler resolves relative hrefs, filters same-domain links only, skips fragment anchors (#), and excludes binary file extensions (jpg, png, gif, svg, css, js, pdf, zip, ico, mp4, mp3, woff, ttf, eot)
  • Per-domain page limits — enforced at 1–100 pages per domain, tracked at runtime with exact URL deduplication (trailing slash normalized). The budget cap stops spending once your limit is reached
  • Concurrent crawling — 10 concurrent workers at up to 120 requests/minute. Session pool with persistent cookies ensures stable connections across multi-page crawls
  • Quality filter — pages that produce fewer than 50 characters of Markdown are silently skipped as near-empty. Skipped pages do not count toward the per-domain limit
  • Image handling — preserves meaningful images that have alt text. Strips data-URI inline images and tracking pixels. Empty anchor tags are removed
  • Metadata extraction — title resolves from OpenGraph og:title first, then <title>, then first <h1>. Description resolves from OpenGraph og:description, then <meta name="description">. Language code from <html lang>, lowercased and stripped of region suffix
  • Markdown cleanup — collapses three or more consecutive newlines to two, trims trailing whitespace from every line, removes lines that contain only whitespace
  • Proxy support — pass any Apify proxy configuration including residential proxies for sites that block datacenter IPs
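
The Markdown cleanup rules listed above (trim trailing whitespace, collapse blank lines) are simple enough to sketch; this is an illustrative Python reimplementation, not the actor's actual code:

```python
import re

def clean_markdown(text: str) -> str:
    """Sketch of the cleanup pass: trim trailing whitespace on every line
    (whitespace-only lines become empty), then collapse runs of three or
    more consecutive newlines down to two."""
    text = "\n".join(line.rstrip() for line in text.split("\n"))
    return re.sub(r"\n{3,}", "\n\n", text).strip()
```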

Use cases for converting websites to Markdown

RAG pipeline ingestion

AI engineers building retrieval-augmented generation systems need clean text to chunk and embed. This actor converts entire documentation sites into structured Markdown pages in a single run. The wordCount field on each record lets you estimate token cost before committing to chunking and embedding, avoiding expensive surprises downstream.

LLM fine-tuning dataset preparation

Teams preparing fine-tuning datasets for instruction-following or domain-specific models need high-quality, boilerplate-free text. This actor converts blog posts, knowledge bases, and technical documentation with all navigation and ad content removed — so training data reflects actual prose, not menu structures.

AI chatbot and knowledge base construction

Product teams building internal chatbots or customer-facing support tools need to ingest their documentation into a vector store. This actor converts company wikis, help centers, and product docs into Markdown that integrates directly with LangChain's UnstructuredMarkdownLoader and LlamaIndex's SimpleDirectoryReader.

Competitive content analysis

Marketing and strategy teams analyzing competitor websites can convert entire competitor blogs and resource libraries to Markdown, then run LLM-based content gap analysis, keyword extraction, and tone comparison — all from structured text rather than raw HTML.

Documentation archival and migration

Engineering teams migrating from legacy CMS platforms or creating offline documentation snapshots need a reliable way to extract content as portable Markdown. This actor crawls the full site and produces files ready for import into Hugo, Jekyll, Astro, or any Markdown-based documentation tool.

Content monitoring and freshness tracking

When combined with the Website Change Monitor, this actor enables a recurring pipeline: detect changed pages, re-convert them to Markdown, and update the corresponding records in your vector database or knowledge base.

How to convert a website to Markdown

  1. Enter your URLs — Paste one or more website URLs into the "Website URLs" field. Bare domains like pinnacletech.io are auto-prefixed with https://. Use section-specific URLs like https://docs.pinnacletech.io to target only relevant content rather than an entire homepage.
  2. Set your depth and page limit — The defaults (10 pages, depth 2) work for most documentation sections. Set depth to 0 if you only need the specific pages you listed. Increase maxPagesPerDomain up to 100 for larger sites.
  3. Run the actor — Click "Start" and wait. A 10-page run typically completes in under 30 seconds. A 100-page run takes 2–5 minutes depending on page size.
  4. Download your Markdown — Open the Dataset tab and export as JSON. Each record contains the markdown field with clean, LLM-ready content. You can also stream results via the API as they arrive.

Input parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| urls | string[] | Yes | (none) | Starting URLs to crawl. Bare domains are auto-prefixed with https://. One URL per line in the UI |
| maxPagesPerDomain | integer | No | 10 | Maximum pages to convert per domain (range: 1–100) |
| maxCrawlDepth | integer | No | 2 | Link-following depth from each starting URL. 0 = only the starting pages, 5 = maximum |
| includeMetadata | boolean | No | true | Include title, meta description, language code, and word count in each output record |
| onlyMainContent | boolean | No | true | Strip navigation, footers, sidebars, and ads. Extract only main article content |
| proxyConfiguration | object | No | Apify Proxy | Proxy settings for crawling. Defaults to Apify's datacenter proxy pool |

Input examples

Convert a documentation site (most common use case):

{
  "urls": ["https://docs.pinnacletech.io"],
  "maxPagesPerDomain": 50,
  "maxCrawlDepth": 3,
  "includeMetadata": true,
  "onlyMainContent": true
}

Batch-convert multiple sites for a knowledge base:

{
  "urls": [
    "https://docs.pinnacletech.io",
    "https://help.betaindustries.com",
    "https://support.acmecorp.com/articles"
  ],
  "maxPagesPerDomain": 25,
  "maxCrawlDepth": 2,
  "includeMetadata": true,
  "onlyMainContent": true
}

Single-page extraction (no link following):

{
  "urls": [
    "https://blog.acmecorp.com/2026/03/product-launch-guide",
    "https://blog.acmecorp.com/2026/02/api-best-practices"
  ],
  "maxPagesPerDomain": 1,
  "maxCrawlDepth": 0,
  "includeMetadata": true,
  "onlyMainContent": true
}

Input tips

  • Start with depth 0 for specific pages — if you already know which pages you need, list them explicitly and set maxCrawlDepth: 0 to avoid crawling unrelated content.
  • Use section-specific URLs — targeting https://docs.acmecorp.com/api-reference rather than https://acmecorp.com means the crawler starts in the right area and page limits apply to the relevant section.
  • Keep "main content only" enabled for AI workflows — disabling it includes navigation and sidebar text in the Markdown, which degrades chunking quality and wastes token budget in downstream LLM calls.
  • Process multiple sites in one run — the actor deduplicates by domain and tracks page limits per domain independently, so batching 10 sites in one run is more efficient than 10 separate runs.
  • Check word counts before embedding — the wordCount field lets you filter out near-empty pages and estimate token costs before sending to an embedding API.

Output example

{
  "url": "https://docs.pinnacletech.io/guides/getting-started",
  "title": "Getting Started — Pinnacle Docs",
  "description": "Everything you need to set up your first Pinnacle integration in under 10 minutes.",
  "markdown": "# Getting Started\n\nThis guide walks you through creating your first Pinnacle integration.\n\n## Prerequisites\n\nBefore you begin, make sure you have:\n\n- A Pinnacle account ([sign up free](https://pinnacletech.io/signup))\n- Node.js 18+ or Python 3.10+\n- Your API key from the [dashboard](https://dashboard.pinnacletech.io)\n\n## Step 1: Install the SDK\n\n```bash\nnpm install @pinnacle/sdk\n```\n\nOr with Python:\n\n```bash\npip install pinnacle-sdk\n```\n\n## Step 2: Initialize the client\n\n```javascript\nimport { PinnacleClient } from '@pinnacle/sdk';\n\nconst client = new PinnacleClient({\n  apiKey: process.env.PINNACLE_API_KEY\n});\n```\n\n## Next steps\n\n- [Authentication guide](/guides/auth)\n- [API reference](/api)\n- [Example projects](/examples)",
  "wordCount": 412,
  "language": "en",
  "crawlDepth": 0,
  "crawledAt": "2026-03-15T09:12:44.331Z"
}

Output fields

| Field | Type | Description |
| --- | --- | --- |
| url | string | Full URL of the converted page |
| title | string | Page title from OpenGraph og:title, then `<title>` tag, then first `<h1>`. Empty string if includeMetadata is false |
| description | string | Meta description from OpenGraph or `<meta name="description">`. Empty string if not present or includeMetadata is false |
| markdown | string | Full page content converted to GitHub Flavored Markdown. The primary output field |
| wordCount | integer | Word count of the Markdown text. Multiply by ~1.3 to estimate token usage for most LLMs |
| language | string or null | Language code from `<html lang>` attribute, lowercased and trimmed of region suffix (e.g., "en-US" becomes "en"). Null if not set |
| crawlDepth | integer | Number of link hops from the starting URL. 0 means the starting page itself |
| crawledAt | string | ISO 8601 timestamp of when the page was crawled and converted |

How much does it cost to convert websites to Markdown?

Website Content to Markdown uses Pay Per Event pricing: a flat $0.02 per page successfully converted, regardless of run time or memory. Pages that fail to load or are skipped by the quality filter are not charged.

| Scenario | Pages converted | Estimated cost | Estimated run time |
| --- | --- | --- | --- |
| Quick test | 1 | $0.02 | ~5 seconds |
| Small section | 10 | $0.20 | ~25 seconds |
| Medium documentation site | 50 | $1.00 | ~2 minutes |
| Large documentation site | 100 | $2.00 | ~4–6 minutes |
| Multi-site batch (500 pages total) | 500 | $10.00 | ~15–25 minutes |

You can cap spending directly through maxPagesPerDomain: the actor stops converting, and stops charging, as soon as your page limit is reached.

Compare this to manual copy-paste preparation: a developer spending 2 minutes per page preparing 100 pages for a RAG pipeline takes over 3 hours. This actor does the same job in under 5 minutes for $2.00.

Apify's free tier includes $5 of monthly platform credit, enough for roughly 250 page conversions.

Convert websites to Markdown using the API

Python

from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("ryanclinton/website-content-to-markdown").call(run_input={
    "urls": ["https://docs.pinnacletech.io"],
    "maxPagesPerDomain": 30,
    "maxCrawlDepth": 2,
    "includeMetadata": True,
    "onlyMainContent": True,
})

for page in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"{page['url']} — {page['wordCount']} words (~{int(page['wordCount'] * 1.3)} tokens)")
    # Save each page as a .md file for LangChain / LlamaIndex ingestion
    safe_name = page["url"].replace("https://", "").replace("/", "_")
    with open(f"{safe_name}.md", "w") as f:
        f.write(page["markdown"])

JavaScript

import { ApifyClient } from "apify-client";
import { writeFileSync } from "fs";

const client = new ApifyClient({ token: "YOUR_API_TOKEN" });

const run = await client.actor("ryanclinton/website-content-to-markdown").call({
    urls: ["https://docs.pinnacletech.io"],
    maxPagesPerDomain: 30,
    maxCrawlDepth: 2,
    includeMetadata: true,
    onlyMainContent: true,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
for (const page of items) {
    console.log(`${page.url} — ${page.wordCount} words`);
    // Feed into LangChain UnstructuredMarkdownLoader or a vector database
    const safeName = page.url.replace("https://", "").replace(/\//g, "_");
    writeFileSync(`${safeName}.md`, page.markdown);
}

cURL

# Start the actor run
curl -X POST "https://api.apify.com/v2/acts/ryanclinton~website-content-to-markdown/runs?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://docs.pinnacletech.io"],
    "maxPagesPerDomain": 30,
    "maxCrawlDepth": 2,
    "includeMetadata": true,
    "onlyMainContent": true
  }'

# Fetch results (replace DATASET_ID from the run response above)
curl "https://api.apify.com/v2/datasets/DATASET_ID/items?token=YOUR_API_TOKEN&format=json"

How Website Content to Markdown works

Phase 1: URL discovery and sitemap parsing

When the actor starts, it normalizes each input URL (adding https:// for bare domains, validating format) and deduplicates by hostname. For each unique starting URL, it fetches /sitemap.xml using a 10-second timeout with an ApifyBot/1.0 User-Agent. The sitemap parser handles both standard sitemaps (extracting <loc> tags) and sitemap index files (fetching the first child sitemap). URLs matching binary file extensions — jpg, png, gif, pdf, zip, mp4, xml — are excluded. The combined list of starting URLs and sitemap-discovered URLs forms the initial request queue.

Phase 2: Breadth-first crawling

A CheerioCrawler runs with 10 concurrent workers at up to 120 requests/minute, with session pooling and persistent cookies for stable multi-page crawls. Three retries are attempted on failure. The handler skips responses without an html Content-Type to avoid processing XML sitemaps or JSON feeds that sneak through. Per-domain page counts and visited URL sets are tracked in a shared Map<string, DomainState> that enforces both the page cap and URL deduplication (trailing slash normalized). For each successfully processed page, the handler enqueues same-domain internal links from <a href> elements, filtering out fragments, external domains, and binary file extensions, up to maxCrawlDepth levels deep using BFS with __crawlDepth userData propagation.
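
The enqueue filter in Phase 2 can be sketched as a pure function (names like `should_enqueue` are illustrative, not the actor's real identifiers):

```python
from urllib.parse import urljoin, urlparse, urldefrag

BINARY_EXTS = {"jpg", "png", "gif", "svg", "css", "js", "pdf", "zip",
               "ico", "mp4", "mp3", "woff", "ttf", "eot"}

def should_enqueue(href, base_url, visited, depth, max_depth):
    """Return the normalized absolute URL if the link passes the filters
    described above, else None. Illustrative sketch of the crawler's logic."""
    if depth >= max_depth:                           # BFS depth cap
        return None
    url, _ = urldefrag(urljoin(base_url, href))      # resolve relative href, drop #fragment
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return None
    if parsed.netloc != urlparse(base_url).netloc:   # same-domain links only
        return None
    ext = parsed.path.rsplit(".", 1)[-1].lower() if "." in parsed.path else ""
    if ext in BINARY_EXTS:                           # skip binary file extensions
        return None
    normalized = url.rstrip("/")                     # trailing-slash dedup
    if normalized in visited:
        return None
    visited.add(normalized)
    return normalized
```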

Phase 3: Content extraction and Markdown conversion

The extraction pipeline runs in sequence for each page. First, extractContent() tries the 10 semantic selectors in order (<main>, <article>, [role="main"], #content, .content, .post-content, .entry-content, .article-body, .page-content, .main-content) and uses the first element with 200+ characters of inner HTML. A second Cheerio pass strips the 30+ non-content selectors from within the matched container. If no semantic container matches, the full <body> is used with the same stripping applied globally.
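
The first-match-over-threshold scan reduces to a few lines. A sketch, with per-selector matches pre-collected into a dict for illustration:

```python
SEMANTIC_SELECTORS = [
    "main", "article", '[role="main"]', "#content", ".content",
    ".post-content", ".entry-content", ".article-body",
    ".page-content", ".main-content",
]

def pick_container(matches, min_chars=200):
    """Return the inner HTML of the first selector (in priority order)
    whose content meets the length threshold, else None."""
    for selector in SEMANTIC_SELECTORS:
        html = matches.get(selector, "")
        if len(html) >= min_chars:
            return html
    return None
```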

Next, htmlToMarkdown() passes the cleaned HTML to a pre-configured Turndown instance with ATX heading style, fenced code blocks (triple backtick), bullet markers, inline links, and preformattedCode: true to preserve code block whitespace. The turndown-plugin-gfm plugin adds table and strikethrough support. Two custom Turndown rules are applied: images without alt text or with data-URI sources are dropped entirely; anchor tags with empty text content are removed. The resulting Markdown is cleaned with cleanMarkdown() which collapses multiple blank lines, trims line-trailing whitespace, and strips whitespace-only lines.

Pages producing fewer than 50 characters of Markdown after the full pipeline are silently skipped — these are typically login redirect pages or thin landing pages. The final record includes the URL, title (OpenGraph > <title> > <h1>), description (OpenGraph > meta description), Markdown, word count, language, crawl depth, and ISO timestamp.
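
The metadata fallback chains in the final step reduce to simple helpers; a sketch of the title and language resolution (illustrative, not the actor's actual code):

```python
def resolve_title(og_title, title_tag, first_h1):
    """Title fallback: OpenGraph og:title, then <title>, then first <h1>;
    empty string if none is present."""
    for candidate in (og_title, title_tag, first_h1):
        if candidate and candidate.strip():
            return candidate.strip()
    return ""

def normalize_lang(lang_attr):
    """Lowercase the <html lang> value and strip the region suffix,
    so 'en-US' becomes 'en'. None if the attribute is missing."""
    if not lang_attr:
        return None
    return lang_attr.strip().split("-")[0].lower() or None
```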

Tips for best results

  1. Target section roots, not homepages. Pointing at https://docs.acmecorp.com rather than https://acmecorp.com ensures your page budget is spent on documentation rather than marketing pages. The crawler follows internal links from the starting point.

  2. Use depth 0 when you have a URL list. If you already know which pages to convert, list them all in urls and set maxCrawlDepth: 0. This is faster and more predictable than relying on link discovery.

  3. Estimate token budgets before embedding. Sum the wordCount values in your output and multiply by 1.3. A 100-page documentation site averaging 800 words per page produces roughly 104,000 tokens — helpful to know before choosing an embedding model.

  4. Disable metadata for bulk training data. If you are building a fine-tuning dataset and only need raw Markdown text, set includeMetadata: false. It has negligible cost impact but keeps output records leaner.

  5. Run on a schedule for living knowledge bases. Use Apify's scheduling to re-run this actor weekly against your source sites. Pair it with the Website Change Monitor to trigger re-conversion only when content actually changes.

  6. For SPAs, use the Pro version. If a site loads content through JavaScript (React, Vue, Angular apps), this actor will return the skeleton HTML, not the rendered content. See the Limitations section.

  7. Combine with Company Deep Research for enterprise content. Feed company website Markdown directly into the Company Deep Research actor for comprehensive intelligence reports that include the company's own published content.

  8. Set proxy for rate-limited sites. Enable proxyConfiguration with Apify residential proxies if a site returns 429 or 403 errors during crawling. The session pool will rotate identities across requests.
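
The token-budget arithmetic from tip 3 as a helper function (assuming the common 1.3 tokens-per-word rule of thumb):

```python
def estimate_tokens(pages, tokens_per_word=1.3):
    """Rough token estimate from the wordCount field of each output record."""
    return int(sum(p.get("wordCount", 0) for p in pages) * tokens_per_word)
```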

Combine with other Apify actors

| Actor | How to combine |
| --- | --- |
| AI Training Data Curator | Convert websites to Markdown, then pass to the curator for deduplication, quality filtering, and fine-tuning dataset formatting |
| Website Change Monitor | Detect when source pages change, then trigger this actor to re-convert only the updated pages for incremental knowledge base updates |
| Company Deep Research | Convert a company's public website to Markdown and feed the content into deep research workflows for comprehensive intelligence reports |
| Website Contact Scraper | Run both actors on the same domain: one extracts contacts, the other extracts page content for enriched company profiles |
| Website Tech Stack Detector | Identify a site's technology stack first, then convert its content to Markdown — useful for contextualizing technical documentation |
| Competitor Analysis Report | Convert competitor sites to Markdown, then run competitive analysis on the structured text using LLMs |
| B2B Lead Gen Suite | Enrich lead profiles with content extracted from their company websites converted to Markdown |

Limitations

  • No JavaScript rendering — the actor uses CheerioCrawler, which parses the server-delivered HTML response. Single-page applications (React, Vue, Angular, Next.js with client-side rendering) that load content via JavaScript will return an empty shell or loading spinner. For JS-rendered sites, a headless browser approach is required.
  • No authenticated content — only publicly accessible pages are processed. Login walls, paywalls, and members-only content produce their gate page, not the protected content behind it.
  • Same-domain crawling only — the crawler never follows links to external domains. If a site's documentation is split across multiple subdomains (e.g., docs.acmecorp.com and api.acmecorp.com), list both as separate starting URLs.
  • 100-page maximum per domain — set by the input schema's maximum constraint. For very large sites, run multiple targeted crawls against specific sections.
  • Sitemap-dependent discovery — pages that are not linked from any crawled page and not present in sitemap.xml will not be discovered. Orphaned pages require explicit URL input.
  • No PDF or binary content — only HTML pages are converted. PDF documents, Word files, and embedded media are skipped.
  • English-biased class name selectors — the semantic content selectors use English CSS class names (.content, .post-content, .entry-content). Sites using non-English or unusual class naming conventions may need onlyMainContent: false to capture all content, at the cost of including some boilerplate.
  • No JavaScript execution in content — dynamically inserted content (lazy-loaded sections, infinite scroll, tab-hidden content) is not captured because it requires browser execution.

Integrations

  • LangChain / LlamaIndex — use the Apify integration to load Markdown output directly into your RAG pipeline as document chunks
  • Zapier — send converted Markdown pages to Notion, Google Docs, Confluence, or Slack on run completion
  • Make — chain conversion runs with Airtable, HubSpot, or Slack steps in automated content workflows
  • Google Sheets — export URL, title, word count, and language to a spreadsheet for content audits
  • Apify API — trigger runs programmatically from CI/CD pipelines and retrieve Markdown via REST for embedding workflows
  • Webhooks — receive a POST notification with the dataset URL when conversion finishes, enabling async pipeline triggers
  • Vector databases (Pinecone, Weaviate, Qdrant, Chroma) — pipe the markdown field directly into your embedding and upsert pipeline after chunking
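
Before upserting into a vector database, the markdown field usually needs chunking. A minimal paragraph-based chunker (illustrative; production pipelines typically use a tokenizer-aware splitter instead):

```python
def chunk_markdown(markdown, max_words=300):
    """Split on blank lines, then pack paragraphs into chunks of at most
    max_words words (a single oversized paragraph still becomes its own chunk)."""
    chunks, current, count = [], [], 0
    for para in markdown.split("\n\n"):
        words = len(para.split())
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```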

Troubleshooting

  • Output Markdown is empty or very short — the source site likely uses JavaScript to render its content. CheerioCrawler only parses server-sent HTML. Check the page in your browser with JavaScript disabled; if you see a blank or loading page, this actor cannot process it. A headless browser alternative is required for SPAs.

  • Getting unexpected navigation or sidebar content in Markdown — some sites use non-standard markup without semantic HTML elements (<main>, <article>). The actor falls back to body-level stripping, which may miss some structural elements. Try disabling onlyMainContent and stripping the specific selectors yourself in post-processing, or provide a more specific section URL.

  • Run stopped before reaching the page limit — the actor logs a warning when the per-domain pageCount cap is reached. Increase maxPagesPerDomain (up to 100) or run multiple crawls targeting different sections of the site.

  • Some pages failing with 403 or 429 errors — the target site is blocking the crawler. Enable proxyConfiguration with "useApifyProxy": true and optionally set proxyUrls to residential proxies. The session pool will rotate IPs across requests.

  • Sitemap URLs not being picked up — some sites serve their sitemap at a non-standard path (e.g., /sitemap_index.xml or /sitemaps/pages.xml). The actor only checks /sitemap.xml. For sites with non-standard sitemap locations, add the specific page URLs manually to the urls input.

Responsible use

  • This actor only accesses publicly available web pages.
  • Respect robots.txt directives and website terms of service regarding automated access.
  • Do not use converted content in ways that violate the original site's copyright or content license.
  • Comply with applicable data protection laws (GDPR, CCPA) when storing or processing scraped content.
  • For guidance on web scraping legality, see Apify's guide.

FAQ

How do I convert a website to Markdown for a RAG pipeline? Enter the documentation site URL, set maxPagesPerDomain to the number of pages you want, and set onlyMainContent: true. Each output record contains a markdown field ready for chunking and embedding. The wordCount field helps you estimate token counts before sending to your embedding API.

What types of websites does this actor convert best? Text-heavy, server-rendered sites: documentation portals, developer guides, help centers, blogs, knowledge bases, and informational pages. Sites that rely on JavaScript to render their content (React SPAs, Angular apps) are not supported — use a headless browser approach for those.

Can I use the Markdown output directly with ChatGPT, Claude, or Gemini? Yes. The Markdown format is natively understood by all major LLMs. Feed the markdown field directly into prompts, or use the word count to gauge how many pages fit within a context window (rough estimate: 1 word ≈ 1.3 tokens).

How many pages can I convert in one run? Up to 100 pages per domain per run, across as many domains as you provide. For larger sites, run multiple targeted crawls against different sections and merge the datasets. There is no limit on the number of domains in a single run.

Does this actor follow links to other domains? No. The crawler only follows internal links within the same domain (and subdomain) as each starting URL. If you need content from multiple domains, add each as a separate entry in the urls input.

How is this different from manually copy-pasting website content? Manual copy-paste for 50 pages takes 2–4 hours and produces inconsistent formatting. This actor processes 50 pages in under 2 minutes, produces consistently formatted GitHub Flavored Markdown, strips all boilerplate automatically, and runs unattended on a schedule. The per-page word count and metadata fields are not available from manual copying.

How does the "main content only" mode work? The actor tries 10 semantic HTML selectors in priority order — <main>, <article>, [role="main"], and 7 common content class names. The first matching element with 200+ characters of inner HTML is used as the content container. Non-content elements (nav, footer, sidebar, ads, etc.) are then stripped from within that container. If no semantic container is found, the full <body> is used with the same stripping applied.

Is it legal to convert website content to Markdown? Accessing publicly available web pages is generally legal in most jurisdictions. However, you should review each target website's terms of service, respect robots.txt directives, and ensure your use of the converted content complies with copyright law. For commercial AI training use cases, some site terms explicitly restrict automated scraping. See Apify's guide on web scraping legality for a detailed overview.

Can I schedule this actor to run automatically? Yes. Apify's scheduling feature lets you set recurring runs on a cron schedule (daily, weekly, or custom). This is ideal for keeping documentation snapshots current or monitoring competitor content.

What happens to pages that fail to load? Failed requests are retried up to 3 times with exponential backoff. If still failing after retries, the page is logged as a warning and skipped. Skipped pages do not count toward the per-domain page limit, so your budget is not wasted on failures.

How is this different from Apify's Website Content Crawler? Both convert web pages to text, but this actor is a lightweight, cost-efficient solution for straightforward HTML sites. It uses CheerioCrawler (no browser, ~256 MB memory) and outputs structured JSON with per-page metadata. Apify's Website Content Crawler uses a full browser and supports JavaScript rendering but runs at higher cost. Choose this actor for static and server-rendered sites; choose a browser-based solution for SPAs.

Can I use this actor's output with LangChain or LlamaIndex? Yes. The markdown field integrates directly with LangChain's UnstructuredMarkdownLoader and LlamaIndex's SimpleDirectoryReader. Apify also provides a native LangChain integration that loads dataset items as LangChain documents without any custom code.

Help us improve

If you encounter issues, you can help us debug faster by enabling run sharing in your Apify account:

  1. Go to Account Settings > Privacy
  2. Enable Share runs with public Actor creators

This lets us see your run details when something goes wrong, so we can fix issues faster. Your data is only visible to the actor developer, not publicly.

Support

Found a bug or have a feature request? Open an issue in the Issues tab on this actor's page. For custom solutions or enterprise integrations, reach out through the Apify platform.

How it works

  1. Configure: set your parameters in the Apify Console or pass them via API.
  2. Run: click Start, trigger via API or webhook, or set up a schedule.
  3. Get results: download as JSON, CSV, or Excel. Integrate with 1,000+ apps.


Ready to try Website Content to Markdown?

Start for free on Apify. No credit card required.

Open on Apify Store