
Website Content to Markdown

Convert any website to clean Markdown for RAG pipelines, LLM training, and AI apps. Crawls pages, strips boilerplate, preserves headings, tables, and code blocks. GFM support.

Try on Apify Store · $0.02 per event · Actively maintained

Users (30d): 5
Runs (30d): 35

Maintenance Pulse

Score: 93/100
Last build: 1d ago
Last version: 4d ago
Builds (30d): 8
Issue response: 6h avg


Pricing

Pay Per Event model. You only pay for what you use.

| Event | Description | Price |
| --- | --- | --- |
| page-converted | Charged per page converted to Markdown. Includes HTML-to-Markdown conversion, metadata extraction, and content filtering. | $0.02 |

Example: 100 events = $2.00 · 1,000 events = $20.00
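
The flat per-event rate makes cost projection a one-liner. A minimal sketch (the $0.02 rate is copied from the pricing table above):

```python
PRICE_PER_EVENT = 0.02  # USD per page-converted event, from the pricing table

def estimated_cost(pages: int) -> float:
    """Projected cost of a run that converts `pages` pages."""
    return round(pages * PRICE_PER_EVENT, 2)
```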

Documentation

Website Content to Markdown crawls any website and converts its HTML pages into clean, structured Markdown — purpose-built for RAG pipelines, LLM fine-tuning, AI chatbots, and documentation archival. Give it a list of URLs and get back structured Markdown documents with navigation, ads, cookie banners, and boilerplate stripped away.

Feed the output directly into LangChain, LlamaIndex, Pinecone, Weaviate, or any vector database. The word count field on every page helps you estimate token usage before chunking. No browser, no JavaScript runtime, no setup — just clean Markdown at scale.

What data can you extract?

| Data Point | Source | Example |
| --- | --- | --- |
| 📄 Markdown content | Converted page HTML | `# Getting Started\n\nThis guide covers...` |
| 🔗 Page URL | Request URL | https://docs.pinnacletech.io/guides/setup |
| 📌 Page title | OpenGraph, `<title>`, or `<h1>` | Getting Started — Pinnacle Docs |
| 📝 Meta description | OpenGraph or `<meta name="description">` | Learn how to set up your Pinnacle account... |
| 🔢 Word count | Counted from Markdown output | 1,843 |
| 🌐 Language | `<html lang>` attribute | en |
| 🔽 Crawl depth | Hops from starting URL | 2 |
| 🕐 Crawled at | Apify runtime timestamp | 2026-03-15T09:12:44.000Z |

Why use Website Content to Markdown?

Large language models and RAG pipelines need clean text — not raw HTML packed with <nav> elements, cookie consent banners, sidebar widgets, and tracking scripts. Preparing web content for AI consumption by hand means copy-pasting from dozens of pages, reformatting manually, and re-doing the work every time the source changes. That process does not scale.

This actor automates the entire pipeline: it discovers pages through sitemap.xml and internal link following, extracts the main content using semantic HTML selectors, strips more than 30 categories of boilerplate, and converts the result to GitHub Flavored Markdown in a single run. Every page becomes a clean, consistently formatted document ready for downstream AI processing.

Beyond the conversion itself, the Apify platform gives you tools that matter at scale:

  • Scheduling — run weekly or on a custom cron to keep your knowledge base snapshots current
  • API access — trigger runs from Python, JavaScript, or any HTTP client and pipe results directly into your pipeline
  • Proxy rotation — scrape at scale without IP blocks using Apify's built-in residential and datacenter proxy infrastructure
  • Monitoring — get Slack or email alerts when runs fail or produce unexpected results
  • Integrations — connect output to LangChain, LlamaIndex, Pinecone, Weaviate, Zapier, Make, or webhooks in minutes

Features

  • Semantic main content extraction — tries 10 semantic selectors in priority order: <main>, <article>, [role="main"], #content, .content, .post-content, .entry-content, .article-body, .page-content, .main-content. Uses the first match with 200+ characters of HTML to ensure the container actually holds content, not just a wrapper
  • 30+ category boilerplate removal — strips navigation, site headers, page footers, sidebars, widget areas, ad units (including .adsbygoogle), cookie/GDPR banners, social share buttons, comment sections, modals and popups, breadcrumbs, ARIA-hidden elements, scripts, styles, iframes, and SVG decorative elements
  • GitHub Flavored Markdown output — full GFM support via Turndown and turndown-plugin-gfm. ATX-style headings, fenced code blocks, inline links, tables, task lists, and strikethrough all convert correctly
  • Sitemap discovery — automatically fetches and parses /sitemap.xml before crawling starts, including sitemap index files that reference child sitemaps. Combines sitemap-discovered URLs with the starting URL for maximum coverage
  • Breadth-first link following — BFS link traversal up to 5 levels deep. The crawler resolves relative hrefs, filters same-domain links only, skips fragment anchors (#), and excludes binary file extensions (jpg, png, gif, svg, css, js, pdf, zip, ico, mp4, mp3, woff, ttf, eot)
  • Per-domain page limits — enforced at 1–100 pages per domain, tracked at runtime with exact URL deduplication (trailing slash normalized). The budget cap stops spending once your limit is reached
  • Concurrent crawling — 10 concurrent workers at up to 120 requests/minute. Session pool with persistent cookies ensures stable connections across multi-page crawls
  • Quality filter — pages that produce fewer than 50 characters of Markdown are silently skipped as near-empty. Skipped pages do not count toward the per-domain limit
  • Image handling — preserves meaningful images that have alt text. Strips data-URI inline images and tracking pixels. Empty anchor tags are removed
  • Metadata extraction — title resolves from OpenGraph og:title first, then <title>, then first <h1>. Description resolves from OpenGraph og:description, then <meta name="description">. Language code from <html lang>, lowercased and stripped of region suffix
  • Markdown cleanup — collapses three or more consecutive newlines to two, trims trailing whitespace from every line, removes lines that contain only whitespace
  • Proxy support — pass any Apify proxy configuration including residential proxies for sites that block datacenter IPs
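
The Markdown cleanup rules listed above (trim trailing whitespace, collapse blank lines) are simple enough to sketch; this is an illustrative Python reimplementation, not the actor's actual code:

```python
import re

def clean_markdown(text: str) -> str:
    """Sketch of the cleanup pass: trim trailing whitespace on every line
    (whitespace-only lines become empty), then collapse runs of three or
    more consecutive newlines down to two."""
    text = "\n".join(line.rstrip() for line in text.split("\n"))
    return re.sub(r"\n{3,}", "\n\n", text).strip()
```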

Use cases for converting websites to Markdown

RAG pipeline ingestion

AI engineers building retrieval-augmented generation systems need clean text to chunk and embed. This actor converts entire documentation sites into structured Markdown pages in a single run. The wordCount field on each record lets you estimate token cost before committing to chunking and embedding, avoiding expensive surprises downstream.

LLM fine-tuning dataset preparation

Teams preparing fine-tuning datasets for instruction-following or domain-specific models need high-quality, boilerplate-free text. This actor converts blog posts, knowledge bases, and technical documentation with all navigation and ad content removed — so training data reflects actual prose, not menu structures.

AI chatbot and knowledge base construction

Product teams building internal chatbots or customer-facing support tools need to ingest their documentation into a vector store. This actor converts company wikis, help centers, and product docs into Markdown that integrates directly with LangChain's UnstructuredMarkdownLoader and LlamaIndex's SimpleDirectoryReader.

Competitive content analysis

Marketing and strategy teams analyzing competitor websites can convert entire competitor blogs and resource libraries to Markdown, then run LLM-based content gap analysis, keyword extraction, and tone comparison — all from structured text rather than raw HTML.

Documentation archival and migration

Engineering teams migrating from legacy CMS platforms or creating offline documentation snapshots need a reliable way to extract content as portable Markdown. This actor crawls the full site and produces files ready for import into Hugo, Jekyll, Astro, or any Markdown-based documentation tool.

Content monitoring and freshness tracking

When combined with the Website Change Monitor, this actor enables a recurring pipeline: detect changed pages, re-convert them to Markdown, and update the corresponding records in your vector database or knowledge base.

How to convert a website to Markdown

  1. Enter your URLs — Paste one or more website URLs into the "Website URLs" field. Bare domains like pinnacletech.io are auto-prefixed with https://. Use section-specific URLs like https://docs.pinnacletech.io to target only relevant content rather than an entire homepage.
  2. Set your depth and page limit — The defaults (10 pages, depth 2) work for most documentation sections. Set depth to 0 if you only need the specific pages you listed. Increase maxPagesPerDomain up to 100 for larger sites.
  3. Run the actor — Click "Start" and wait. A 10-page run typically completes in under 30 seconds. A 100-page run takes 2–5 minutes depending on page size.
  4. Download your Markdown — Open the Dataset tab and export as JSON. Each record contains the markdown field with clean, LLM-ready content. You can also stream results via the API as they arrive.

Input parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| urls | string[] | Yes | (none) | Starting URLs to crawl. Bare domains are auto-prefixed with https://. One URL per line in the UI |
| maxPagesPerDomain | integer | No | 10 | Maximum pages to convert per domain (range: 1–100) |
| maxCrawlDepth | integer | No | 2 | Link-following depth from each starting URL. 0 = only the starting pages, 5 = maximum |
| includeMetadata | boolean | No | true | Include title, meta description, language code, and word count in each output record |
| onlyMainContent | boolean | No | true | Strip navigation, footers, sidebars, and ads. Extract only main article content |
| proxyConfiguration | object | No | Apify Proxy | Proxy settings for crawling. Defaults to Apify's datacenter proxy pool |

Input examples

Convert a documentation site (most common use case):

{
  "urls": ["https://docs.pinnacletech.io"],
  "maxPagesPerDomain": 50,
  "maxCrawlDepth": 3,
  "includeMetadata": true,
  "onlyMainContent": true
}

Batch-convert multiple sites for a knowledge base:

{
  "urls": [
    "https://docs.pinnacletech.io",
    "https://help.betaindustries.com",
    "https://support.acmecorp.com/articles"
  ],
  "maxPagesPerDomain": 25,
  "maxCrawlDepth": 2,
  "includeMetadata": true,
  "onlyMainContent": true
}

Single-page extraction (no link following):

{
  "urls": [
    "https://blog.acmecorp.com/2026/03/product-launch-guide",
    "https://blog.acmecorp.com/2026/02/api-best-practices"
  ],
  "maxPagesPerDomain": 1,
  "maxCrawlDepth": 0,
  "includeMetadata": true,
  "onlyMainContent": true
}

Input tips

  • Start with depth 0 for specific pages — if you already know which pages you need, list them explicitly and set maxCrawlDepth: 0 to avoid crawling unrelated content.
  • Use section-specific URLs — targeting https://docs.acmecorp.com/api-reference rather than https://acmecorp.com means the crawler starts in the right area and page limits apply to the relevant section.
  • Keep "main content only" enabled for AI workflows — disabling it includes navigation and sidebar text in the Markdown, which degrades chunking quality and wastes token budget in downstream LLM calls.
  • Process multiple sites in one run — the actor deduplicates by domain and tracks page limits per domain independently, so batching 10 sites in one run is more efficient than 10 separate runs.
  • Check word counts before embedding — the wordCount field lets you filter out near-empty pages and estimate token costs before sending to an embedding API.

Output example

{
  "url": "https://docs.pinnacletech.io/guides/getting-started",
  "title": "Getting Started — Pinnacle Docs",
  "description": "Everything you need to set up your first Pinnacle integration in under 10 minutes.",
  "markdown": "# Getting Started\n\nThis guide walks you through creating your first Pinnacle integration.\n\n## Prerequisites\n\nBefore you begin, make sure you have:\n\n- A Pinnacle account ([sign up free](https://pinnacletech.io/signup))\n- Node.js 18+ or Python 3.10+\n- Your API key from the [dashboard](https://dashboard.pinnacletech.io)\n\n## Step 1: Install the SDK\n\n```bash\nnpm install @pinnacle/sdk\n```\n\nOr with Python:\n\n```bash\npip install pinnacle-sdk\n```\n\n## Step 2: Initialize the client\n\n```javascript\nimport { PinnacleClient } from '@pinnacle/sdk';\n\nconst client = new PinnacleClient({\n  apiKey: process.env.PINNACLE_API_KEY\n});\n```\n\n## Next steps\n\n- [Authentication guide](/guides/auth)\n- [API reference](/api)\n- [Example projects](/examples)",
  "wordCount": 412,
  "language": "en",
  "crawlDepth": 0,
  "crawledAt": "2026-03-15T09:12:44.331Z"
}

Output fields

| Field | Type | Description |
| --- | --- | --- |
| url | string | Full URL of the converted page |
| title | string | Page title from OpenGraph og:title, then `<title>` tag, then first `<h1>`. Empty string if includeMetadata is false |
| description | string | Meta description from OpenGraph or `<meta name="description">`. Empty string if not present or includeMetadata is false |
| markdown | string | Full page content converted to GitHub Flavored Markdown. The primary output field |
| wordCount | integer | Word count of the Markdown text. Multiply by ~1.3 to estimate token usage for most LLMs |
| language | string or null | Language code from `<html lang>` attribute, lowercased and trimmed of region suffix (e.g., "en-US" becomes "en"). Null if not set |
| crawlDepth | integer | Number of link hops from the starting URL. 0 means the starting page itself |
| crawledAt | string | ISO 8601 timestamp of when the page was crawled and converted |

How much does it cost to convert websites to Markdown?

Website Content to Markdown uses Pay Per Event pricing: a flat $0.02 per page successfully converted, regardless of run time or memory. Pages that fail to load or are skipped by the quality filter are not charged.

| Scenario | Pages converted | Estimated cost | Estimated run time |
| --- | --- | --- | --- |
| Quick test | 1 | $0.02 | ~5 seconds |
| Small section | 10 | $0.20 | ~25 seconds |
| Medium documentation site | 50 | $1.00 | ~2 minutes |
| Large documentation site | 100 | $2.00 | ~4–6 minutes |
| Multi-site batch (500 pages total) | 500 | $10.00 | ~15–25 minutes |

You can cap spending directly through maxPagesPerDomain: the actor stops converting, and stops charging, as soon as your page limit is reached.

Compare this to manual copy-paste preparation: a developer spending 2 minutes per page preparing 100 pages for a RAG pipeline takes over 3 hours. This actor does the same job in under 5 minutes for $2.00.

Apify's free tier includes $5 of monthly platform credit, enough for roughly 250 page conversions.

Convert websites to Markdown using the API

Python

from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("ryanclinton/website-content-to-markdown").call(run_input={
    "urls": ["https://docs.pinnacletech.io"],
    "maxPagesPerDomain": 30,
    "maxCrawlDepth": 2,
    "includeMetadata": True,
    "onlyMainContent": True,
})

for page in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"{page['url']} — {page['wordCount']} words (~{int(page['wordCount'] * 1.3)} tokens)")
    # Save each page as a .md file for LangChain / LlamaIndex ingestion
    safe_name = page["url"].replace("https://", "").replace("/", "_")
    with open(f"{safe_name}.md", "w") as f:
        f.write(page["markdown"])

JavaScript

import { ApifyClient } from "apify-client";
import { writeFileSync } from "fs";

const client = new ApifyClient({ token: "YOUR_API_TOKEN" });

const run = await client.actor("ryanclinton/website-content-to-markdown").call({
    urls: ["https://docs.pinnacletech.io"],
    maxPagesPerDomain: 30,
    maxCrawlDepth: 2,
    includeMetadata: true,
    onlyMainContent: true,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
for (const page of items) {
    console.log(`${page.url} — ${page.wordCount} words`);
    // Feed into LangChain UnstructuredMarkdownLoader or a vector database
    const safeName = page.url.replace("https://", "").replace(/\//g, "_");
    writeFileSync(`${safeName}.md`, page.markdown);
}

cURL

# Start the actor run
curl -X POST "https://api.apify.com/v2/acts/ryanclinton~website-content-to-markdown/runs?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://docs.pinnacletech.io"],
    "maxPagesPerDomain": 30,
    "maxCrawlDepth": 2,
    "includeMetadata": true,
    "onlyMainContent": true
  }'

# Fetch results (replace DATASET_ID from the run response above)
curl "https://api.apify.com/v2/datasets/DATASET_ID/items?token=YOUR_API_TOKEN&format=json"

How Website Content to Markdown works

Phase 1: URL discovery and sitemap parsing

When the actor starts, it normalizes each input URL (adding https:// for bare domains, validating format) and deduplicates by hostname. For each unique starting URL, it fetches /sitemap.xml using a 10-second timeout with an ApifyBot/1.0 User-Agent. The sitemap parser handles both standard sitemaps (extracting <loc> tags) and sitemap index files (fetching the first child sitemap). URLs matching binary file extensions — jpg, png, gif, pdf, zip, mp4, xml — are excluded. The combined list of starting URLs and sitemap-discovered URLs forms the initial request queue.

Phase 2: Breadth-first crawling

A CheerioCrawler runs with 10 concurrent workers at up to 120 requests/minute, with session pooling and persistent cookies for stable multi-page crawls. Three retries are attempted on failure. The handler skips responses without an html Content-Type to avoid processing XML sitemaps or JSON feeds that sneak through. Per-domain page counts and visited URL sets are tracked in a shared Map<string, DomainState> that enforces both the page cap and URL deduplication (trailing slash normalized). For each successfully processed page, the handler enqueues same-domain internal links from <a href> elements, filtering out fragments, external domains, and binary file extensions, up to maxCrawlDepth levels deep using BFS with __crawlDepth userData propagation.
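
The enqueue filter in Phase 2 can be sketched as a pure function (names like `should_enqueue` are illustrative, not the actor's real identifiers):

```python
from urllib.parse import urljoin, urlparse, urldefrag

BINARY_EXTS = {"jpg", "png", "gif", "svg", "css", "js", "pdf", "zip",
               "ico", "mp4", "mp3", "woff", "ttf", "eot"}

def should_enqueue(href, base_url, visited, depth, max_depth):
    """Return the normalized absolute URL if the link passes the filters
    described above, else None. Illustrative sketch of the crawler's logic."""
    if depth >= max_depth:                           # BFS depth cap
        return None
    url, _ = urldefrag(urljoin(base_url, href))      # resolve relative href, drop #fragment
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return None
    if parsed.netloc != urlparse(base_url).netloc:   # same-domain links only
        return None
    ext = parsed.path.rsplit(".", 1)[-1].lower() if "." in parsed.path else ""
    if ext in BINARY_EXTS:                           # skip binary file extensions
        return None
    normalized = url.rstrip("/")                     # trailing-slash dedup
    if normalized in visited:
        return None
    visited.add(normalized)
    return normalized
```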

Phase 3: Content extraction and Markdown conversion

The extraction pipeline runs in sequence for each page. First, extractContent() tries the 10 semantic selectors in order (<main>, <article>, [role="main"], #content, .content, .post-content, .entry-content, .article-body, .page-content, .main-content) and uses the first element with 200+ characters of inner HTML. A second Cheerio pass strips the 30+ non-content selectors from within the matched container. If no semantic container matches, the full <body> is used with the same stripping applied globally.
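
The first-match-over-threshold scan reduces to a few lines. A sketch, with per-selector matches pre-collected into a dict for illustration:

```python
SEMANTIC_SELECTORS = [
    "main", "article", '[role="main"]', "#content", ".content",
    ".post-content", ".entry-content", ".article-body",
    ".page-content", ".main-content",
]

def pick_container(matches, min_chars=200):
    """Return the inner HTML of the first selector (in priority order)
    whose content meets the length threshold, else None."""
    for selector in SEMANTIC_SELECTORS:
        html = matches.get(selector, "")
        if len(html) >= min_chars:
            return html
    return None
```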

Next, htmlToMarkdown() passes the cleaned HTML to a pre-configured Turndown instance with ATX heading style, fenced code blocks (triple backtick), bullet markers, inline links, and preformattedCode: true to preserve code block whitespace. The turndown-plugin-gfm plugin adds table and strikethrough support. Two custom Turndown rules are applied: images without alt text or with data-URI sources are dropped entirely; anchor tags with empty text content are removed. The resulting Markdown is cleaned with cleanMarkdown() which collapses multiple blank lines, trims line-trailing whitespace, and strips whitespace-only lines.

Pages producing fewer than 50 characters of Markdown after the full pipeline are silently skipped — these are typically login redirect pages or thin landing pages. The final record includes the URL, title (OpenGraph > <title> > <h1>), description (OpenGraph > meta description), Markdown, word count, language, crawl depth, and ISO timestamp.
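
The metadata fallback chains in the final step reduce to simple helpers; a sketch of the title and language resolution (illustrative, not the actor's actual code):

```python
def resolve_title(og_title, title_tag, first_h1):
    """Title fallback: OpenGraph og:title, then <title>, then first <h1>;
    empty string if none is present."""
    for candidate in (og_title, title_tag, first_h1):
        if candidate and candidate.strip():
            return candidate.strip()
    return ""

def normalize_lang(lang_attr):
    """Lowercase the <html lang> value and strip the region suffix,
    so 'en-US' becomes 'en'. None if the attribute is missing."""
    if not lang_attr:
        return None
    return lang_attr.strip().split("-")[0].lower() or None
```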

Tips for best results

  1. Target section roots, not homepages. Pointing at https://docs.acmecorp.com rather than https://acmecorp.com ensures your page budget is spent on documentation rather than marketing pages. The crawler follows internal links from the starting point.

  2. Use depth 0 when you have a URL list. If you already know which pages to convert, list them all in urls and set maxCrawlDepth: 0. This is faster and more predictable than relying on link discovery.

  3. Estimate token budgets before embedding. Sum the wordCount values in your output and multiply by 1.3. A 100-page documentation site averaging 800 words per page produces roughly 104,000 tokens — helpful to know before choosing an embedding model.

  4. Disable metadata for bulk training data. If you are building a fine-tuning dataset and only need raw Markdown text, set includeMetadata: false. It has negligible cost impact but keeps output records leaner.

  5. Run on a schedule for living knowledge bases. Use Apify's scheduling to re-run this actor weekly against your source sites. Pair it with the Website Change Monitor to trigger re-conversion only when content actually changes.

  6. For SPAs, use the Pro version. If a site loads content through JavaScript (React, Vue, Angular apps), this actor will return the skeleton HTML, not the rendered content. See the Limitations section.

  7. Combine with Company Deep Research for enterprise content. Feed company website Markdown directly into the Company Deep Research actor for comprehensive intelligence reports that include the company's own published content.

  8. Set proxy for rate-limited sites. Enable proxyConfiguration with Apify residential proxies if a site returns 429 or 403 errors during crawling. The session pool will rotate identities across requests.
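
The token-budget arithmetic from tip 3 as a helper function (assuming the common 1.3 tokens-per-word rule of thumb):

```python
def estimate_tokens(pages, tokens_per_word=1.3):
    """Rough token estimate from the wordCount field of each output record."""
    return int(sum(p.get("wordCount", 0) for p in pages) * tokens_per_word)
```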

Combine with other Apify actors

| Actor | How to combine |
| --- | --- |
| AI Training Data Curator | Convert websites to Markdown, then pass to the curator for deduplication, quality filtering, and fine-tuning dataset formatting |
| Website Change Monitor | Detect when source pages change, then trigger this actor to re-convert only the updated pages for incremental knowledge base updates |
| Company Deep Research | Convert a company's public website to Markdown and feed the content into deep research workflows for comprehensive intelligence reports |
| Website Contact Scraper | Run both actors on the same domain: one extracts contacts, the other extracts page content for enriched company profiles |
| Website Tech Stack Detector | Identify a site's technology stack first, then convert its content to Markdown — useful for contextualizing technical documentation |
| Competitor Analysis Report | Convert competitor sites to Markdown, then run competitive analysis on the structured text using LLMs |
| B2B Lead Gen Suite | Enrich lead profiles with content extracted from their company websites converted to Markdown |

Limitations

  • No JavaScript rendering — the actor uses CheerioCrawler, which parses the server-delivered HTML response. Single-page applications (React, Vue, Angular, Next.js with client-side rendering) that load content via JavaScript will return an empty shell or loading spinner. For JS-rendered sites, a headless browser approach is required.
  • No authenticated content — only publicly accessible pages are processed. Login walls, paywalls, and members-only content produce their gate page, not the protected content behind it.
  • Same-domain crawling only — the crawler never follows links to external domains. If a site's documentation is split across multiple subdomains (e.g., docs.acmecorp.com and api.acmecorp.com), list both as separate starting URLs.
  • 100-page maximum per domain — set by the input schema's maximum constraint. For very large sites, run multiple targeted crawls against specific sections.
  • Sitemap-dependent discovery — pages that are not linked from any crawled page and not present in sitemap.xml will not be discovered. Orphaned pages require explicit URL input.
  • No PDF or binary content — only HTML pages are converted. PDF documents, Word files, and embedded media are skipped.
  • English-biased class name selectors — the semantic content selectors use English CSS class names (.content, .post-content, .entry-content). Sites using non-English or unusual class naming conventions may need onlyMainContent: false to capture all content, at the cost of including some boilerplate.
  • No JavaScript execution in content — dynamically inserted content (lazy-loaded sections, infinite scroll, tab-hidden content) is not captured because it requires browser execution.

Integrations

  • LangChain / LlamaIndex — use the Apify integration to load Markdown output directly into your RAG pipeline as document chunks
  • Zapier — send converted Markdown pages to Notion, Google Docs, Confluence, or Slack on run completion
  • Make — chain conversion runs with Airtable, HubSpot, or Slack steps in automated content workflows
  • Google Sheets — export URL, title, word count, and language to a spreadsheet for content audits
  • Apify API — trigger runs programmatically from CI/CD pipelines and retrieve Markdown via REST for embedding workflows
  • Webhooks — receive a POST notification with the dataset URL when conversion finishes, enabling async pipeline triggers
  • Vector databases (Pinecone, Weaviate, Qdrant, Chroma) — pipe the markdown field directly into your embedding and upsert pipeline after chunking
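
Before upserting into a vector database, the markdown field usually needs chunking. A minimal paragraph-based chunker (illustrative; production pipelines typically use a tokenizer-aware splitter instead):

```python
def chunk_markdown(markdown, max_words=300):
    """Split on blank lines, then pack paragraphs into chunks of at most
    max_words words (a single oversized paragraph still becomes its own chunk)."""
    chunks, current, count = [], [], 0
    for para in markdown.split("\n\n"):
        words = len(para.split())
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```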

Troubleshooting

  • Output Markdown is empty or very short — the source site likely uses JavaScript to render its content. CheerioCrawler only parses server-sent HTML. Check the page in your browser with JavaScript disabled; if you see a blank or loading page, this actor cannot process it. A headless browser alternative is required for SPAs.

  • Getting unexpected navigation or sidebar content in Markdown — some sites use non-standard markup without semantic HTML elements (<main>, <article>). The actor falls back to body-level stripping, which may miss some structural elements. Try disabling onlyMainContent and stripping the specific selectors yourself in post-processing, or provide a more specific section URL.

  • Run stopped before reaching the page limit — the actor logs a warning when the per-domain pageCount cap is reached. Increase maxPagesPerDomain (up to 100) or run multiple crawls targeting different sections of the site.

  • Some pages failing with 403 or 429 errors — the target site is blocking the crawler. Enable proxyConfiguration with "useApifyProxy": true and optionally set proxyUrls to residential proxies. The session pool will rotate IPs across requests.

  • Sitemap URLs not being picked up — some sites serve their sitemap at a non-standard path (e.g., /sitemap_index.xml or /sitemaps/pages.xml). The actor only checks /sitemap.xml. For sites with non-standard sitemap locations, add the specific page URLs manually to the urls input.

Responsible use

  • This actor only accesses publicly available web pages.
  • Respect robots.txt directives and website terms of service regarding automated access.
  • Do not use converted content in ways that violate the original site's copyright or content license.
  • Comply with applicable data protection laws (GDPR, CCPA) when storing or processing scraped content.
  • For guidance on web scraping legality, see Apify's guide.

FAQ

How do I convert a website to Markdown for a RAG pipeline? Enter the documentation site URL, set maxPagesPerDomain to the number of pages you want, and set onlyMainContent: true. Each output record contains a markdown field ready for chunking and embedding. The wordCount field helps you estimate token counts before sending to your embedding API.

What types of websites does this actor convert best? Text-heavy, server-rendered sites: documentation portals, developer guides, help centers, blogs, knowledge bases, and informational pages. Sites that rely on JavaScript to render their content (React SPAs, Angular apps) are not supported — use a headless browser approach for those.

Can I use the Markdown output directly with ChatGPT, Claude, or Gemini? Yes. The Markdown format is natively understood by all major LLMs. Feed the markdown field directly into prompts, or use the word count to gauge how many pages fit within a context window (rough estimate: 1 word ≈ 1.3 tokens).

How many pages can I convert in one run? Up to 100 pages per domain per run, across as many domains as you provide. For larger sites, run multiple targeted crawls against different sections and merge the datasets. There is no limit on the number of domains in a single run.

Does this actor follow links to other domains? No. The crawler only follows internal links within the same domain (and subdomain) as each starting URL. If you need content from multiple domains, add each as a separate entry in the urls input.

How is this different from manually copy-pasting website content? Manual copy-paste for 50 pages takes 2–4 hours and produces inconsistent formatting. This actor processes 50 pages in under 2 minutes, produces consistently formatted GitHub Flavored Markdown, strips all boilerplate automatically, and runs unattended on a schedule. The per-page word count and metadata fields are not available from manual copying.

How does the "main content only" mode work? The actor tries 10 semantic HTML selectors in priority order — <main>, <article>, [role="main"], and 7 common content class names. The first matching element with 200+ characters of inner HTML is used as the content container. Non-content elements (nav, footer, sidebar, ads, etc.) are then stripped from within that container. If no semantic container is found, the full <body> is used with the same stripping applied.

Is it legal to convert website content to Markdown? Accessing publicly available web pages is generally legal in most jurisdictions. However, you should review each target website's terms of service, respect robots.txt directives, and ensure your use of the converted content complies with copyright law. For commercial AI training use cases, some site terms explicitly restrict automated scraping. See Apify's guide on web scraping legality for a detailed overview.

Can I schedule this actor to run automatically? Yes. Apify's scheduling feature lets you set recurring runs on a cron schedule (daily, weekly, or custom). This is ideal for keeping documentation snapshots current or monitoring competitor content.

What happens to pages that fail to load? Failed requests are retried up to 3 times with exponential backoff. If still failing after retries, the page is logged as a warning and skipped. Skipped pages do not count toward the per-domain page limit, so your budget is not wasted on failures.

How is this different from Apify's Website Content Crawler? Both convert web pages to text, but this actor is a lightweight, cost-efficient solution for straightforward HTML sites. It uses CheerioCrawler (no browser, ~256 MB memory) and outputs structured JSON with per-page metadata. Apify's Website Content Crawler uses a full browser and supports JavaScript rendering but runs at higher cost. Choose this actor for static and server-rendered sites; choose a browser-based solution for SPAs.

Can I use this actor's output with LangChain or LlamaIndex? Yes. The markdown field integrates directly with LangChain's UnstructuredMarkdownLoader and LlamaIndex's SimpleDirectoryReader. Apify also provides a native LangChain integration that loads dataset items as LangChain documents without any custom code.

Help us improve

If you encounter issues, you can help us debug faster by enabling run sharing in your Apify account:

  1. Go to Account Settings > Privacy
  2. Enable Share runs with public Actor creators

This lets us see your run details when something goes wrong, so we can fix issues faster. Your data is only visible to the actor developer, not publicly.

Support

Found a bug or have a feature request? Open an issue in the Issues tab on this actor's page. For custom solutions or enterprise integrations, reach out through the Apify platform.

How it works

  1. Configure: set your parameters in the Apify Console or pass them via API.
  2. Run: click Start, trigger via API or webhook, or set up a schedule.
  3. Get results: download as JSON, CSV, or Excel. Integrate with 1,000+ apps.


Ready to try Website Content to Markdown?

Start for free on Apify. No credit card required.

Open on Apify Store