# Website Content Analyzer
Website Content Analyzer extracts personalization data from agency websites at scale — services offered, industries served, client names, company size, tech stack, and tone keywords — so your outreach lands with context instead of boilerplate. Paste in a list of URLs, run the actor, and get a structured dataset ready for cold email, CRM enrichment, or segmentation in minutes.
The actor calls the Website Content to Markdown sub-actor to fetch up to 4 pages per site (homepage, /about, /services, /case-studies), concatenates the content into a single text blob, then runs pure in-process keyword analysis against hardcoded taxonomies of 80+ services, 50+ industry verticals, and 40+ tech stack tools. No LLM API calls, no external dependencies beyond Apify — every result is deterministic and cheap.
## What data can you extract?
| Data Point | Source | Example |
|---|---|---|
| 📋 Services offered | Regex against 80-term service taxonomy | ["SEO", "PPC", "Content Marketing", "CRO", "Web Design"] |
| 🏭 Industries served | Regex against 50-term industry taxonomy | ["Healthcare", "SaaS", "E-commerce", "Finance"] |
| 👤 Client names | Three extraction patterns: phrase, heading, "trusted by" | ["Acorn Capital", "Pinnacle Health", "Delta Ventures"] |
| 📏 Size signal | Heuristic scoring against headcount and structure phrases | "small" — one of: solo / small / mid / large / unknown |
| 🛠 Tech stack signals | Regex against 40-tool taxonomy | ["HubSpot", "Klaviyo", "Google Analytics 4", "Shopify"] |
| 🎨 Tone keywords | Top-5 adjectives from homepage hero (first 2,000 chars) | ["data-driven", "boutique", "results-oriented"] |
| 📝 Summary snippet | Meta description or first paragraph (up to 200 chars) | "We help B2B SaaS companies drive predictable pipeline..." |
| 🌐 Domain | Normalized from input URL | "brightedgedigital.com" |
| 📊 Pages analyzed | Count of successfully fetched pages | 4 |
| ✅ Analysis status | complete, partial, or failed | "complete" |
| 🕐 Analyzed at | ISO 8601 timestamp | "2026-03-22T09:14:32.000Z" |
## Why use Website Content Analyzer?
Researching a prospect agency manually means opening 4-6 tabs, reading through service pages, scrolling for client logos, and guessing their positioning — then copying notes into a spreadsheet row by row. At 20 minutes per site, a list of 100 agencies takes 33 hours before you write a single email.
This actor automates the entire research step. Drop in 100 URLs and get a structured dataset in under 30 minutes. Every record has the exact fields you need to write a personalized opening line: what they do, who they serve, which tools they use, and how they describe themselves.
Beyond the automation, you get the full Apify platform around it:
- Scheduling — run weekly to keep your prospect database fresh as agencies update their sites
- API access — trigger runs from your CRM, Python pipeline, or any HTTP client
- Proxy rotation — Apify's automatic proxy infrastructure handles sites that rate-limit scrapers
- Monitoring — get Slack or email alerts when runs fail or return unexpected results
- Integrations — push results directly to HubSpot, Google Sheets, Zapier, or Make
## Features
- 80-term service taxonomy — covers every major marketing and agency service: SEO, PPC, Google Ads, Meta Ads, LinkedIn Ads, CRO, Content Marketing, Email Marketing, Branding, PR, ABM, Video Marketing, Programmatic Advertising, and 65+ more
- 50-term industry taxonomy — detects verticals from Healthcare, Finance, SaaS, and E-commerce through Dental, Cannabis, Crypto, Senior Living, and Aerospace
- 40-tool tech stack taxonomy — identifies HubSpot, Salesforce, Klaviyo, Semrush, Ahrefs, Shopify, Webflow, Marketo, Google Analytics 4, The Trade Desk, and 30+ more
- Three-pattern client name extraction — finds brand names via phrase matching ("clients include..."), markdown heading parsing, and "trusted by" list scanning; caps at 10 names to eliminate noise
- Heuristic size scoring — combines 8 signal types (team-of-N phrases, explicit headcounts, "boutique", "global offices", "Fortune 500" references) into a scored classification: solo, small, mid, large, or unknown
- Tone keyword extraction — matches 55 positioning adjectives against the first 2,000 characters of site content (the homepage hero), returns up to 5 that appeared
- Summary snippet — pulls the meta description first; falls back to the first substantial paragraph; caps at 200 characters for direct use in email openers
- Sub-actor architecture — delegates fetching to the Website Content to Markdown actor, inheriting its page rendering, content extraction, and proxy handling
- Configurable depth — `maxPagesPerSite` from 1 to 10; default is 4 pages (homepage + /about + /services + /case-studies)
- Sequential processing with PPE charging — processes one website at a time, charges only on success, and stops cleanly when your spending limit is reached
- Run-level summary record — final dataset item includes total counts (successful / partial / failed), pages analyzed, and elapsed seconds
- Zero LLM costs — all analysis is pure regex and pattern matching; no OpenAI, Anthropic, or Gemini API calls
## Use cases for website content analysis

### Sales prospecting and cold email personalization
SDRs and BDRs targeting marketing agencies can pull a list of 200 agency websites from a directory, run the actor overnight, and wake up to a spreadsheet where every row has services, industries, clients, and a tone snippet. That data feeds directly into a personalized opening line: "Noticed you work with healthcare brands using HubSpot — here is something that might help your demand gen team." The alternative is more than three hours of manual research for just 10 prospects.
### Marketing agency lead generation
Agencies building prospect lists for their own business development can profile hundreds of potential clients in a single run. Filter the output by industriesServed to find prospects in your niche, by sizeSignal to target companies in your sweet spot, and by techStackSignals to focus on organizations using tools you integrate with. This replaces manual browsing through agency directories and LinkedIn.
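Filtering an exported dataset on those three fields takes only a few lines of Python. The records below are illustrative sample data in the actor's output shape, not real results:

```python
# Sample records in the actor's output shape (illustrative values, not real data)
records = [
    {"domain": "brightedgedigital.com", "analysisStatus": "complete",
     "industriesServed": ["Healthcare", "SaaS"], "sizeSignal": "small",
     "techStackSignals": ["HubSpot"], "servicesOffered": ["PPC", "SEO"]},
    {"domain": "apexcreative.agency", "analysisStatus": "partial",
     "industriesServed": ["E-commerce"], "sizeSignal": "large",
     "techStackSignals": ["Shopify"], "servicesOffered": ["Web Design"]},
    {"type": "summary", "totalWebsites": 2, "successful": 1, "partial": 1, "failed": 0},
]

prospects = [
    r for r in records
    if r.get("type") != "summary"                      # skip the run-level summary item
    and r.get("analysisStatus") == "complete"          # drop failed/partial records
    and "Healthcare" in r.get("industriesServed", [])  # your target vertical
    and r.get("sizeSignal") in ("small", "mid")        # your size sweet spot
    and "HubSpot" in r.get("techStackSignals", [])     # tools you integrate with
]

print([p["domain"] for p in prospects])  # → ['brightedgedigital.com']
```

The same filter works unchanged on a JSON export downloaded from the Dataset tab or on items iterated via the Apify client.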
### CRM enrichment and account profiling
RevOps and data teams can enrich existing CRM records by running the actor against company websites already in HubSpot or Salesforce. Services offered, industries served, and tech stack become new contact properties for segmentation, lead scoring, and routing to the right account executive.
### Competitive intelligence and market mapping
Strategy teams can map the service landscape of an industry by analyzing every agency ranking for a target keyword. The structured output makes it easy to pivot on servicesOffered and see which services are commoditized versus rare in a given market.
### Partner and reseller prospecting
SaaS companies building partner programs can identify agencies that already mention their platform in their tech stack. A run against 500 agency sites will surface exactly which ones name-drop HubSpot, Klaviyo, or Shopify — making them natural referral or co-sell candidates.
### Recruitment and talent market research
Talent acquisition teams researching which agencies specialize in specific disciplines (performance marketing, B2B ABM, CRO) can use the servicesOffered output to qualify agencies before reaching out to hire from them. The sizeSignal field helps prioritize outreach toward boutique shops where individual contributors are more reachable.
## How to analyze agency website content
- Enter your website URLs — Paste one or more agency homepage URLs into the `websites` field. For example: `https://brightedgedigital.com`, `https://pivotalmarketing.io`. Subpages are fetched automatically — you only need the homepage.
- Set your page depth — The default of 4 pages (homepage, /about, /services, /case-studies) covers most use cases. Increase to 6-10 for agencies with many subpages, or reduce to 1-2 for a fast first-pass scan.
- Click Start and wait — The actor processes sites sequentially. Expect roughly 30-60 seconds per website depending on site speed and your selected page depth.
- Download your results — Go to the Dataset tab when the run completes. Export as JSON, CSV, or Excel. Each row is one website with all extracted fields ready to use.
## Input parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `websites` | array of strings | Yes | — | Homepage URLs to analyze. One result record per URL. |
| `maxPagesPerSite` | integer (1–10) | No | 4 | Pages fetched per site in priority order: homepage, /about, /services, /case-studies. Higher values increase accuracy and cost. |
| `proxyConfiguration` | object | No | Apify Automatic Proxy enabled | Proxy settings forwarded to the Website Content to Markdown sub-actor. |
## Input examples
Typical agency prospecting batch:
```json
{
  "websites": [
    "https://brightedgedigital.com",
    "https://pivotalmarketing.io",
    "https://growthlabagency.com",
    "https://vertexperformance.co",
    "https://apexcreative.agency"
  ],
  "maxPagesPerSite": 4,
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}
```
Fast first-pass scan — homepage only, lower cost:
```json
{
  "websites": [
    "https://brightedgedigital.com",
    "https://pivotalmarketing.io",
    "https://growthlabagency.com"
  ],
  "maxPagesPerSite": 1
}
```
Deep analysis for high-priority accounts:
```json
{
  "websites": [
    "https://brightedgedigital.com"
  ],
  "maxPagesPerSite": 10,
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  }
}
```
## Input tips
- Start with 4 pages — the default covers homepage, /about, /services, and /case-studies. This is where agencies concentrate their positioning language and client references.
- Use `maxPagesPerSite: 1` for large batches — if you are running 200+ sites and want a rough first pass, homepage-only mode halves both cost and run time.
- Batch in one run — processing 100 sites in one run is faster than 100 separate single-site runs because sub-actor overhead is shared.
- Enable Apify Automatic Proxy — enabled by default. Some agency sites use Cloudflare or basic bot detection; leaving the proxy on avoids empty results on those sites.
- Provide homepage URLs only — the actor appends `/about`, `/services`, and `/case-studies` automatically. Do not append paths manually.
## Output example
```json
{
  "website": "https://brightedgedigital.com",
  "domain": "brightedgedigital.com",
  "servicesOffered": [
    "Analytics",
    "B2B Marketing",
    "Content Marketing",
    "CRO",
    "Email Marketing",
    "Google Ads",
    "HubSpot",
    "Lead Generation",
    "Marketing Automation",
    "PPC",
    "SEO",
    "Social Media Marketing"
  ],
  "industriesServed": [
    "Finance",
    "Healthcare",
    "Professional Services",
    "SaaS",
    "Technology"
  ],
  "clientNames": [
    "Acorn Capital",
    "Delta Health Systems",
    "Meridian Software",
    "Pinnacle Advisors",
    "Vertex Analytics"
  ],
  "sizeSignal": "small",
  "techStackSignals": [
    "Google Analytics 4",
    "HubSpot",
    "Looker Studio",
    "Semrush"
  ],
  "toneKeywords": [
    "data-driven",
    "results-oriented",
    "transparent",
    "boutique",
    "certified"
  ],
  "summarySnippet": "Brightedge Digital is a boutique B2B marketing agency helping SaaS and financial services firms drive predictable pipeline with data-driven demand generation.",
  "pagesAnalyzed": 4,
  "analysisStatus": "complete",
  "analyzedAt": "2026-03-22T09:14:32.000Z"
}
```
## Output fields
| Field | Type | Description |
|---|---|---|
| `website` | string | The input URL that was analyzed |
| `domain` | string | Normalized domain (`www.` stripped) |
| `servicesOffered` | string[] | Services matched from the 80-term taxonomy, alphabetically sorted |
| `industriesServed` | string[] | Industries matched from the 50-term taxonomy, alphabetically sorted |
| `clientNames` | string[] | Proper-noun brand names found near "clients include", "trusted by", or case study headings; up to 10 |
| `sizeSignal` | string | Estimated company size: solo, small, mid, large, or unknown |
| `techStackSignals` | string[] | Tools and platforms mentioned in site content, matched from the 40-tool taxonomy |
| `toneKeywords` | string[] | Up to 5 positioning adjectives from the homepage hero section (first 2,000 characters) |
| `summarySnippet` | string | Meta description or first paragraph, truncated to 200 characters |
| `pagesAnalyzed` | integer | Number of pages successfully fetched |
| `analysisStatus` | string | `complete` — all requested pages retrieved; `partial` — some pages failed; `failed` — no content retrieved |
| `analyzedAt` | string | ISO 8601 timestamp |
The final record in every dataset is a summary item (identified by "type": "summary") with aggregate counts: totalWebsites, successful, partial, failed, totalPagesAnalyzed, and durationSeconds.
## How much does it cost to analyze agency websites?
Website Content Analyzer uses pay-per-event pricing — you pay $0.10 per website successfully analyzed. Failed websites (where no content was retrieved) are not charged. Platform compute costs are included.
| Scenario | Websites | Cost per site | Total cost |
|---|---|---|---|
| Quick test | 1 | $0.10 | $0.10 |
| Small batch | 10 | $0.10 | $1.00 |
| Medium batch | 50 | $0.10 | $5.00 |
| Large batch | 200 | $0.10 | $20.00 |
| Enterprise batch | 1,000 | $0.10 | $100.00 |
You can set a maximum spending limit per run to control costs. The actor stops when your budget is reached — no overage surprises.
Compare this to manual research at 20 minutes per site: a 50-site batch costs $5.00 and runs in under 30 minutes versus 17 hours of analyst time. Tools like Apollo.io or Clay charge $49–149/month for contact enrichment that still requires manual service profiling. With this actor, you get richer agency-specific context at a fraction of the cost with no subscription commitment.
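Budgeting a run before you start it is simple arithmetic. A minimal sketch, assuming a 95% success rate (your actual rate will vary; failed sites are never charged):

```python
def estimate_run_cost_usd(num_websites, success_rate=0.95, price_cents=10):
    """Rough pre-run budget: only successfully analyzed sites trigger the $0.10 charge."""
    charged_sites = round(num_websites * success_rate)  # failed sites are free
    return charged_sites * price_cents / 100

print(estimate_run_cost_usd(200))   # → 19.0
print(estimate_run_cost_usd(1000))  # → 95.0
```

Set your per-run spending limit slightly above this estimate so a better-than-expected success rate does not cut the batch short.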
## Analyze agency websites using the API
### Python

```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("ryanclinton/website-content-analyzer").call(run_input={
    "websites": [
        "https://brightedgedigital.com",
        "https://pivotalmarketing.io",
        "https://growthlabagency.com"
    ],
    "maxPagesPerSite": 4,
    "proxyConfiguration": {"useApifyProxy": True}
})

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    if item.get("type") == "summary":
        continue
    domain = item.get("domain", "")
    services = ", ".join(item.get("servicesOffered", []))
    industries = ", ".join(item.get("industriesServed", []))
    size = item.get("sizeSignal", "unknown")
    snippet = item.get("summarySnippet", "")
    print(f"{domain} [{size}] — {services}")
    print(f"  Industries: {industries}")
    print(f"  Snippet: {snippet}")
```
### JavaScript

```javascript
import { ApifyClient } from "apify-client";

const client = new ApifyClient({ token: "YOUR_API_TOKEN" });

const run = await client.actor("ryanclinton/website-content-analyzer").call({
  websites: [
    "https://brightedgedigital.com",
    "https://pivotalmarketing.io",
    "https://growthlabagency.com"
  ],
  maxPagesPerSite: 4,
  proxyConfiguration: { useApifyProxy: true }
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();

for (const item of items) {
  if (item.type === "summary") continue;
  const { domain, sizeSignal, servicesOffered, industriesServed, summarySnippet } = item;
  console.log(`${domain} [${sizeSignal}]`);
  console.log(`  Services: ${servicesOffered.join(", ")}`);
  console.log(`  Industries: ${industriesServed.join(", ")}`);
  console.log(`  Snippet: ${summarySnippet}`);
}
```
### cURL

```bash
# Start the actor run
curl -X POST "https://api.apify.com/v2/acts/ryanclinton~website-content-analyzer/runs?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "websites": [
      "https://brightedgedigital.com",
      "https://pivotalmarketing.io"
    ],
    "maxPagesPerSite": 4,
    "proxyConfiguration": { "useApifyProxy": true }
  }'

# Fetch results after the run completes (replace DATASET_ID from the run response)
curl "https://api.apify.com/v2/datasets/DATASET_ID/items?token=YOUR_API_TOKEN&format=json"
```
## How Website Content Analyzer works
### Phase 1 — URL construction and page fetching
For each input URL, the actor constructs up to `maxPagesPerSite` page URLs in a fixed priority order: the homepage first, then /about, /services, and /case-studies. If `maxPagesPerSite` is 1, only the homepage is fetched. The constructed URL list is passed to the Website Content to Markdown sub-actor (actor ID `p5CEBX8eQUGKlMEmW`) with `onlyMainContent: true` to strip navigation, footers, and boilerplate. The sub-actor handles HTTP fetching, browser rendering where needed, and content normalization. Its dataset items are retrieved via the Apify client with a limit of 1,000 items.
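The URL construction step can be sketched in a few lines. This is an illustration of the described behavior, not the actor's actual source:

```python
from urllib.parse import urlparse

# Fixed subpage priority, per the description above
SUBPAGE_PRIORITY = ["/about", "/services", "/case-studies"]

def build_page_urls(homepage_url, max_pages_per_site=4):
    """Approximate Phase 1: homepage first, then prioritized subpages up to the cap."""
    parsed = urlparse(homepage_url)
    base = f"{parsed.scheme}://{parsed.netloc}"
    urls = [base]  # homepage is always fetched first
    for path in SUBPAGE_PRIORITY:
        if len(urls) >= max_pages_per_site:
            break
        urls.append(base + path)
    return urls

for url in build_page_urls("https://brightedgedigital.com"):
    print(url)
```

Note how a `maxPagesPerSite` of 1 yields only the homepage, and values above 4 add nothing because the priority list is exhausted.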
### Phase 2 — Text assembly and status classification
The sub-actor's dataset items are filtered for errors. Non-error items have their `markdown` or `text` fields concatenated with double-newline separators into a single text blob. If the number of successful pages equals the number of pages requested, the record is marked `complete`; fewer successes yields `partial`; zero content yields `failed`. Only `complete` and `partial` records trigger a PPE charge event (`website-analyzed`).
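The status decision reduces to a tiny pure function (a sketch of the rules just described, not the actor's code):

```python
def classify_status(pages_requested, pages_fetched):
    """Status rules from Phase 2: all pages → complete, some → partial, none → failed."""
    if pages_fetched == 0:
        return "failed"
    if pages_fetched >= pages_requested:
        return "complete"
    return "partial"

print(classify_status(4, 4), classify_status(4, 2), classify_status(4, 0))
# → complete partial failed
```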
### Phase 3 — In-process taxonomy matching
The full text blob is passed through three `matchTaxonomy` calls against `SERVICE_TAXONOMY` (80 entries), `INDUSTRY_TAXONOMY` (50 entries), and `TECH_STACK_TAXONOMY` (40 entries). Each taxonomy entry holds one label and an array of compiled `RegExp` objects. The text is tested against each pattern in sequence; the first match for any pattern in an entry adds that entry's label to a `Set`. Labels are returned as a sorted array. All matching runs entirely in Node.js memory with no external I/O.
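The matching scheme translates directly into a few lines of Python. The three entries and their patterns below are an illustrative slice, not the actor's real 80-entry taxonomy:

```python
import re

# A tiny illustrative slice of a taxonomy — the real actor compiles 80 service entries
SERVICE_TAXONOMY = {
    "SEO": [re.compile(r"\bseo\b", re.I), re.compile(r"search engine optimi[sz]ation", re.I)],
    "PPC": [re.compile(r"\bppc\b", re.I), re.compile(r"pay[- ]per[- ]click", re.I)],
    "CRO": [re.compile(r"\bcro\b", re.I), re.compile(r"conversion rate optimi[sz]ation", re.I)],
}

def match_taxonomy(text, taxonomy):
    """First pattern hit for any entry adds that entry's label; labels come back sorted."""
    matched = set()
    for label, patterns in taxonomy.items():
        if any(p.search(text) for p in patterns):
            matched.add(label)
    return sorted(matched)

blob = "We offer search engine optimization and pay-per-click campaigns."
print(match_taxonomy(blob, SERVICE_TAXONOMY))  # → ['PPC', 'SEO']
```

Because the patterns are precompiled and run in-process, a full 170-entry pass over a site's text is effectively free compared to the page fetches.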
### Phase 4 — Heuristic signals and snippet extraction
Size scoring runs 8 independent checks on the full text: phrases matching `/\bi am\b/` or `/freelanc/` add to solo; "boutique agency" adds to small; explicit "team of N" phrases parse N and score the appropriate bucket; headcount patterns such as `/\d{2,3}\+?\s*experts/` do the same; "global offices in N countries", "500+ employees", and "Fortune 500" references add to large. The highest-scoring key in the map wins; ties and zero-score cases return unknown.
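A condensed sketch of the highest-score-wins heuristic follows. The bucket thresholds for "team of N" (15 and 50) are assumptions for illustration; the actor's actual cutoffs are not documented here:

```python
import re
from collections import defaultdict

def score_size(text):
    """Sketch of the scoring heuristic; the real actor runs 8 checks, this shows 4."""
    scores = defaultdict(int)
    t = text.lower()
    if re.search(r"\bi am\b", t) or "freelanc" in t:
        scores["solo"] += 1
    if "boutique agency" in t:
        scores["small"] += 1
    m = re.search(r"team of (\d+)", t)
    if m:
        n = int(m.group(1))
        # Bucket cutoffs (15, 50) are illustrative assumptions
        scores["solo" if n <= 1 else "small" if n <= 15 else "mid" if n <= 50 else "large"] += 1
    if re.search(r"fortune 500|global offices|\b\d{3,}\+?\s*employees", t):
        scores["large"] += 1
    if not scores:
        return "unknown"
    best = max(scores.values())
    winners = [k for k, v in scores.items() if v == best]
    return winners[0] if len(winners) == 1 else "unknown"  # ties are inconclusive

print(score_size("A boutique agency with a team of 12."))  # → small
```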
Client name extraction applies three sequential regex passes: a phrase-context pattern matching proper nouns after "clients include", "worked with", "partners with", and similar triggers; a markdown heading parser scanning H1–H3 headings on case study pages; and a "trusted by" list parser splitting on commas and "and". Each candidate passes an `isValidClientName` filter that rejects strings under 2 or over 80 characters, strings not starting with a capital letter, and a hardcoded junk set of navigation words. Results are capped at 10.
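The validity filter is the simplest of the three stages to sketch. The junk set below is illustrative; the actor's actual list of navigation words is longer:

```python
# Illustrative junk set — the actor hardcodes a longer list of navigation words
JUNK = {"Home", "About", "Services", "Contact", "Blog"}

def is_valid_client_name(name):
    """Filter from Phase 4: length 2–80, starts with a capital, not a navigation word."""
    return (
        2 <= len(name) <= 80
        and name[:1].isupper()
        and name not in JUNK
    )

candidates = ["Acorn Capital", "about us", "Home", "Pinnacle Health"]
print([c for c in candidates if is_valid_client_name(c)][:10])  # cap at 10
# → ['Acorn Capital', 'Pinnacle Health']
```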
Tone extraction tests 55 positioning adjectives against only the first 2,000 characters of the assembled text (the homepage hero). Each adjective is escaped to handle hyphens and matched with a word-boundary regex. The first 5 that match are returned.
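A sketch of the hero-window scan, using a four-adjective slice of the real 55-adjective list:

```python
import re

# Illustrative slice — the actor matches 55 positioning adjectives
TONE_ADJECTIVES = ["data-driven", "boutique", "results-oriented", "transparent"]

def extract_tone(text, max_keywords=5, hero_chars=2000):
    """Scan only the first hero_chars characters; return up to max_keywords hits."""
    hero = text[:hero_chars]
    found = []
    for adj in TONE_ADJECTIVES:
        # re.escape handles hyphenated adjectives like "data-driven"
        if re.search(r"\b" + re.escape(adj) + r"\b", hero, re.I):
            found.append(adj)
        if len(found) == max_keywords:
            break
    return found

print(extract_tone("We are a data-driven, boutique growth partner."))
# → ['data-driven', 'boutique']
```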
Summary snippet prefers the `metaDescription` field from the sub-actor response if it is at least 20 characters. It falls back to the first non-heading paragraph of at least 20 characters in the full text, found via `/(?:^|\n\n)([^#\n][^\n]{20,})/m`. Both are truncated to 200 characters.
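The fallback chain is easy to reproduce with the same regex (a sketch of the described logic, not the actor's source):

```python
import re

def extract_snippet(meta_description, full_text, max_len=200):
    """Meta description first (if at least 20 chars), else first substantial paragraph."""
    if meta_description and len(meta_description) >= 20:
        return meta_description[:max_len]
    # First non-heading line of 20+ characters, as in the actor's pattern
    m = re.search(r"(?:^|\n\n)([^#\n][^\n]{20,})", full_text, re.M)
    return m.group(1)[:max_len] if m else ""

text = "# Heading\n\nWe help B2B SaaS companies drive predictable pipeline growth."
print(extract_snippet("", text))
# → We help B2B SaaS companies drive predictable pipeline growth.
```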
## Tips for best results
- Provide homepage URLs only. The actor appends `/about`, `/services`, and `/case-studies` automatically. If you pass a subpage URL, the constructed additional pages will be based on that path and will likely retrieve less useful content.
- Use `maxPagesPerSite: 4` as your default. The four targeted pages are where agencies concentrate the language the taxonomies match against. Additional pages tend to add noise without proportional accuracy gains.
- Run homepage-only scans first on large lists. For a batch of 500+ sites, set `maxPagesPerSite: 1` to classify and filter quickly, then re-run top candidates at depth 4 for full enrichment.
- Combine with Email Pattern Finder. After collecting `domain` from this actor, pass it to Email Pattern Finder to discover the email naming convention at each agency before writing outreach.
- Feed `summarySnippet` directly into email copy. The snippet is sized and structured for use as context in a personalized opening line. It requires no editing for many outreach templates.
- Filter on `analysisStatus` before importing to your CRM. Records with `"failed"` have no usable data. Records with `"partial"` are usable but may be missing service or client data from pages that did not load.
- Re-run failed sites with a residential proxy. If a batch shows high `failed` counts, enable `apifyProxyGroups: ["RESIDENTIAL"]` and re-run only the failed domains. Most failures on agency sites are IP-level blocks that residential proxies bypass.
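The retry workflow can be scripted: collect the failed URLs from a finished run's dataset, then build the re-run input with residential proxies. The sample items below are illustrative; in practice you would iterate a real dataset via the Apify client:

```python
def failed_websites(dataset_items):
    """Pick out input URLs whose analysis failed, skipping the summary record."""
    return [
        item["website"]
        for item in dataset_items
        if item.get("type") != "summary" and item.get("analysisStatus") == "failed"
    ]

def retry_input(websites):
    """Build the re-run input with residential proxies enabled."""
    return {
        "websites": websites,
        "maxPagesPerSite": 4,
        "proxyConfiguration": {"useApifyProxy": True, "apifyProxyGroups": ["RESIDENTIAL"]},
    }

# Illustrative dataset items in the actor's output shape
items = [
    {"website": "https://a-agency.example", "analysisStatus": "complete"},
    {"website": "https://b-agency.example", "analysisStatus": "failed"},
    {"type": "summary", "failed": 1},
]
print(retry_input(failed_websites(items))["websites"])  # → ['https://b-agency.example']
```

Pass the resulting dictionary as `run_input` to `client.actor(...).call(...)`, as shown in the Python API example above.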
## Combine with other Apify actors
| Actor | How to combine |
|---|---|
| Website Contact Scraper | Run after this actor to pull emails and phone numbers from the same agency sites — pipe domain directly as input. |
| Email Pattern Finder | Pass each domain from output to discover the email naming convention before outreach. |
| Waterfall Contact Enrichment | Feed enriched agency profiles into a 10-step contact waterfall to find specific decision-maker emails. |
| B2B Lead Qualifier | Score analyzed agencies 0–100 using 30+ signals — servicesOffered, sizeSignal, and techStackSignals map directly to qualifier inputs. |
| HubSpot Lead Pusher | Push structured analysis records into HubSpot as company properties immediately after a run completes. |
| Website Content to Markdown | The sub-actor that powers page fetching inside this actor. Use it directly for raw markdown extraction of any website. |
| Google Maps Email Extractor | Discover local agencies via a Google Maps search, then feed the resulting website URLs into this actor for content analysis. |
## Limitations
- No JavaScript-rendered content fallback at the top level. The Website Content to Markdown sub-actor handles rendering, but heavily client-side-rendered SPAs with no server-side rendering may return sparse content. Check `pagesAnalyzed` and `analysisStatus` in the output to identify these cases.
- Client name extraction is heuristic, not guaranteed. The patterns work well on agencies with standard page structures (case study pages, "trusted by" sections). Sites with custom layouts or image-only client logo grids will return fewer or zero client names.
- Size signal is an estimate. The heuristic scoring relies on explicit text signals. Agencies that do not mention team size anywhere on their public site will return `"unknown"`. The signal does not account for holding companies or multi-brand structures.
- Taxonomy matching is English-only. The service, industry, and tech stack patterns are English-language. Non-English agency sites will return low or empty taxonomy results.
- Fixed subpage priority. The actor always attempts `/about`, `/services`, and `/case-studies` in that order. Agencies that use different URL structures (e.g., `/our-work`, `/what-we-do`) will have those pages missed regardless of `maxPagesPerSite`.
- No historical comparison. Each run analyzes current live content only. Use Website Change Monitor if you need to track content changes over time.
- Sequential processing. Websites are analyzed one at a time. A batch of 100 sites with `maxPagesPerSite: 4` typically takes 60–90 minutes. There is no parallel mode.
- No image or PDF content. Taxonomy matching runs against extracted text only. Service descriptions embedded in infographics, PDFs, or image-only content will not be detected.
## Integrations
- Zapier — trigger a content analysis run when a new lead is added to a spreadsheet, then push the enriched record into your CRM automatically
- Make — build multi-step workflows that run the actor, filter on `servicesOffered`, and route records to different sales sequences based on agency type
- Google Sheets — export results directly to a Google Sheet for collaborative prospect review and prioritization
- Apify API — trigger runs programmatically from your Python or JavaScript pipeline with full control over input and output retrieval
- Webhooks — receive a POST notification when a run finishes so your downstream process can immediately consume the new dataset
- LangChain / LlamaIndex — use `summarySnippet` and service arrays as grounding context for LLM-generated personalized email drafts in an AI outreach pipeline
## Troubleshooting
- High `failed` count in the summary record. Most failures are caused by sites that block scrapers at the IP level. Confirm `proxyConfiguration` has `useApifyProxy: true` (enabled by default) and re-run failed domains. For persistent failures, switch to `apifyProxyGroups: ["RESIDENTIAL"]` for that subset.
- `servicesOffered` or `industriesServed` are empty for a site you know has services listed. Check `pagesAnalyzed` first. If it is 0 or `analysisStatus` is `"failed"`, no text was retrieved — a proxy or access issue. If pages were retrieved but results are still empty, the site likely uses service language outside the current taxonomy patterns, or the relevant content is embedded in images rather than text.
- Run is taking longer than expected. Each website involves one sub-actor call plus page fetching. A 100-site batch with `maxPagesPerSite: 4` makes up to 400 HTTP requests and typically takes 60–90 minutes. Reduce `maxPagesPerSite` to 1–2 for faster throughput at the cost of analysis depth.
- `clientNames` is empty for an agency that lists clients prominently. Client name extraction works on text-based lists, testimonials, and case study headings. If clients are shown only as image logos, no names will be extracted. There is no workaround for image-only client grids in the current version.
- Spending limit reached before all sites were processed. The actor stops cleanly at the spending cap. The summary record shows how many sites were processed. Re-run with the remaining URLs or increase your spending limit before starting a large batch.
## Responsible use
- This actor only accesses publicly available website content.
- Respect website terms of service and `robots.txt` directives. Do not use this actor against sites that prohibit automated access.
- Comply with GDPR, CAN-SPAM, and other applicable data protection laws when using extracted data for outreach.
- Do not use extracted data for spam, harassment, or unauthorized commercial purposes.
- For guidance on web scraping legality, see Apify's guide.
## FAQ
How many agency websites can Website Content Analyzer process in one run?
There is no hard limit on the number of sites per run. The actor processes websites sequentially and stops only if your spending limit is reached or the 60-minute timeout expires. In practice, a single run handles 50–200 sites comfortably depending on page depth. For batches over 500 sites, consider splitting into multiple runs of 200 each to stay within the default 1-hour timeout.
Does website content analysis work on any type of website, or only marketing agencies?
The service and industry taxonomies are built for marketing agencies and professional service firms. The actor will still extract tech stack signals, client names, size signals, and summary snippets from any business website — but servicesOffered output is most accurate for agencies, consultancies, and digital service companies. It is less relevant for e-commerce product sites or SaaS platforms.
What is the difference between partial and complete analysis status?
`complete` means all the pages you requested (up to `maxPagesPerSite`) were successfully fetched. `partial` means at least one page was fetched but fewer than requested — typically because /case-studies or /services returned a 404 or was blocked. `partial` results are still useful; they just have less source text, which may reduce matched service and client counts.
Does Website Content Analyzer use any LLM or AI APIs internally?
No. All analysis is pure regex and keyword matching against hardcoded taxonomies compiled into the actor. There are no calls to OpenAI, Anthropic, Gemini, or any other AI API. This keeps costs predictable and results deterministic — the same URL run twice will always return the same output (assuming the site content has not changed).
How accurate is the size signal?
The size signal is a best-effort heuristic based on explicit text signals in site content. It is accurate when an agency states team size directly ("a team of 12", "85+ specialists"). It defaults to "unknown" when no size signals are present, which is common for agencies that do not publish headcount. Do not use this field as a hard filter for enterprise sales targeting without manual validation on important accounts.
Can I schedule this actor to run weekly and keep my prospect database fresh?
Yes. Use Apify's built-in scheduler to trigger a run on any interval. Combine with Website Change Monitor to detect which sites have updated content since the last run, then re-analyze only those sites to minimize costs.
How is Website Content Analyzer different from using Clay or Apollo for agency research?
Clay and Apollo provide contact-level data (emails, LinkedIn profiles, job titles) from their own databases. This actor extracts the content of the agency's own website — what services they actually claim to offer, which industries they explicitly serve, which tools they name-drop, and how they describe their positioning. The two approaches are complementary: use this actor to profile the agency, then use Waterfall Contact Enrichment to find the specific contact email for your outreach.
Is it legal to scrape agency websites?
Scraping publicly available website content is generally legal in most jurisdictions, as affirmed by multiple court rulings including hiQ Labs v. LinkedIn. The key requirements are that the content is publicly accessible (no login required), you are not circumventing technical access controls, and you comply with applicable data protection laws when using the data for outreach. See Apify's guide on web scraping legality for a detailed breakdown.
Can I run Website Content Analyzer via the API from my Python pipeline?
Yes. The actor is fully API-accessible. See the Python code example in this README — it requires only the apify-client package and your API token. Results are available as a dataset that you can iterate, filter, and pipe into any downstream system.
What happens if a website is offline or returns a 500 error during the run?
The sub-actor marks pages returning errors as failed. If all pages fail for a given site, analysisStatus is set to "failed" and the record is pushed with empty arrays for all extracted fields. Failed sites are not charged. The run continues to the next site in the list rather than stopping.
How many pages does the actor actually fetch per site?
By default, 4 pages: homepage, /about, /services, and /case-studies. If any of those paths returns a 404 or error, it is counted as failed for that site. Set maxPagesPerSite: 1 to fetch only the homepage, or up to 10 for additional depth (the priority queue stays the same: homepage first, then the three named subpages, then no additional paths beyond that in the current version).
Does Website Content Analyzer support non-English websites?
The taxonomy patterns are English-language only. Sites that describe their services in French, German, Spanish, or other languages will not match the service, industry, or tech stack taxonomies. The summarySnippet and clientNames extraction will still function for non-English content since they use structural patterns rather than vocabulary matching, but the taxonomy arrays will return empty.
## Help us improve
If you encounter issues, you can help us debug faster by enabling run sharing in your Apify account:
- Go to Account Settings > Privacy
- Enable Share runs with public Actor creators
This lets us see your run details when something goes wrong, so we can fix issues faster. Your data is only visible to the actor developer, not publicly.
## Support
Found a bug or have a feature request? Open an issue in the Issues tab on this actor's page. For custom solutions or enterprise integrations, reach out through the Apify platform.