Researcher Integrity Check

Researcher integrity screening across 7 academic databases in a single run — for universities, journals, grant committees, and pharmaceutical companies that need to verify a researcher's publication record before making high-stakes decisions. The actor queries OpenAlex, ORCID, PubMed, Semantic Scholar, Crossref, CORE, and NIH grants in parallel, then applies four scoring models to produce a composite integrity verdict with specific, actionable signals.

Pricing

Pay Per Event model. You only pay for what you use.

Event | Description | Price
analysis-run | Full intelligence analysis run | $0.30

Example: 100 events = $30.00 · 1,000 events = $300.00

Documentation

Screening a researcher manually — cross-referencing Retraction Watch, reviewing ORCID profiles, scanning publication velocity across databases, and checking NIH grant records — takes a skilled research librarian four to eight hours per subject. This actor completes the same screening in under 90 seconds using Benford's Law citation analysis, paper mill pattern recognition, HHI funding concentration scoring, and publication velocity spike detection. No code required.

What data can you extract?

Data Point | Source | Example
🔍 Composite integrity score | All 7 sources | 42 (0-100, higher = more concern)
⚖️ Verdict | Scoring engine | INVESTIGATION_NEEDED
📄 Retraction/correction flags | OpenAlex, PubMed, Semantic Scholar | 12 points detected
📊 Citation anomaly score | Benford's Law analysis | 5 anomalies flagged
🚩 Publication velocity flags | Year-over-year spike detection | 42 papers in 2020 — suspicious
🏭 Paper mill level | Template + journal + author analysis | POSSIBLE
📰 Journal quality level | Citation impact + open access ratio | HIGH
💰 Funding risk level | NIH grants, HHI concentration | LOW
🔬 Publication count | OpenAlex + PubMed + Semantic Scholar | 85 papers across databases
✅ ORCID profile status | ORCID | Verified — 3 affiliations
⚠️ Required actions | Scoring engine | Review retracted publications
📅 Analysis timestamp | Run metadata | 2026-03-20T09:14:22.000Z

Why use Researcher Integrity Check?

Research integrity failures are costly. A university that hires a faculty member with undisclosed retractions faces reputational damage, grant clawback, and potential liability. A journal that publishes a paper from a known paper mill co-author faces post-publication correction. A pharma company that relies on manipulated clinical trial data from a key opinion leader faces regulatory consequences.

Manual screening is neither fast nor systematic. Cross-referencing Retraction Watch gives you retractions but misses citation rings. Reviewing ORCID profiles reveals affiliations but not anomalous velocity. Searching PubMed separately from Semantic Scholar means missed papers. This actor automates the entire multi-source process, applies statistical models that humans cannot replicate at scale, and returns a structured, decision-ready report.

  • Scheduling — run integrity checks on a cohort of grant applicants weekly, or monitor a faculty member's publication record monthly
  • API access — trigger screening from Python, JavaScript, or any HTTP client as part of an automated hiring or grant review workflow
  • Proxy rotation — the actor uses Apify's built-in infrastructure to query academic APIs reliably at scale without rate-limiting failures
  • Monitoring — get Slack or email alerts when a run produces a HIGH_RISK verdict, so reviewers are notified immediately
  • Integrations — push verdicts to Google Sheets, Zapier, or your HRIS/grant management system via webhooks

Features

  • 7 parallel academic database queries — OpenAlex, ORCID, PubMed, Semantic Scholar, Crossref, CORE, and NIH grants are all queried simultaneously, not sequentially, so a full screen takes 60-90 seconds regardless of how many sources return data
  • Retraction and correction detection — scans all retrieved papers for titles and publication types containing "retracted", "correction", "erratum", and "expression of concern", assigning weighted penalty scores (5 points per retraction, 2 per correction, 3 per expression of concern), capped at 35 points — see the scoring sketch after this list
  • Benford's Law citation analysis — applies the leading-digit frequency test to the researcher's citation distribution; if the digit 1 appears as a leading digit below 15% or above 50% of the time, the model flags citation anomalies consistent with manipulation or self-citation rings
  • Coefficient of variation (CV) uniformity check — detects suspiciously uniform citation counts (CV < 0.3 across 10+ papers), a pattern associated with coordinated citation swaps rather than organic academic impact
  • Publication velocity spike detection — flags any year where output exceeds 30 papers, and any year-over-year spike where output triples while reaching at least 10 papers — thresholds derived from documented paper mill cases
  • Paper mill title template matching — extracts the first 5 words of every paper title and identifies repeated patterns (3+ occurrences), a signature of factory-produced manuscripts
  • Journal over-concentration detection — flags when 50%+ of a researcher's papers across databases appear in a single journal, a pattern consistent with editorial collusion or predatory venue dependency
  • Author group repetition analysis — identifies when the same co-author roster appears across many papers relative to total paper count, indicating a closed citation ring
  • HHI funding concentration scoring — applies the Herfindahl-Hirschman Index to the researcher's NIH grant portfolio; high HHI (>0.7 across 3+ grants) signals single-source dependency risk
  • Terminated and withdrawn grant detection — scans all NIH grant records for "terminated", "withdrawn", and "suspended" status, a compliance concern indicator
  • ORCID profile completeness penalty — missing profiles, profiles with no listed works, and profiles with no institutional affiliations each receive penalty points, since legitimate active researchers maintain complete profiles
  • 4-model weighted composite score — combines researcher integrity (30%), paper mill detection (25%), journal quality inverted (25%), and funding risk (20%) into a single 0-100 composite with automatic HIGH_RISK override for CRITICAL or CONFIRMED_MILL verdicts
  • Actionable required-actions list — the report outputs specific follow-up tasks, not just scores, so reviewers know exactly what to investigate next
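
The weighted retraction penalty described above can be expressed as a minimal sketch. This is illustrative only, assuming simplified paper records with title and type fields rather than the actor's exact source; the weights and 35-point cap come from the feature description:

// A minimal sketch, assuming simplified paper records with `title` and `type`
// fields; weights and the 35-point cap are taken from the feature list above.
function retractionPenalty(papers) {
  let points = 0;
  for (const paper of papers) {
    const text = `${paper.title ?? ""} ${paper.type ?? ""}`.toLowerCase();
    if (text.includes("retract")) points += 5;                    // retraction: 5 points
    else if (text.includes("expression of concern")) points += 3; // EoC: 3 points
    else if (text.includes("correction") || text.includes("erratum")) points += 2;
  }
  return Math.min(points, 35); // capped at 35 points
}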

Use cases for researcher integrity screening

University hiring and tenure review

Faculty hiring committees and provosts' offices need to verify that candidates have not concealed retracted publications or inflated their h-index through citation manipulation. A single undetected integrity case discovered after tenure causes institutional embarrassment, potential grant clawback from NIH and NSF, and months of internal investigation. This actor produces a structured report that a research integrity officer can attach directly to a hiring dossier, flagging any patterns that warrant deeper review before an offer is extended.

Journal peer review and editorial screening

Journal editors receive thousands of submissions annually from researchers they have never encountered. Running an integrity screen on corresponding authors before assigning reviewers takes 90 seconds and surfaces paper mill indicators, retraction history, and predatory journal over-concentration. Editors at high-volume journals can integrate this actor into their submission management workflow via API, triggering a check automatically when a new manuscript enters the system.

Grant committee evaluation

NIH study sections, NSF panels, and university internal funding committees assess applicant credibility as part of scoring. A researcher with a CONFIRMED_MILL paper mill level or terminated grants should receive additional scrutiny before awards are recommended. This actor produces the grant portfolio analysis, publication velocity data, and citation quality assessment that committee members need, derived from public grant records and open academic databases.

Pharmaceutical and biotech due diligence

Pharmaceutical companies engage key opinion leaders, clinical trial investigators, and advisory board members whose publications directly influence regulatory submissions and prescribing decisions. Engaging an investigator with manipulated trial data or undisclosed retractions creates downstream regulatory and liability exposure. Research integrity screening should be part of the standard vendor and KOL onboarding checklist alongside conflict-of-interest disclosures.

Science journalism and investigative research

Journalists covering research misconduct allegations need structured publication data to support their reporting. This actor aggregates paper counts, retraction flags, citation anomalies, and velocity spikes from multiple databases into a single structured report that can be used as a documentary foundation, alongside primary source verification through institutional channels and direct interviews.

Academic publishing compliance and audit

Research institutions conducting retrospective audits of their faculty publication portfolios — whether triggered by whistleblower complaints, funding agency inquiries, or proactive integrity reviews — can use this actor to batch-screen multiple researchers via API. The structured JSON output integrates with institutional data systems, and the required-actions field provides a prioritized follow-up list for each subject.

How to screen a researcher for integrity concerns

  1. Enter the researcher's full name — Type the name exactly as it appears in their publications (e.g., "Yoshihiro Sato" or "Joachim Boldt"). Full names produce more accurate cross-database matching than initials.
  2. Add institution and field (recommended) — Entering the researcher's affiliated institution (e.g., "University of Zurich") and research field (e.g., "orthopedic surgery") narrows results and reduces false matches for common names. Both fields are optional but significantly improve accuracy.
  3. Click Start and wait 60-90 seconds — The actor queries all 7 academic databases in parallel. Most runs complete in under 90 seconds. Researchers with very large publication portfolios may take up to 2 minutes.
  4. Download the report — Go to the Dataset tab and export your results as JSON, CSV, or Excel. The verdict field gives you the summary decision; allSignals and requiredActions tell you exactly what was found and what to do next.

Input parameters

Parameter | Type | Required | Default | Description
researcherName | string | Yes | – | Full name of the researcher to screen (e.g., "Yoshihiro Sato", "Joachim Boldt")
institution | string | No | – | Affiliated institution to narrow results (e.g., "Harvard University", "Max Planck Institute")
field | string | No | – | Research discipline to focus the analysis (e.g., "oncology", "materials science", "machine learning")

Input examples

Standard integrity screen with full context:

{
  "researcherName": "Yoshihiro Sato",
  "institution": "Tohoku University",
  "field": "orthopedic surgery"
}

Batch-style screen by name only (common workflow for large cohorts):

{
  "researcherName": "Joachim Boldt"
}

Targeted screen with field to reduce false matches on common names:

{
  "researcherName": "Wei Chen",
  "field": "materials science"
}

Input tips

  • Always include institution for common names — names like "Wei Chen" or "John Smith" return many false matches across academic databases. Adding the institution resolves the correct researcher with far greater accuracy.
  • Use the researcher's publication name — some researchers publish under a different name than their legal name. Use the name appearing on their publications, not their HR record.
  • Field narrows the query string — the field input is appended to the search query sent to all databases, so use discipline-specific terminology that appears in publication metadata (e.g., "oncology" rather than "cancer research").
  • Run without institution first if unsure — if you don't know the researcher's current affiliation, run without it. The ORCID module will independently identify verified affiliations.

Output example

{
  "researcher": "Yoshihiro Sato",
  "institution": "Tohoku University",
  "field": "orthopedic surgery",
  "analysisTimestamp": "2026-03-20T09:14:22.481Z",
  "compositeScore": 68,
  "verdict": "HIGH_RISK",
  "allSignals": [
    "Retraction/correction flags detected — 35 points",
    "Publication spike: 8 → 31 papers (2009 → 2010)",
    "Citation distribution suspiciously uniform — potential citation manipulation",
    "High paper-to-grant ratio: 28:1 — possible padding",
    "Low citation impact: avg 1.4 — possible predatory venue"
  ],
  "requiredActions": [
    "Review retracted publications — determine scope of affected research",
    "Citation pattern anomalies — check for citation rings or manipulation"
  ],
  "scoring": {
    "researcherIntegrity": {
      "score": 72,
      "publicationCount": 112,
      "retractionFlags": 35,
      "citationAnomalies": 5,
      "velocityRedFlags": 8,
      "integrityLevel": "CRITICAL",
      "signals": [
        "Retraction/correction flags detected — 35 points",
        "Publication spike: 8 → 31 papers (2009 → 2010)",
        "Citation distribution suspiciously uniform — potential citation manipulation"
      ]
    },
    "paperMill": {
      "score": 22,
      "suspiciousPatterns": 2,
      "templateFlags": 3,
      "millLevel": "POSSIBLE",
      "signals": []
    },
    "journalQuality": {
      "score": 28,
      "totalPapers": 112,
      "highCitationPapers": 9,
      "openAccessRatio": 0.12,
      "qualityLevel": "LOW",
      "signals": [
        "Low citation impact: avg 1.4 — possible predatory venue"
      ]
    },
    "fundingRisk": {
      "score": 38,
      "grantCount": 4,
      "flaggedGrants": 0,
      "fundingConcentration": 0.58,
      "riskLevel": "ELEVATED",
      "signals": [
        "High paper-to-grant ratio: 28:1 — possible padding"
      ]
    }
  },
  "rawDataSummary": {
    "openalexResults": [],
    "orcidResults": [],
    "pubmedResults": [],
    "semanticScholarResults": [],
    "crossrefResults": [],
    "coreResults": [],
    "nihGrantResults": []
  },
  "metadata": {
    "subActorsRun": 7,
    "scoringModels": [
      "researcherIntegrity",
      "paperMill",
      "journalQuality",
      "fundingRisk",
      "compositeIntegrity"
    ],
    "dataSourceCounts": {
      "openalex-research-papers": 47,
      "orcid-researcher-search": 1,
      "pubmed-research-search": 38,
      "semantic-scholar-search": 27,
      "crossref-paper-search": 22,
      "core-academic-search": 19,
      "nih-research-grants": 4
    }
  }
}

Output fields

Field | Type | Description
researcher | string | Researcher name as provided in input
institution | string or null | Institution as provided in input, or null
field | string or null | Research field as provided in input, or null
analysisTimestamp | string | ISO 8601 timestamp of when the analysis ran
compositeScore | number | Weighted composite integrity concern score (0-100; higher = more concern)
verdict | string | Summary verdict: CLEAR, MINOR_CONCERNS, INVESTIGATION_NEEDED, or HIGH_RISK
allSignals | string[] | All human-readable signals detected across all four models
requiredActions | string[] | Prioritized list of follow-up actions based on findings
scoring.researcherIntegrity.score | number | Integrity model score (0-100)
scoring.researcherIntegrity.integrityLevel | string | CLEAN, MINOR_FLAGS, SUSPICIOUS, HIGH_RISK, or CRITICAL
scoring.researcherIntegrity.retractionFlags | number | Raw retraction/correction penalty points before capping
scoring.researcherIntegrity.citationAnomalies | number | Citation anomaly count from Benford's Law and CV analysis
scoring.researcherIntegrity.velocityRedFlags | number | Publication velocity raw penalty points
scoring.researcherIntegrity.publicationCount | number | Total papers retrieved across OpenAlex, PubMed, and Semantic Scholar
scoring.paperMill.score | number | Paper mill model score (0-100)
scoring.paperMill.millLevel | string | UNLIKELY, POSSIBLE, PROBABLE, LIKELY_MILL, or CONFIRMED_MILL
scoring.paperMill.templateFlags | number | Number of repeated title pattern instances detected
scoring.paperMill.suspiciousPatterns | number | Combined suspicious journal and author pattern count
scoring.journalQuality.score | number | Journal quality model score (0-100; higher = better quality)
scoring.journalQuality.qualityLevel | string | PREDATORY, LOW, MODERATE, HIGH, or ELITE
scoring.journalQuality.highCitationPapers | number | Number of papers with 10 or more citations
scoring.journalQuality.openAccessRatio | number | Fraction of papers that are open access (0.00-1.00)
scoring.fundingRisk.score | number | Funding risk model score (0-100)
scoring.fundingRisk.riskLevel | string | LOW, MODERATE, ELEVATED, HIGH, or CRITICAL
scoring.fundingRisk.grantCount | number | Total NIH grants found
scoring.fundingRisk.flaggedGrants | number | Grants with terminated, withdrawn, or suspended status
scoring.fundingRisk.fundingConcentration | number | HHI concentration score (0.00-1.00; >0.7 = high concentration)
rawDataSummary.openalexResults | array | Up to 20 raw paper records from OpenAlex
rawDataSummary.orcidResults | array | Up to 15 raw ORCID profile records
rawDataSummary.pubmedResults | array | Up to 20 raw PubMed records
rawDataSummary.semanticScholarResults | array | Up to 20 raw Semantic Scholar records
rawDataSummary.crossrefResults | array | Up to 20 raw Crossref records
rawDataSummary.coreResults | array | Up to 20 raw CORE records
rawDataSummary.nihGrantResults | array | Up to 15 raw NIH grant records
metadata.subActorsRun | number | Always 7 — the number of parallel sub-actor calls
metadata.scoringModels | string[] | Names of all scoring models applied
metadata.dataSourceCounts | object | Record count from each of the 7 data sources

How much does it cost to screen a researcher for integrity?

Researcher Integrity Check uses pay-per-event pricing — you pay $0.30 per researcher screened (one analysis-run event per screen). Platform compute costs are included. Apify's free tier includes $5 of monthly credits, covering roughly 16 integrity checks per month at no charge.

Scenario | Researchers screened | Cost per researcher | Total cost
Quick test | 1 | $0.30 | $0.30
Small cohort (grant applicants) | 10 | $0.30 | $3.00
Medium cohort (faculty candidates) | 50 | $0.30 | $15.00
Large audit (department review) | 200 | $0.30 | $60.00
Enterprise (institution-wide annual audit) | 1,000 | $0.30 | $300.00

You can set a maximum spending limit per run to control costs. The actor stops when your budget is reached.

Compare this to manual research integrity screening at $75-150/hour for a trained research librarian: at four to eight hours per subject, a comprehensive manual screen costs $600-1,200 in labor per researcher. At $3.00 for a 10-researcher cohort, this actor covers the statistical signal-detection layer that should precede any human deep-dive, prioritizing which researchers actually require investigator time.

Researcher integrity screening using the API

Python

from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("ryanclinton/researcher-integrity-check").call(run_input={
    "researcherName": "Yoshihiro Sato",
    "institution": "Tohoku University",
    "field": "orthopedic surgery"
})

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"Researcher: {item['researcher']}")
    print(f"Verdict: {item['verdict']} (score: {item['compositeScore']}/100)")
    print(f"Integrity level: {item['scoring']['researcherIntegrity']['integrityLevel']}")
    print(f"Paper mill level: {item['scoring']['paperMill']['millLevel']}")
    for signal in item.get("allSignals", []):
        print(f"  - {signal}")
    for action in item.get("requiredActions", []):
        print(f"  ACTION: {action}")

JavaScript

import { ApifyClient } from "apify-client";

const client = new ApifyClient({ token: "YOUR_API_TOKEN" });

const run = await client.actor("ryanclinton/researcher-integrity-check").call({
    researcherName: "Yoshihiro Sato",
    institution: "Tohoku University",
    field: "orthopedic surgery"
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
for (const item of items) {
    console.log(`Researcher: ${item.researcher}`);
    console.log(`Verdict: ${item.verdict} (score: ${item.compositeScore}/100)`);
    console.log(`Integrity level: ${item.scoring.researcherIntegrity.integrityLevel}`);
    console.log(`Paper mill level: ${item.scoring.paperMill.millLevel}`);
    for (const signal of item.allSignals || []) {
        console.log(`  - ${signal}`);
    }
    for (const action of item.requiredActions || []) {
        console.log(`  ACTION: ${action}`);
    }
}

cURL

# Start the researcher integrity check
curl -X POST "https://api.apify.com/v2/acts/ryanclinton~researcher-integrity-check/runs?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "researcherName": "Yoshihiro Sato",
    "institution": "Tohoku University",
    "field": "orthopedic surgery"
  }'

# Fetch results once the run finishes (replace DATASET_ID from the run response)
curl "https://api.apify.com/v2/datasets/DATASET_ID/items?token=YOUR_API_TOKEN&format=json"

How Researcher Integrity Check works

Phase 1 — Parallel multi-source data collection

The actor fires 7 sub-actor calls simultaneously using Promise.allSettled, ensuring that a timeout or failure in any single source does not block the others. Each sub-actor is allocated 256 MB of memory and a 120-second timeout. Results are collected from the default dataset of each sub-actor run, with a cap of 1,000 items per source. The data is keyed by source name and passed as a unified Record<string, unknown[]> to all four scoring functions.

The query sent to each source is constructed by concatenating researcherName, institution (if provided), and field (if provided). ORCID is queried with the researcher name only, since ORCID's search API matches on personal identity rather than publication metadata. NIH grants are also queried by name only.
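
As a minimal sketch of this fan-out — the sub-actor IDs (the owner prefix) and the query input field are assumptions, not the actor's exact source; the Promise.allSettled pattern, the 256 MB / 120 s limits, the 1,000-item cap, and the name-only handling for ORCID and NIH are taken from the description above:

import { ApifyClient } from "apify-client";

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// The seven source names match metadata.dataSourceCounts in the output.
const SOURCES = [
  "openalex-research-papers",
  "orcid-researcher-search",
  "pubmed-research-search",
  "semantic-scholar-search",
  "crossref-paper-search",
  "core-academic-search",
  "nih-research-grants",
];

async function collectSources({ researcherName, institution, field }) {
  const fullQuery = [researcherName, institution, field].filter(Boolean).join(" ");

  const settled = await Promise.allSettled(
    SOURCES.map(async (source) => {
      // ORCID and NIH grants are queried by name only.
      const query = /orcid|nih/.test(source) ? researcherName : fullQuery;
      const run = await client.actor(`owner/${source}`).call(
        { query },
        { memory: 256, timeout: 120 }, // per-sub-actor limits from the description
      );
      const { items } = await client
        .dataset(run.defaultDatasetId)
        .listItems({ limit: 1000 }); // cap of 1,000 items per source
      return items;
    }),
  );

  // A timed-out or failed source contributes an empty array instead of
  // aborting the run (see Limitations below).
  return Object.fromEntries(
    SOURCES.map((source, i) => [
      source,
      settled[i].status === "fulfilled" ? settled[i].value : [],
    ]),
  );
}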

Phase 2 — Four independent scoring models

Researcher Integrity model aggregates all papers from OpenAlex, PubMed, and Semantic Scholar into a single pool and scans title and publication type fields for retraction and correction indicators. It then extracts citation counts from cited_by_count, citationCount, and citations fields (normalized across schema differences between sources) and applies Benford's Law leading-digit analysis. The coefficient of variation (standard deviation divided by mean) is computed across all citation values; a CV below 0.3 with 10 or more papers triggers the citation uniformity flag. The year of publication is parsed from publication_date, date, or year fields, and annual counts are sorted chronologically to detect years in which output triples over the prior year.
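
A minimal sketch of the two citation checks, using the thresholds stated above and assuming the citation counts have already been extracted into a flat number array:

// Simplified sketch of the Benford and uniformity checks; input is an array
// of citation counts already normalized from the source-specific fields.
function citationAnomalySignals(citations) {
  const values = citations.filter((c) => typeof c === "number" && c > 0);
  const signals = [];
  if (values.length >= 10) {
    // Benford check: share of counts whose leading digit is 1.
    const oneShare =
      values.filter((c) => String(c)[0] === "1").length / values.length;
    if (oneShare < 0.15 || oneShare > 0.5) {
      signals.push("Leading-digit distribution inconsistent with Benford's Law");
    }
    // Uniformity check: coefficient of variation (stddev / mean) below 0.3.
    const mean = values.reduce((a, b) => a + b, 0) / values.length;
    const variance =
      values.reduce((a, b) => a + (b - mean) ** 2, 0) / values.length;
    if (Math.sqrt(variance) / mean < 0.3) {
      signals.push("Citation distribution suspiciously uniform");
    }
  }
  return signals;
}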

Paper Mill Detection model uses three pattern detectors operating on OpenAlex, Crossref, and CORE data. The title template detector extracts the first five whitespace-delimited words of every paper title, maps them to a frequency counter, and flags any pattern appearing three or more times. The journal concentration detector identifies any journal (normalized to lowercase from journal, venue, source, or container_title fields) accounting for 50% or more of the researcher's papers across at least five papers. The author diversity detector normalizes and sorts co-author rosters into canonical strings; if the unique roster count falls below 30% of total papers with at least 10 papers in the pool, it flags low author diversity.
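
A simplified sketch of the three detectors, assuming papers already normalized to title, journal/venue/source, and an authors string array (field handling is illustrative):

// Sketch of the three paper mill detectors described above.
function paperMillSignals(papers) {
  const signals = [];

  // 1. Title templates: first five whitespace-delimited words, 3+ repeats.
  const templates = new Map();
  for (const p of papers) {
    const key = (p.title ?? "").toLowerCase().split(/\s+/).slice(0, 5).join(" ");
    if (key) templates.set(key, (templates.get(key) ?? 0) + 1);
  }
  const templateFlags = [...templates.values()].filter((n) => n >= 3).length;
  if (templateFlags > 0) signals.push(`${templateFlags} repeated title template(s)`);

  // 2. Journal concentration: one venue holding 50%+ of 5+ papers.
  const venues = new Map();
  for (const p of papers) {
    const venue = (p.journal ?? p.venue ?? p.source ?? "").toLowerCase();
    if (venue) venues.set(venue, (venues.get(venue) ?? 0) + 1);
  }
  for (const [venue, count] of venues) {
    if (papers.length >= 5 && count / papers.length >= 0.5) {
      signals.push(`Journal over-concentration: ${venue}`);
    }
  }

  // 3. Author diversity: unique co-author rosters below 30% of 10+ papers.
  const rosters = new Set(
    papers.map((p) => (p.authors ?? []).map((a) => a.toLowerCase()).sort().join("|")),
  );
  if (papers.length >= 10 && rosters.size / papers.length < 0.3) {
    signals.push("Low author group diversity — possible closed authorship ring");
  }
  return signals;
}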

Journal Quality model is the only model where a higher score is better. It scores average citation impact, open access ratio from is_oa, open_access, and openAccess fields, unique journal diversity, and publication volume health. The composite journal quality score is then inverted when calculating the final weighted composite, so elite publication venues reduce overall risk.
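
The model's inputs can be sketched as follows; the weighting of these components into the 0-100 quality score is internal to the actor and not shown here:

// Sketch of the journal quality inputs, using the field names listed above.
function journalQualityInputs(papers) {
  const citations = papers.map((p) => p.cited_by_count ?? p.citationCount ?? 0);
  const total = citations.reduce((a, b) => a + b, 0);
  return {
    avgCitations: papers.length ? total / papers.length : 0,     // citation impact
    highCitationPapers: citations.filter((c) => c >= 10).length, // 10+ citations
    openAccessRatio: papers.length
      ? papers.filter((p) => p.is_oa ?? p.open_access ?? p.openAccess).length /
        papers.length
      : 0,
    uniqueJournals: new Set(
      papers.map((p) => (p.journal ?? p.venue ?? "").toLowerCase()),
    ).size, // journal diversity
  };
}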

Funding Risk model queries only the NIH grants sub-actor results. It calculates a paper-to-grant ratio; ratios above 20:1 suggest publication padding relative to funded research. It computes the Herfindahl-Hirschman Index (sum of squared market shares) across all identified funding agency sources — an HHI above 0.7 with three or more grants flags single-source dependency. Individual grant records are scanned for "terminated", "withdrawn", and "suspended" status strings.
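
A minimal sketch of the ratio, HHI, and status checks, assuming grant records with agency and status fields (the field names are assumptions):

// Sketch of the funding risk checks described above.
function fundingRiskSignals(grants, paperCount) {
  const signals = [];

  // Paper-to-grant ratio above 20:1 suggests publication padding.
  if (grants.length > 0 && paperCount / grants.length > 20) {
    signals.push(`High paper-to-grant ratio: ${Math.round(paperCount / grants.length)}:1`);
  }

  // Herfindahl-Hirschman Index: sum of squared funding-source shares.
  const byAgency = new Map();
  for (const g of grants) {
    const agency = g.agency ?? "unknown";
    byAgency.set(agency, (byAgency.get(agency) ?? 0) + 1);
  }
  const hhi = [...byAgency.values()].reduce(
    (sum, n) => sum + (n / grants.length) ** 2, 0);
  if (grants.length >= 3 && hhi > 0.7) {
    signals.push(`High funding concentration (HHI ${hhi.toFixed(2)})`);
  }

  // Terminated, withdrawn, or suspended grants are compliance flags.
  const flagged = grants.filter((g) =>
    /terminated|withdrawn|suspended/i.test(String(g.status ?? "")));
  if (flagged.length > 0) signals.push(`${flagged.length} flagged grant(s)`);

  return signals;
}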

Phase 3 — Composite scoring and verdict assignment

The composite score is a weighted sum: (integrityScore × 0.30) + (millScore × 0.25) + ((100 - journalScore) × 0.25) + (fundingScore × 0.20). Verdict thresholds are: CLEAR below 20, MINOR_CONCERNS 20-39, INVESTIGATION_NEEDED 40-64, HIGH_RISK 65 and above. Two override conditions bypass the composite threshold: a CRITICAL integrity level or a CONFIRMED_MILL paper mill level automatically produces a HIGH_RISK verdict regardless of the composite score. The requiredActions field is populated deterministically based on which threshold conditions were triggered, giving reviewers a concrete follow-up list.
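
The formula and thresholds above translate directly into a short sketch:

// Composite scoring and verdict assignment, as stated in this section.
function compositeVerdict(scores, integrityLevel, millLevel) {
  const composite =
    scores.integrity * 0.30 +
    scores.mill * 0.25 +
    (100 - scores.journal) * 0.25 + // journal quality is inverted
    scores.funding * 0.20;

  // Override: CRITICAL integrity or CONFIRMED_MILL forces HIGH_RISK.
  if (integrityLevel === "CRITICAL" || millLevel === "CONFIRMED_MILL") {
    return { composite: Math.round(composite), verdict: "HIGH_RISK" };
  }
  let verdict = "CLEAR"; // below 20
  if (composite >= 65) verdict = "HIGH_RISK";
  else if (composite >= 40) verdict = "INVESTIGATION_NEEDED";
  else if (composite >= 20) verdict = "MINOR_CONCERNS";
  return { composite: Math.round(composite), verdict };
}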

Tips for best results

  1. Disambiguate with institution for common names. Researchers named "Wei Chen", "John Smith", or "Maria Garcia" appear hundreds of times across academic databases. Including the institution — even a partial match like "MIT" or "Stanford" — dramatically improves the signal-to-noise ratio in the retrieved paper pool.

  2. Use the raw data summary to verify results. Each source's raw records (up to 20 papers per source) are included in the output. If you see a high retraction score, check rawDataSummary.pubmedResults and rawDataSummary.openalexResults to confirm the flagged papers are actually attributed to your subject, not a name collision.

  3. Treat HIGH_RISK as a trigger for human review, not a final decision. The statistical models flag patterns; human judgment determines whether a pattern reflects misconduct or legitimate anomalies (e.g., a researcher who publishes legitimately in a single journal because they are a founding editor). The requiredActions field tells you specifically what to investigate.

  4. Batch screening via API is the most cost-effective pattern. If you are screening 20 or more researchers, trigger runs in parallel via the Apify API and collect results into a shared Google Sheet using the Google Sheets integration. A hiring committee reviewing 30 candidates can have all 30 reports ready in under 3 minutes.

  5. Combine with Company Deep Research for institution-level context. A researcher's integrity profile is more meaningful when you understand the institutional environment — research offices under funding pressure, departments with prior misconduct history, or institutions under active federal investigation.

  6. Re-run annually for ongoing monitoring. Research integrity is not static. A researcher who is CLEAR today may accumulate retractions or have grants terminated. Schedule annual re-runs for active collaborators, advisory board members, and key opinion leaders.

  7. Low data-source counts warrant caution. Check metadata.dataSourceCounts in the output. If most sources return 0 results, the researcher may be early-career, may publish under a variant name, or may be primarily in a field not well indexed by open databases. Do not interpret a CLEAR verdict with sparse data as strong confirmation of integrity.

Combine with other Apify actors

Actor | How to combine
Company Deep Research | Screen the researcher, then run their institution through Company Deep Research to assess the institutional integrity environment and funding pressures
ORCID Researcher Search | Use ORCID data independently to verify claimed affiliations before running a full integrity screen
SEC EDGAR Filing Analyzer | For pharmaceutical advisors and clinical trial investigators, cross-reference with SEC disclosures to identify undisclosed financial relationships
Sanctions Network Analysis | Combine with researcher integrity data for comprehensive KOL due diligence in regulated industries
AML Entity Screening | For international researchers, pair integrity results with AML screening to flag sanctioned-country funding relationships
Multi-Review Analyzer | Cross-reference researcher reputation signals from Trustpilot and BBB with publication integrity data for a complete due diligence picture
B2B Lead Qualifier | Build a prioritized outreach list of researchers vetted by integrity score for clinical advisory or consulting engagement

Limitations

  • Name disambiguation is imperfect for common names. The actor cannot definitively resolve researcher identity across databases. For names with more than 20 matches across sources, a small number of papers from other researchers with the same name may be included in the scoring pool. Adding institution and field reduces but does not eliminate this risk.
  • Retraction detection depends on open database indexing. Retracted papers are only flagged if the retraction notice is indexed in OpenAlex, PubMed, or Semantic Scholar with a title or type field containing "retract". Retractions not yet indexed, or those in journals not covered by these databases, will not appear.
  • NIH grants only — non-US funding sources are not assessed. The funding risk model exclusively uses NIH grant data. Researchers funded by the European Research Council, Wellcome Trust, DFG, or other non-US agencies will receive a "No NIH grants found" penalty that does not reflect their actual funding track record. Interpret this signal with caution for researchers based outside the US.
  • Open access field coverage is inconsistent. Not all databases return reliable open-access status fields. The open access ratio in the journal quality model may be understated for repositories with incomplete is_oa coverage.
  • Paper mill detection does not examine manuscript text. The model identifies structural and pattern-level signals, not content-level plagiarism, fabricated data, or image manipulation. For text-level analysis, dedicated tools such as iThenticate or Scite.ai are more appropriate.
  • Sub-actor failures produce empty arrays, not errors. If any of the 7 data source sub-actors times out or fails, its data is silently excluded from scoring. A partial failure reduces scoring accuracy. Check metadata.dataSourceCounts — any source reporting 0 that you expected to return results may have failed.
  • The actor does not access paywalled databases. Web of Science, Scopus, and Elsevier's databases are not queried. Researchers who publish predominantly in highly specialized journals indexed only in these commercial databases will have incomplete coverage.
  • Scores reflect publicly available data as of the run date. Academic databases have indexing lags of weeks to months. Very recent retractions or grants may not yet appear.

Integrations

  • Zapier — trigger an integrity check automatically when a new grant application or faculty application is submitted to your system, and route HIGH_RISK verdicts to a review queue
  • Make — build multi-step workflows that screen researchers, format the report, and send a summary email to a hiring committee
  • Google Sheets — export cohort screening results to a shared spreadsheet for committee review, with verdict and score visible in sortable columns
  • Apify API — integrate programmatic researcher screening into grant management systems, HRIS platforms, or journal submission pipelines
  • Webhooks — receive an immediate notification when a HIGH_RISK verdict is returned, triggering escalation workflows in your internal systems
  • LangChain / LlamaIndex — pipe structured integrity reports into LLM-based research assistants that can synthesize findings and draft summaries for non-technical committee members

Troubleshooting

  • Very low data source counts across all 7 sources — The researcher's name may not be well-indexed in open databases. Try running with only the researcher's last name plus a distinctive institutional keyword. Some researchers, particularly those in non-English-language institutions or early-career stages, have sparse open-access publication records. A CLEAR verdict with fewer than 5 papers total is inconclusive, not reassuring.

  • Unexpected HIGH_RISK verdict for a researcher you believe is clean — Check metadata.dataSourceCounts for all 7 sources. If one source returned a large number of results (e.g., 200+ papers), it is likely returning papers from multiple researchers with the same name. Re-run with institution and field specified to narrow the paper pool. The raw data summaries include the actual paper records — verify that the flagged titles belong to your subject.

  • Run times exceeding 3 minutes — The actor is configured with a 120-second per-sub-actor timeout. Very large data returns (researchers with 500+ papers in open databases) cause slower dataset retrieval. This is expected. If the actor times out before completing, increase the actor's timeout in the run configuration. Most researchers complete in under 90 seconds.

  • Missing NIH grant data for US-based researchers — NIH grant data availability depends on whether the researcher is listed as a principal investigator in the NIH Reporter database. Some researchers receive grants as co-investigators not listed as PI, or through industry funding not reported to NIH. A 0-grant result does not mean no federal funding exists.

  • ORCID returns 0 results — ORCID search requires an exact or near-exact name match. Hyphenated names, names with diacritics, or names with multiple common transliterations may fail to match. If ORCID returns 0 results for a researcher you know has an ORCID profile, this generates a 10-point profile penalty in the scoring but does not indicate an actual integrity concern.

Responsible use

  • This actor only accesses publicly available academic publication data, ORCID profiles, and NIH grant records.
  • Research integrity verdicts are statistical signals derived from public data, not legal findings of misconduct. Do not use output as the sole basis for employment, funding, or legal decisions.
  • Comply with applicable data protection laws when processing researcher personal information, including GDPR for researchers based in the European Union.
  • Do not use results for harassment, defamation, or reputational damage campaigns. Integrity screening is a due diligence tool, not a public accusation mechanism.
  • For guidance on web scraping and data use legality, see Apify's guide.

FAQ

How does researcher integrity screening work across 7 databases simultaneously?

The actor fires all 7 sub-actor queries — OpenAlex, ORCID, PubMed, Semantic Scholar, Crossref, CORE, and NIH grants — in parallel using Promise.allSettled, so the total run time is approximately the longest individual source response, not the sum. Results from all sources are pooled into a unified dataset before scoring begins. Most runs complete in 60-90 seconds.

How many researchers can I screen in one run?

One run screens one researcher. For batch screening, trigger multiple runs in parallel via the Apify API. A cohort of 30 researchers can be queued simultaneously and results collected within 2-3 minutes. There is no built-in batch mode, but the API pattern is straightforward for developers — see the sketch below.
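
A minimal batch sketch with the JavaScript client, fanning out one run per researcher:

import { ApifyClient } from "apify-client";

const client = new ApifyClient({ token: "YOUR_API_TOKEN" });

const cohort = [
  { researcherName: "Yoshihiro Sato", institution: "Tohoku University" },
  { researcherName: "Joachim Boldt" },
  // ...one input object per researcher
];

// Fire all runs in parallel; each run screens exactly one researcher.
const runs = await Promise.all(
  cohort.map((input) =>
    client.actor("ryanclinton/researcher-integrity-check").call(input)),
);

// Collect one report per researcher.
for (const run of runs) {
  const { items } = await client.dataset(run.defaultDatasetId).listItems();
  const [report] = items;
  console.log(report.researcher, report.verdict, report.compositeScore);
}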

Does researcher integrity screening detect all retractions?

No. Retraction detection depends on the retraction notice being indexed in OpenAlex, PubMed, or Semantic Scholar with "retract" appearing in the title or document type field. Retractions not yet indexed, those in non-English journals with partial open-access indexing, or post-publication corrections that use euphemistic language may not be detected. For definitive retraction coverage, cross-reference results with Retraction Watch manually.

What is Benford's Law and why does it apply to citation counts?

Benford's Law predicts that in large datasets of naturally occurring numbers, the digit 1 appears as the leading digit roughly 30% of the time, digit 2 about 17%, and so on down to digit 9 at about 4.6%. Legitimate citation counts follow this distribution because citations accumulate organically over time. If a researcher's citation distribution shows the digit 1 appearing less than 15% or more than 50% of the time across 10+ papers, it suggests the distribution may have been manipulated — for example, through coordinated self-citation or citation ring arrangements.

How is researcher integrity screening different from Retraction Watch or Scite.ai?

Retraction Watch is a curated database of confirmed retractions, maintained manually. This actor queries multiple open academic databases for retraction-pattern signals but does not use the Retraction Watch database directly. Scite.ai analyzes how papers are cited (supporting, contradicting, mentioning) at the reference level. This actor analyzes publication-level patterns — velocity, volume, journal concentration, author group repetition, and funding records — that Retraction Watch and Scite.ai do not cover. The tools are complementary, not substitutes.

Can researcher integrity screening prove misconduct?

No. The actor identifies statistical anomalies consistent with patterns documented in known misconduct cases. A HIGH_RISK verdict means the patterns warrant human investigation — it does not constitute evidence of fraud, fabrication, or falsification. Only institutional investigation bodies, journal editors, and funding agencies have the authority to make misconduct determinations.

How accurate is the paper mill detection model?

The paper mill model uses three proxy signals: repeated title patterns (first 5 words), single-journal over-concentration, and low author group diversity. These signals are calibrated against documented paper mill cases. The model is tuned to err toward false positives rather than false negatives — a POSSIBLE or PROBABLE verdict triggers human review, not automatic disqualification. Researchers who legitimately specialize in a narrow sub-field and co-author consistently with the same team may trigger low-severity flags.

Is it legal to screen researchers using public academic data?

Academic publication records, ORCID profiles, and NIH grant data are public by design — researchers publish in journals with the explicit intent of making their work publicly accessible, and NIH grant records are mandated to be public under federal transparency requirements. Screening public records for due diligence purposes is a standard institutional practice. That said, using screening results to make employment or funding decisions should be part of a documented due diligence process that gives subjects the opportunity to respond, consistent with fair process principles and applicable employment law in your jurisdiction.

What does the HHI funding concentration score mean?

The Herfindahl-Hirschman Index (HHI) is an economics metric for measuring market concentration. Applied to funding sources, it equals the sum of squared market shares for each funding agency. An HHI of 1.0 means all grants came from a single source; 0.25 means four equal sources. An HHI above 0.7 with three or more grants is flagged as high concentration, indicating the researcher's funding is heavily dependent on a single agency — a risk factor if that agency relationship is compromised.

Can I schedule this actor to run periodically on a list of researchers?

Yes. Use the Apify Scheduler to run the actor on a recurring schedule. For monitoring a fixed list of researchers — advisory board members, ongoing collaborators, or faculty under review — you can trigger scheduled runs via the API with a different researcher name each time, or build a wrapper actor that iterates over a list. See the Apify documentation on schedules for setup instructions.

What happens if one of the 7 sub-actors fails or times out?

The actor uses Promise.allSettled, which means a failure in one sub-actor does not stop the others. The failed source returns an empty array, its data is excluded from scoring, and the run continues. The final output includes metadata.dataSourceCounts showing exactly how many records each source returned — a 0 count for an expected source signals a potential sub-actor failure. The composite score will be less accurate with fewer data sources, but the run will complete.

How is this different from running a Google search on a researcher's name?

A Google search returns unstructured news articles, institutional pages, and publication lists that require manual interpretation. This actor returns structured, machine-readable JSON with quantitative scores, classified verdict levels, and a specific required-actions list derived from statistical analysis of publication data across 7 academic databases. For systematic due diligence on multiple researchers, structured output is essential.

Help us improve

If you encounter issues, you can help us debug faster by enabling run sharing in your Apify account:

  1. Go to Account Settings > Privacy
  2. Enable Share runs with public Actor creators

This lets us see your run details when something goes wrong, so we can fix issues faster. Your data is only visible to the actor developer, not publicly.

Support

Found a bug or have a feature request? Open an issue in the Issues tab on this actor's page. For custom solutions or enterprise integrations — such as bulk cohort screening, integration with institutional research management systems, or custom scoring model calibration — reach out through the Apify platform.


Ready to try Researcher Integrity Check?

Start for free on Apify. No credit card required.

Open on Apify Store