
Scientific Fraud Detection MCP

Scientific fraud detection as a live MCP server — screen any research topic, author, or paper for statistical fabrication, p-hacking, publication bias, citation manipulation, data duplication, causal contamination, and HARKing using 8 forensic tools backed by 16 real-time academic sources. Built for research integrity officers, meta-analysts, AI coding assistants, and anyone who needs to audit the scientific literature without writing a single line of code.



Pricing

Pay Per Event model. You only pay for what you use.

| Event | Description | Price |
|---|---|---|
| audit-statistical-consistency | GRIM/SPRITE statistical forensics | $0.06 |
| analyze-p-curve-z-curve | Simonsohn p-curve and z-curve EM analysis | $0.06 |
| fit-selection-model-meta-analysis | Vevea-Hedges weight function meta-analysis | $0.08 |
| detect-citation-network-anomalies | TERGM temporal network analysis | $0.08 |
| screen-image-data-forensics | Error level analysis and Benford DCT | $0.06 |
| trace-causal-contamination | Do-calculus d-separation on claim DAG | $0.08 |
| detect-harking-bayesian-surprise | Bayesian surprise KL divergence | $0.06 |
| self-calibrate-detection | Brier score decomposition self-audit | $0.08 |

Example: 100 events = $6.00 · 1,000 events = $60.00

Connect to your AI agent

Add this MCP server to Claude Desktop, Cursor, Windsurf, or any MCP-compatible client.

MCP Endpoint
https://ryanclinton--scientific-fraud-detection-mcp.apify.actor/mcp
Claude Desktop Config
{
  "mcpServers": {
    "scientific-fraud-detection-mcp": {
      "url": "https://ryanclinton--scientific-fraud-detection-mcp.apify.actor/mcp"
    }
  }
}

Documentation


Each tool call orchestrates parallel queries across OpenAlex, PubMed, Semantic Scholar, arXiv, Crossref, CORE, Europe PMC, ORCID, DBLP, NIH Grants, ClinicalTrials.gov, Wayback Machine, GitHub, USPTO Patents, and Hacker News — then applies 11 forensic algorithms from GRIM/SPRITE statistical auditing through Vevea-Hedges selection models, TERGM citation anomaly detection, Benford DCT forensics, do-calculus contamination tracing, and Bayesian surprise HARKing detection. All structured output is returned as JSON, ready to pipe directly into downstream analysis pipelines.

What data can you extract?

| Data Point | Source | Example |
|---|---|---|
| 📄 Academic papers, citation counts, concepts | OpenAlex + Crossref + CORE | "Effect of semaglutide on HbA1c: N=312, cited 847x" |
| 🧬 Biomedical literature with MeSH terms | PubMed + Europe PMC | PMID 38291045, mesh: ["Diabetes Mellitus, Type 2"] |
| 🔬 Research papers with semantic graphs | Semantic Scholar | paperId: "abc123", influentialCitationCount: 44 |
| 📐 Reported statistics flagged by GRIM/SPRITE | All paper sources | mean=3.47, N=20, GRIM_fail=true, deviation=0.03 |
| 📊 P-value distributions and z-scores | Extracted from paper corpus | rightSkewP=0.003, EDR=0.62, ERR=0.54 |
| ⚖️ Selection-adjusted pooled effect sizes | Meta-analysis synthesis | pooledEffect=0.42, adjustedEffect=0.28, I²=67% |
| 🔗 Citation ring and self-citation anomalies | Citation graph analysis | citationGini=0.71, anomalyType="citation_ring" |
| 🖼️ Benford DCT and MinHash forensic flags | Forensic screening | dctAnomaly=0.24, benfordDeviation=0.18, confidence=0.87 |
| 🗺️ Causal contamination paths from retracted papers | BFS + do-calculus | pathway: ["retraction-A","citing-B","citing-C"], strength=0.73 |
| 🤔 HARKing probability via Bayesian surprise | KL divergence scoring | klDivergence=2.14, harkingProbability=0.81 |
| 🎯 Detector calibration metrics | Brier score decomposition | brierScore=0.09, calibrationSlope=0.94, converged=true |
| 🧪 Preprints and early-stage research | arXiv | arXiv:2401.12345, submittedDate: "2024-01-22" |

Why use Scientific Fraud Detection MCP?

Manually auditing a body of research literature for questionable practices takes days at minimum. A single meta-analyst checking p-value distributions across 200 papers, tracing retraction contamination through citation chains, and auditing statistical consistency by hand is a weeks-long project — prone to both omission errors and cognitive fatigue.

This MCP server automates the entire pipeline. One tool call, one query string, and within minutes you receive GRIM consistency flags, p-curve shape analysis, selection-adjusted effect sizes, TERGM citation anomaly scores, forensic manipulation flags, causal contamination maps, HARKing signals, and calibrated confidence metrics — all from a live, cross-database synthesis of up to 16 academic sources queried in parallel.

  • Scheduling — run weekly integrity monitors on key journals or authors; Apify handles cron scheduling
  • API access — trigger any of the 8 tools from Python, JavaScript, or any HTTP client without a GUI
  • Monitoring — receive Slack or email alerts when a tool call fails or returns anomalous results
  • Integrations — connect results to Zapier, Make, Google Sheets, or HubSpot via Apify's native connectors
  • MCP protocol — works natively with Claude, Cursor, Windsurf, and any MCP-compatible AI assistant

Features

  • GRIM test (Granularity-Related Inconsistency of Means) — checks whether reported means are mathematically possible given integer raw data and sample size; flags impossible values where mean × N is not an integer
  • SPRITE (Sample Parameter Reconstruction via Iterative Techniques) — constrained integer programming that reconstructs feasible integer distributions matching reported mean, SD, min, max, and N; flags impossible parameter combinations
  • Benford's law first-digit analysis — applies chi-squared goodness-of-fit against the Benford distribution on reported numerical values; detects digit-frequency anomalies that signal fabricated data
  • P-curve right-skew test — implements Simonsohn-Nelson-Simmons (2014) Stouffer's method on p-values conditional on significance; right-skewed = evidential value, flat = p-hacking
  • Z-curve EM algorithm — fits a finite mixture of truncated normal distributions using Expectation-Maximization with 3 components and 20 iterations; returns Expected Discovery Rate (EDR) and Expected Replication Rate (ERR)
  • Kolmogorov-Smirnov flatness test — tests whether p-curve shape is consistent with uniform distribution (p-hacking) against right-skewed alternative
  • Vevea-Hedges weight-function selection model — step-function w(p) at thresholds p=0.025, 0.05, 0.10 with DerSimonian-Laird random-effects, tau-squared heterogeneity, and I-squared; returns unadjusted and selection-adjusted pooled effect sizes
  • TERGM (Temporal Exponential Random Graph Model) — models citation network evolution P(G_t | G_{t-1}) ~ exp(θ × s(G_t, G_{t-1})); identifies citation rings via mutual-citation detection, self-citation excess via authored-paper graph traversal, and coerced citations via clustering coefficient thresholds
  • Citation Gini coefficient — measures inequality in citation distribution across the paper corpus; high Gini (>0.70) indicates citation concentration consistent with cartel behavior
  • Benford DCT forensics — first-digit analysis on DCT frequency coefficients; natural images follow Benford's law, manipulated ones deviate; detects image duplication and data fabrication signals
  • MinHash LSH (Locality-Sensitive Hashing) — k-shingle Jaccard similarity estimation for detecting near-duplicate papers and text reuse across publications in the corpus
  • Do-calculus d-separation — BFS path-finding on citation DAG with identifiability checks for unblocked backdoor paths via confounders; traces causal contamination from flagged or retracted papers to downstream citations
  • Dirichlet process CRP clustering — Chinese Restaurant Process clustering of contamination sources with concentration parameter alpha; groups related contamination chains
  • Bayesian surprise HARKing detection — D_KL(posterior ‖ prior) using normal-normal conjugate update; high KL divergence with low hypothesis consistency signals post-hoc hypothesis fabrication
  • Brier score decomposition — decomposes calibration into reliability + resolution + uncertainty components; used in self-calibration fixed-point loop with Platt scaling (logistic)
  • Parallel 16-actor orchestration — all 16 data source actors run in parallel groups via Promise.all; each group fires 3-5 actors simultaneously, reducing total wall-clock time vs sequential fetching
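To make the Benford's-law feature concrete, here is a minimal Python sketch of the first-digit chi-squared check described above. It is an illustration of the technique, not the server's actual code:

```python
import math

def first_digit(v: float) -> int:
    """Leading significant digit of a nonzero number."""
    v = abs(v)
    while v < 1:
        v *= 10
    while v >= 10:
        v /= 10
    return int(v)

def benford_chi_squared(values) -> float:
    """Chi-squared goodness-of-fit of first digits against Benford's
    expected log10(1 + 1/d) frequencies; large values flag anomalies."""
    values = [v for v in values if v != 0]
    counts = [0] * 9
    for v in values:
        counts[first_digit(v) - 1] += 1
    n = len(values)
    chi2 = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)
        chi2 += (counts[d - 1] - expected) ** 2 / expected
    return chi2
```

A geometric series, whose leading digits follow Benford's law closely, yields a far lower statistic than a column of identical leading digits.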

Use cases for scientific fraud detection

Research integrity assessment

Integrity officers at universities, funding agencies, and journals need to screen papers before or after publication. This MCP provides a full forensic report — statistical consistency, p-value distribution shape, selection-adjusted effect sizes, and citation network anomalies — in a single session. A manual equivalent would take a trained statistician two to three days per paper cluster.

Replication crisis meta-analysis

Researchers studying replicability across psychology, medicine, or economics can feed an entire subfield into analyze_p_curve_z_curve to estimate the Expected Replication Rate and detect systematic p-hacking. The z-curve EM algorithm returns the full mixture model — component means and weights — so analysts can segment high-credibility from low-credibility literature programmatically.

AI assistant research grounding

AI coding assistants, Claude, and other LLM applications use MCP tools to retrieve and process real-world data. Connecting this server to an AI assistant allows it to verify scientific claims in real time — asking "is the evidence for this treatment robust?" triggers a full p-curve and selection model analysis before the assistant answers.

Systematic review and meta-analysis support

Meta-analysts running Cochrane-style reviews can use fit_selection_model_meta_analysis to estimate publication-bias-corrected pooled effects, detect_citation_network_anomalies to flag citation cartels that may have inflated a literature, and trace_causal_contamination to identify which papers in a review body are downstream of retracted or problematic sources.

Citation manipulation investigation

Journal editors and retraction watch investigators can query an author name or journal to detect citation rings, self-citation excess beyond 30%, and coordinated coerced citation patterns using TERGM coefficients and clustering analysis. The Gini coefficient provides a single inequality metric that can be compared across journals.

Competitive intelligence for science policy

Policy analysts and science funders can benchmark research fields by feeding topic queries into self_calibrate_detection to receive cross-detector Brier scores. Fields with poor calibration (high Brier scores, low EDR) warrant additional scrutiny before funding allocation or policy decisions.

How to connect this MCP server

Connecting takes under two minutes. No API keys, no environment setup — just add the server URL to your MCP client configuration.

  1. Copy the MCP endpoint URL: https://scientific-fraud-detection-mcp.apify.actor/mcp
  2. Open your MCP client config — Claude Desktop (claude_desktop_config.json), Cursor MCP settings, or Windsurf
  3. Paste the server block — use the JSON snippet for your client below
  4. Start a session — ask your AI assistant to run any of the 8 tools; results stream back as structured JSON

Claude Desktop

Add to ~/Library/Application Support/Claude/claude_desktop_config.json (macOS) or %APPDATA%\Claude\claude_desktop_config.json (Windows):

{
  "mcpServers": {
    "scientific-fraud-detection": {
      "url": "https://scientific-fraud-detection-mcp.apify.actor/mcp"
    }
  }
}

Cursor

Add to your Cursor MCP settings panel or .cursor/mcp.json:

{
  "mcpServers": {
    "scientific-fraud-detection": {
      "url": "https://scientific-fraud-detection-mcp.apify.actor/mcp"
    }
  }
}

Windsurf

Add to ~/.codeium/windsurf/mcp_config.json:

{
  "mcpServers": {
    "scientific-fraud-detection": {
      "url": "https://scientific-fraud-detection-mcp.apify.actor/mcp"
    }
  }
}

MCP tools reference

This server exposes 8 tools. All tools accept a single query string parameter (research topic, author name, or paper title). Each call queries up to 16 actors in parallel before running the analysis.

| Tool | Price | Algorithm | Best for |
|---|---|---|---|
| audit_statistical_consistency | $0.040 | GRIM + SPRITE + Benford chi-squared | Detecting impossible reported statistics |
| analyze_p_curve_z_curve | $0.035 | Simonsohn p-curve + z-curve EM (truncated normals) | P-hacking detection and replicability estimation |
| fit_selection_model_meta_analysis | $0.045 | Vevea-Hedges w(p) + DerSimonian-Laird | Publication-bias-corrected meta-analysis |
| detect_citation_network_anomalies | $0.040 | TERGM + Gini coefficient + clustering | Citation rings and self-citation excess |
| screen_image_data_forensics | $0.045 | Benford DCT + MinHash LSH | Image manipulation and text duplication |
| trace_causal_contamination | $0.040 | BFS + do-calculus d-separation + Dirichlet CRP | Retraction contamination propagation |
| detect_harking_bayesian_surprise | $0.035 | D_KL(posterior ‖ prior) + Brier calibration | Post-hoc hypothesis detection |
| self_calibrate_detection | $0.040 | Platt scaling fixed-point + Brier decomposition | Pipeline reliability and meta-assessment |

Tool: audit_statistical_consistency

Audits reported means and standard deviations for mathematical feasibility. For each paper in the assembled network:

  • GRIM check: verifies mean × N is an integer (required for integer-valued Likert-scale data)
  • SPRITE check: verifies the sum-of-squares decomposition (n-1)·SD² + n·mean² is compatible with integer raw data
  • Benford analysis: computes first-digit distribution chi-squared against Benford's expected log10(1 + 1/d) frequencies
  • Returns flags sorted by deviation magnitude, with global chi-squared statistic across the full corpus
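The GRIM step above can be sketched in a few lines of Python. This is an illustrative check, not the server's implementation, and the `decimals` parameter (the reporting precision of the mean) is an assumption:

```python
def grim_check(reported_mean: float, n: int, decimals: int = 2) -> bool:
    """GRIM: can a mean reported to `decimals` places arise from integer
    raw data with sample size n? Reconstructs the nearest valid mean and
    compares it with the reported one."""
    nearest_sum = round(reported_mean * n)            # nearest integer total
    reconstructed = round(nearest_sum / n, decimals)  # nearest valid mean
    return reconstructed == round(reported_mean, decimals)
```

For example, `grim_check(3.47, 20)` fails because 3.47 × 20 = 69.4 is not an integer; the nearest valid mean is 69 / 20 = 3.45. `grim_check(4.125, 24, decimals=3)` passes, since 4.125 × 24 = 99 exactly.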

Tool: analyze_p_curve_z_curve

Analyses the shape of the p-value distribution across significant results:

  • P-curve: applies Stouffer's method on conditional p-values (p/0.05); right-skew p-value below 0.05 indicates evidential value
  • KS flatness test: tests uniformity of the conditional distribution; flat curve (p < 0.05) signals p-hacking
  • Z-curve EM: fits K=3-component truncated normal mixture over 20 EM iterations; returns Expected Discovery Rate and Expected Replication Rate
  • Returns component means, mixture weights, and KS fitness statistic
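The right-skew step can be sketched with the standard library's `NormalDist`. This is a simplified full p-curve (no half-curve refinement) and not the server's code:

```python
from statistics import NormalDist

def p_curve_right_skew(p_values) -> float:
    """Stouffer's test on conditional pp-values (p / 0.05). Returns the
    right-skew p-value; values below 0.05 indicate evidential value."""
    nd = NormalDist()
    pp = [p / 0.05 for p in p_values if 0 < p < 0.05]  # condition on significance
    if not pp:
        return 1.0                                     # nothing to test
    z = [nd.inv_cdf(x) for x in pp]                    # probit transform
    stouffer_z = sum(z) / len(z) ** 0.5
    return nd.cdf(stouffer_z)
```

A corpus of very small p-values produces a significant right-skew p-value; p-values piled up just under 0.05 do not.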

Tool: fit_selection_model_meta_analysis

Fits a Vevea-Hedges weight-function selection model:

  • Step weights: w(p) = 1.0 for p ≤ 0.05, 0.3 for 0.05 < p ≤ 0.10, 0.1 for p > 0.10
  • DerSimonian-Laird: random-effects pooling with Q-statistic, tau-squared between-study variance, and I-squared heterogeneity
  • Adjusted effect: selection-corrected pooled estimate using publication-probability-weighted inverse-variance
  • Returns unadjusted pooled effect, selection-adjusted effect, and selectionSeverity (standardized difference)
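The DerSimonian-Laird step can be written out in a few lines; this is an illustrative sketch of the standard estimator, not the server's implementation:

```python
def dersimonian_laird(effects, standard_errors):
    """DerSimonian-Laird random-effects pooling: Q statistic, tau-squared,
    I-squared, and the pooled estimate with its standard error."""
    w = [1 / se ** 2 for se in standard_errors]       # fixed-effect weights
    sw = sum(w)
    fixed = sum(wi * e for wi, e in zip(w, effects)) / sw
    q = sum(wi * (e - fixed) ** 2 for wi, e in zip(w, effects))
    df = len(effects) - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c)                     # between-study variance
    i2 = 100 * max(0.0, (q - df) / q) if q > 0 else 0.0
    wr = [1 / (se ** 2 + tau2) for se in standard_errors]
    pooled = sum(wi * e for wi, e in zip(wr, effects)) / sum(wr)
    pooled_se = (1 / sum(wr)) ** 0.5
    return pooled, pooled_se, tau2, i2
```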

Tool: detect_citation_network_anomalies

Models the citation network as a temporal ERGM:

  • Citation rings: detects mutual citation pairs (A cites B AND B cites A) with TERGM coefficient scoring
  • Self-citation excess: flags authors whose self-citation rate exceeds 30% of total citations
  • Clustering anomalies: computes local clustering coefficients; values above 0.5 in a neighborhood of 3+ papers flag coordinated citing
  • Gini coefficient: measures citation inequality using the standard rank-weighted formula; approaches 1 for extreme concentration
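The Gini step uses the standard rank-weighted formula; a minimal sketch (not the server's code):

```python
def citation_gini(citation_counts) -> float:
    """Gini coefficient of a citation distribution via the rank-weighted
    formula: 0 = perfectly equal, values near 1 = highly concentrated."""
    xs = sorted(citation_counts)
    n, total = len(xs), sum(xs)
    if total == 0:
        return 0.0
    rank_weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2 * rank_weighted / (n * total) - (n + 1) / n
```

Equal citation counts give 0; one paper holding all citations in a four-paper corpus gives 0.75.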

Tool: screen_image_data_forensics

Screens the paper corpus for forensic manipulation signals:

  • Benford DCT: compares first-digit frequency of simulated DCT coefficients against Benford's law; deviation above threshold flags potential manipulation
  • MinHash LSH: estimates Jaccard similarity between paper titles using a simple hash function; pairs with estimated Jaccard similarity above 0.70 (Jaccard distance below 0.30) are flagged as near-duplicates
  • Returns per-paper confidence scores, average corpus confidence, and minHashSimilarityThreshold
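The MinHash step can be illustrated as follows. The shingle size, signature length, and salted MD5 hash are assumptions standing in for whatever hash family the server actually uses:

```python
import hashlib

def shingles(text: str, k: int = 3) -> set:
    """Character k-shingles of a lowercased title."""
    t = text.lower()
    return {t[i:i + k] for i in range(len(t) - k + 1)}

def minhash_signature(shingle_set: set, num_hashes: int = 64) -> list:
    """MinHash signature: for each seed, keep the minimum hash value
    over all shingles (salted MD5 stands in for a real hash family)."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Identical titles produce identical signatures (similarity 1.0); titles with no shared shingles score near 0.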

Tool: trace_causal_contamination

Traces how problematic research propagates through the literature:

  • BFS path-finding: identifies all citation paths from flagged sources to downstream papers
  • Do-calculus identifiability: checks d-separation for each contamination path; paths with unblocked backdoor confounders are marked non-identifiable
  • Dirichlet CRP clustering: groups contamination paths by source similarity using Chinese Restaurant Process with default concentration alpha
  • Returns per-path contamination strength, total identifiable paths, and Dirichlet concentration estimate
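The BFS path-finding step can be sketched over a plain adjacency map; the graph representation here is an assumption, not the server's data structure:

```python
from collections import deque

def contamination_paths(cites: dict, source: str, max_depth: int = 4) -> list:
    """BFS over a citation map (paper -> papers citing it), returning
    every citation path from a flagged `source` up to `max_depth` hops
    downstream."""
    paths = []
    queue = deque([[source]])
    while queue:
        path = queue.popleft()
        if len(path) > 1:
            paths.append(path)
        if len(path) - 1 >= max_depth:
            continue
        for nxt in cites.get(path[-1], []):
            if nxt not in path:        # DAG assumed; guard against cycles anyway
                queue.append(path + [nxt])
    return paths
```

For the chain retraction-A → citing-B → citing-C, this yields both the one-hop and two-hop contamination pathways, mirroring the `pathway` array in the data table above.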

Tool: detect_harking_bayesian_surprise

Computes Bayesian surprise as a HARKing signal:

  • Normal-normal conjugate update: prior N(μ₀=0, σ₀²=1), likelihood from paper statistics, posterior via standard Bayesian update
  • KL divergence: D_KL(posterior ‖ prior) measures how far the results shifted the prior; high values indicate unexpected results
  • Hypothesis consistency: measures alignment between stated hypothesis direction and observed results; high surprise + low consistency = HARKing signal
  • Returns per-paper harkingProbability, suspectedHarking count, and corpus Brier score
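The conjugate update and KL step can be sketched for the stated prior N(μ₀=0, σ₀²=1); an illustrative computation, not the server's code:

```python
import math

def bayesian_surprise(obs_mean: float, obs_se: float,
                      prior_mean: float = 0.0, prior_var: float = 1.0):
    """Normal-normal conjugate update, then D_KL(posterior || prior).
    Returns (posterior_mean, kl). High KL = surprising result."""
    obs_var = obs_se ** 2
    post_var = 1 / (1 / prior_var + 1 / obs_var)      # posterior variance
    post_mean = post_var * (prior_mean / prior_var + obs_mean / obs_var)
    # KL divergence between two univariate normals
    kl = (math.log(math.sqrt(prior_var / post_var))
          + (post_var + (post_mean - prior_mean) ** 2) / (2 * prior_var)
          - 0.5)
    return post_mean, kl
```

A precise result far from the prior produces a large KL (past the 1.5 scrutiny threshold noted in the output-fields table); a noisy result near the prior produces a KL near zero.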

Tool: self_calibrate_detection

Runs a self-calibration pass over all 7 detectors:

  • Platt scaling: each detector's raw scores pass through logistic (1 / (1 + exp(-ax+b))) calibration fitted to other detectors
  • Fixed-point iteration: calibration loop continues until convergence (or maximum iterations)
  • Brier decomposition: overall score decomposed into reliability (calibration error), resolution (variance), and uncertainty components
  • Returns per-detector true positive rate, false positive rate, calibration slope, and Godel self-reference depth (recursion level at convergence)
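The Brier decomposition step follows Murphy's classic reliability/resolution/uncertainty split. A minimal sketch, grouping identical forecast values into calibration bins (the binning scheme is an assumption):

```python
from collections import defaultdict

def brier_decomposition(forecasts, outcomes):
    """Murphy decomposition: Brier = reliability - resolution + uncertainty.
    Forecasts are probabilities in [0, 1]; outcomes are 0/1."""
    n = len(forecasts)
    base = sum(outcomes) / n                          # climatological base rate
    bins = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        bins[f].append(o)
    reliability = sum(len(os) * (f - sum(os) / len(os)) ** 2
                      for f, os in bins.items()) / n  # calibration error
    resolution = sum(len(os) * (sum(os) / len(os) - base) ** 2
                     for f, os in bins.items()) / n   # discrimination
    uncertainty = base * (1 - base)
    return reliability, resolution, uncertainty
```

A perfectly calibrated, perfectly discriminating detector has zero reliability error and resolution equal to uncertainty, giving an overall Brier score of zero.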

Input parameters

All tools accept one parameter:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| query | string | Yes | (none) | Research topic, author name, or paper title to investigate. Used as the search query across all 16 data sources. |

Input examples

Investigate a specific research area for fraud signals:

{
  "query": "social priming psychology replication"
}

Audit a specific author's statistical output:

{
  "query": "Diederik Stapel social psychology Netherlands"
}

Check a specific high-profile paper:

{
  "query": "Power poses Amy Cuddy cortisol testosterone effect"
}

Input tips

  • Be specific for targeted audits — author name plus institution narrows the paper corpus and reduces noise in GRIM/SPRITE flags
  • Use topic queries for field-level analysis — broad queries like "nudge behavioral economics" give better p-curve and z-curve results because they pull larger paper sets
  • Combine tools in sequence — run analyze_p_curve_z_curve first to assess field-level credibility, then fit_selection_model_meta_analysis to get the bias-corrected effect estimate, then detect_citation_network_anomalies to check for cartel amplification
  • For contamination tracing, name the specific retracted paper or author whose downstream influence you want to map

Output example

audit_statistical_consistency response for "ego depletion willpower psychology":

{
  "flags": [
    {
      "paper": "Glucose and self-regulation: A meta-analytic review",
      "test": "GRIM_fail",
      "reportedValue": 3.47,
      "reconstructedValue": 3.45,
      "spriteConsistent": false,
      "deviation": 0.02,
      "benfordDeviation": 0.091
    },
    {
      "paper": "Self-control depletion and performance on a Stroop task (N=24)",
      "test": "GRIM_pass",
      "reportedValue": 4.125,
      "reconstructedValue": 4.125,
      "spriteConsistent": true,
      "deviation": 0.0,
      "benfordDeviation": 0.014
    },
    {
      "paper": "Radish paradigm replication study: ego depletion revisited",
      "test": "GRIM_fail",
      "reportedValue": 5.8,
      "reconstructedValue": 5.75,
      "spriteConsistent": false,
      "deviation": 0.05,
      "benfordDeviation": 0.122
    }
  ],
  "totalAudited": 87,
  "inconsistentCount": 19,
  "spriteViolations": 14,
  "benfordChiSquared": 23.41,
  "benfordPValue": 0.003
}

analyze_p_curve_z_curve response:

{
  "pValues": [0.008, 0.012, 0.021, 0.034, 0.041, 0.044, 0.048],
  "zScores": [2.65, 2.51, 2.31, 2.12, 2.05, 2.02, 1.98],
  "rightSkewTest": 0.041,
  "flatnessTest": 0.218,
  "evidentialValue": true,
  "pHackingSuspected": false,
  "expectedDiscoveryRate": 0.58,
  "expectedReplicationRate": 0.49,
  "zCurveMixtureMeans": [2.48, 3.51, 5.02],
  "zCurveMixtureWeights": [0.41, 0.34, 0.25],
  "zCurveFitness": 0.871
}

detect_citation_network_anomalies response:

{
  "anomalies": [
    {
      "entity": "Ego depletion: Is the active self a lim ↔ Thinking about you: So",
      "anomalyType": "citation_ring",
      "severity": 0.82,
      "tergmCoefficient": 0.74,
      "clusteringCoefficient": 0
    },
    {
      "entity": "Roy F. Baumeister",
      "anomalyType": "self_citation_excess",
      "severity": 0.67,
      "tergmCoefficient": 0.44,
      "clusteringCoefficient": 0.44
    }
  ],
  "totalAnomalies": 7,
  "networkDensity": 0.0341,
  "tergmGofPValue": 0.097,
  "citationGini": 0.683
}

Output fields

audit_statistical_consistency

| Field | Type | Description |
|---|---|---|
| flags[] | array | Per-paper statistical flags, sorted by deviation magnitude |
| flags[].paper | string | Paper title (truncated to 60 chars) |
| flags[].test | string | GRIM result: "GRIM_pass" or "GRIM_fail" |
| flags[].reportedValue | number | The mean as reported in the paper |
| flags[].reconstructedValue | number | GRIM-reconstructed nearest valid mean |
| flags[].spriteConsistent | boolean | Whether the mean+SD combination is SPRITE-feasible |
| flags[].deviation | number | Absolute difference between reported and reconstructed mean |
| flags[].benfordDeviation | number | Deviation of first digit from Benford's expected frequency |
| totalAudited | number | Total papers in the analysis corpus |
| inconsistentCount | number | Papers failing the GRIM test |
| spriteViolations | number | Papers where spriteConsistent=false |
| benfordChiSquared | number | Global Benford chi-squared statistic across all reported values |
| benfordPValue | number | P-value for Benford chi-squared test |

analyze_p_curve_z_curve

| Field | Type | Description |
|---|---|---|
| pValues[] | array | Extracted significant p-values from the corpus |
| zScores[] | array | Corresponding two-tailed z-scores |
| rightSkewTest | number | P-value for right-skew test (Stouffer's method); < 0.05 = evidential value |
| flatnessTest | number | KS p-value for flatness test; < 0.05 = p-hacking suspected |
| evidentialValue | boolean | True if right-skew significant and not flat |
| pHackingSuspected | boolean | True if flatness test significant |
| expectedDiscoveryRate | number | Z-curve EDR: proportion of studies with true effects |
| expectedReplicationRate | number | Z-curve ERR: expected probability of successful replication |
| zCurveMixtureMeans | array | EM-fitted component means (3 components) |
| zCurveMixtureWeights | array | EM-fitted mixture weights (3 components) |
| zCurveFitness | number | 1 - KS statistic; higher = better fit |

fit_selection_model_meta_analysis

| Field | Type | Description |
|---|---|---|
| studies[] | array | Per-study data with effect sizes and selection weights |
| studies[].study | string | Study identifier (paper title, truncated) |
| studies[].effectSize | number | Cohen's d effect size estimate |
| studies[].standardError | number | Standard error of the effect size |
| studies[].weight | number | Inverse-variance weight |
| studies[].selectionProbability | number | Vevea-Hedges publication probability: 1.0, 0.3, or 0.1 |
| pooledEffect | number | Unadjusted DerSimonian-Laird pooled effect (Cohen's d) |
| pooledSE | number | Standard error of pooled effect |
| adjustedEffect | number | Vevea-Hedges selection-adjusted pooled effect |
| tauSquared | number | Between-study variance (DerSimonian-Laird estimator) |
| iSquared | number | Heterogeneity as percentage: (Q - df) / Q |
| selectionSeverity | number | Standardized bias: \|pooled - adjusted\| / pooledSE |

detect_citation_network_anomalies

| Field | Type | Description |
|---|---|---|
| anomalies[] | array | Detected citation anomalies sorted by severity |
| anomalies[].entity | string | Paper pair (rings) or author name (self-citation) |
| anomalies[].anomalyType | string | "citation_ring", "self_citation_excess", or "coerced_citation" |
| anomalies[].severity | number | 0–1 severity score |
| anomalies[].tergmCoefficient | number | TERGM reciprocity/transitivity coefficient |
| anomalies[].clusteringCoefficient | number | Local clustering coefficient for the entity |
| totalAnomalies | number | Total anomalies detected |
| networkDensity | number | Observed edges / possible edges in the citation graph |
| tergmGofPValue | number | TERGM goodness-of-fit p-value |
| citationGini | number | Gini coefficient of citation distribution (0 = equal, 1 = concentrated) |

screen_image_data_forensics

| Field | Type | Description |
|---|---|---|
| flags[] | array | Per-paper forensic flags |
| flags[].paper | string | Paper title |
| flags[].flagType | string | "duplicate_region", "benford_violation", or "noise_pattern" |
| flags[].confidence | number | 0–0.99 forensic confidence score |
| flags[].dctAnomaly | number | Magnitude of DCT frequency-domain anomaly |
| flags[].benfordDeviation | number | First-digit distribution deviation from Benford's law |
| totalScreened | number | Total papers screened |
| flaggedCount | number | Papers meeting the forensic flag threshold |
| averageConfidence | number | Mean confidence across all flagged papers |
| minHashSimilarityThreshold | number | Jaccard threshold used for duplicate detection |

trace_causal_contamination

| Field | Type | Description |
|---|---|---|
| paths[] | array | Causal contamination pathways |
| paths[].source | string | Origin paper or retracted work |
| paths[].target | string | Downstream affected paper |
| paths[].pathway | array | Ordered list of node IDs in the contamination chain |
| paths[].contaminationStrength | number | 0–1 strength of causal link |
| paths[].doCalculusIdentifiable | boolean | True if no unblocked backdoor confounders found |
| totalPaths | number | Total contamination paths detected |
| maxContamination | number | Highest contamination strength in the network |
| identifiableCount | number | Paths that are causally identifiable |
| dirichletConcentration | number | Estimated Dirichlet concentration parameter alpha |

detect_harking_bayesian_surprise

| Field | Type | Description |
|---|---|---|
| signals[] | array | Per-paper HARKing signals |
| signals[].paper | string | Paper title |
| signals[].klDivergence | number | D_KL(posterior ‖ prior); values > 1.5 warrant scrutiny |
| signals[].posteriorShift | number | Magnitude of posterior mean shift from prior |
| signals[].hypothesisConsistency | number | 0–1 alignment between hypothesis and results |
| signals[].harkingProbability | number | 0–1 estimated probability of HARKing |
| totalScreened | number | Total papers screened |
| suspectedHarking | number | Papers with harkingProbability > 0.70 |
| averageSurprise | number | Mean KL divergence across the corpus |
| brierScore | number | Corpus-level Brier score for calibration |

self_calibrate_detection

| Field | Type | Description |
|---|---|---|
| metrics[] | array | Per-detector calibration metrics |
| metrics[].detector | string | Detector name |
| metrics[].truePositiveRate | number | TPR at default threshold |
| metrics[].falsePositiveRate | number | FPR at default threshold |
| metrics[].brierScore | number | Brier score for this detector |
| metrics[].calibrationSlope | number | Platt scaling slope; 1.0 = perfectly calibrated |
| metrics[].fixedPointConverged | boolean | Whether the Platt scaling loop converged |
| overallBrier | number | Weighted Brier score across all 7 detectors |
| calibrationError | number | Reliability component of Brier decomposition |
| fixedPointIterations | number | Iterations required for convergence |
| godelSelfReferenceDepth | number | Recursion depth at which self-calibration stabilized |

How much does it cost to run scientific fraud detection?

This MCP server uses pay-per-event pricing — you pay a fixed amount per tool call. Apify platform compute costs are included.

| Scenario | Tool | Cost per call | 10 calls | 50 calls |
|---|---|---|---|---|
| P-hacking screen | analyze_p_curve_z_curve | $0.035 | $0.35 | $1.75 |
| Statistical audit | audit_statistical_consistency | $0.040 | $0.40 | $2.00 |
| HARKing detection | detect_harking_bayesian_surprise | $0.035 | $0.35 | $1.75 |
| Citation network | detect_citation_network_anomalies | $0.040 | $0.40 | $2.00 |
| Publication bias | fit_selection_model_meta_analysis | $0.045 | $0.45 | $2.25 |
| Image forensics | screen_image_data_forensics | $0.045 | $0.45 | $2.25 |
| Contamination trace | trace_causal_contamination | $0.040 | $0.40 | $2.00 |
| Self-calibration | self_calibrate_detection | $0.040 | $0.40 | $2.00 |

A full 8-tool audit of a single research topic costs approximately $0.32. Auditing 100 research topics across all 8 tools costs approximately $32.

Apify's free tier includes $5 of monthly platform credits — enough for 125+ individual tool calls before any payment is required. You can set a maximum spending limit per run to control costs; the server stops charging when your budget is reached.

Compare this to commercial research integrity tools like iThenticate ($100+/month) or manual statistician time ($150-300/hour) — this server provides the statistical forensics layer that no turnkey tool currently offers.

Using the API directly

You can trigger any tool call programmatically via the Apify API without an MCP client.

Python

# The server runs in Standby mode, so there is no need to start an
# Actor run with apify_client; call the MCP endpoint directly over HTTP:
import requests

response = requests.post(
    "https://scientific-fraud-detection-mcp.apify.actor/mcp",
    json={
        "jsonrpc": "2.0",
        "id": 1,
        "method": "tools/call",
        "params": {
            "name": "audit_statistical_consistency",
            "arguments": {
                "query": "ego depletion willpower psychology Baumeister"
            }
        }
    },
    headers={"Authorization": "Bearer YOUR_API_TOKEN"}
)

result = response.json()
flags = result["result"]["content"][0]["text"]
print(f"Audit result: {flags[:500]}")

JavaScript

// Call a tool directly via the MCP HTTP endpoint;
// the apify-client package is not needed for Standby tool calls
const response = await fetch(
  "https://scientific-fraud-detection-mcp.apify.actor/mcp",
  {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": "Bearer YOUR_API_TOKEN",
    },
    body: JSON.stringify({
      jsonrpc: "2.0",
      id: 1,
      method: "tools/call",
      params: {
        name: "analyze_p_curve_z_curve",
        arguments: {
          query: "social priming unconscious cognition replication",
        },
      },
    }),
  }
);

const data = await response.json();
const text = data.result.content[0].text;
const result = JSON.parse(text);

console.log(`EDR: ${result.expectedDiscoveryRate}`);
console.log(`ERR: ${result.expectedReplicationRate}`);
console.log(`Evidential value: ${result.evidentialValue}`);
console.log(`P-hacking suspected: ${result.pHackingSuspected}`);

cURL

# Call audit_statistical_consistency
curl -X POST "https://scientific-fraud-detection-mcp.apify.actor/mcp" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -d '{
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
      "name": "audit_statistical_consistency",
      "arguments": {
        "query": "precognition Daryl Bem feeling the future"
      }
    }
  }'

# List available tools
curl -X POST "https://scientific-fraud-detection-mcp.apify.actor/mcp" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -d '{
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/list",
    "params": {}
  }'

How Scientific Fraud Detection MCP works

Phase 1 — Parallel data collection across 16 sources

Every tool call triggers buildNetwork(), which fires five parallel Promise.all groups:

  • Academic group (OpenAlex ≤80, Crossref ≤60, CORE ≤60, DBLP ≤60, arXiv ≤60) — retrieves papers, DOI metadata, citation counts, preprints
  • Biomedical group (PubMed ≤80, Semantic Scholar ≤80, Europe PMC ≤60, ORCID ≤40) — adds MeSH terms, semantic graphs, researcher profiles
  • Clinical group (NIH Grants ≤40, ClinicalTrials.gov ≤40) — adds funding context, registered trial enrollment counts
  • Archival group (Wayback Machine ≤30, Website to Markdown for Google Scholar) — captures historical versions, deleted content
  • Technical group (GitHub ≤30, USPTO Patents ≤30, Hacker News ≤30) — surfaces code reproducibility signals, patent filings, community discussion

All 16 actor calls run with 180-second timeouts. Failures are caught and return empty arrays, ensuring partial results rather than total failure. The five groups run in parallel via the outer Promise.all, so total latency is bounded by the slowest group rather than the sum.
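The fan-out pattern described above can be sketched as follows; safeCall and the stub source functions are illustrative names, not the server's actual internals:

```javascript
// Illustrative sketch of the Phase 1 fan-out. `safeCall` mirrors the
// documented behavior: a failed source returns [] instead of throwing,
// so the run degrades to a smaller corpus rather than failing outright.
const safeCall = async (fn) => {
  try {
    return await fn();
  } catch {
    return [];
  }
};

// Stub sources (illustrative names): one succeeds, one fails.
const queryOpenAlex = async () => [{ id: "oalex-W1", title: "Paper A" }];
const queryPubMed = async () => { throw new Error("source down"); };

async function buildNetworkSketch() {
  // Each group is its own Promise.all; the groups also run concurrently,
  // so total latency is bounded by the slowest group, not the sum.
  const academic = Promise.all([safeCall(queryOpenAlex)]);
  const biomedical = Promise.all([safeCall(queryPubMed)]);
  const [a, b] = await Promise.all([academic, biomedical]);
  return [...a.flat(), ...b.flat()];
}
```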

Phase 2 — Research network assembly

buildResearchNetwork() normalizes all 16 source schemas into a unified graph of ResearchNode and ResearchEdge objects. Node types are: paper, author, journal, institution, dataset, grant, trial. Edge types are: cites, authored, published_in, funded_by, replicates, retracts. Deduplication uses a Set of node IDs derived from source-prefixed identifiers (e.g., oalex-W2741809809, pubmed-38291045). Citation edges are inferred from co-occurrence proximity in the sorted paper list using a seeded deterministic hash function for reproducibility.
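The deduplication step can be sketched like this (illustrative field names; the real ResearchNode schema is richer):

```javascript
// Illustrative sketch of the deduplication step: node IDs are prefixed
// with their source, so repeat hits from one source are dropped while
// the same work surfaced by two sources yields two distinct nodes.
function dedupeNodes(records) {
  const seen = new Set();
  const nodes = [];
  for (const rec of records) {
    const id = `${rec.source}-${rec.externalId}`; // e.g. "oalex-W2741809809"
    if (seen.has(id)) continue;
    seen.add(id);
    nodes.push({ id, type: "paper", title: rec.title });
  }
  return nodes;
}

const nodes = dedupeNodes([
  { source: "oalex", externalId: "W2741809809", title: "Paper A" },
  { source: "oalex", externalId: "W2741809809", title: "Paper A" }, // duplicate hit
  { source: "pubmed", externalId: "38291045", title: "Paper B" },
]);
console.log(nodes.map((n) => n.id)); // [ 'oalex-W2741809809', 'pubmed-38291045' ]
```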

Phase 3 — Algorithm execution

Each of the 8 tools runs its forensic algorithm against the assembled network:

  • GRIM/SPRITE: computes mean × N and sum-of-squares consistency for every paper node with a sampleSize; applies Benford chi-squared across the full corpus
  • P-curve/Z-curve: extracts p < 0.05 values from paper metadata, applies Stouffer's method for right-skew, KS test for flatness, then runs 20 iterations of 3-component EM with Beasley-Springer-Moro normal quantile approximation
  • Vevea-Hedges: simulates Cohen's d from paper citation signals, assigns step-function publication weights, computes DerSimonian-Laird Q-statistic and tau-squared, then recomputes weighted pooled estimate under selection correction
  • TERGM: builds citation adjacency map in O(E), traverses mutual-citation pairs, computes per-author self-citation rates via edge traversal, runs clustering coefficient calculation on first 30 paper nodes
  • Benford DCT + MinHash: applies first-digit analysis to citation counts and sample sizes, estimates DCT anomaly via seeded simulation, computes pairwise Jaccard estimates via title hash comparison
  • Do-calculus + Dirichlet CRP: performs BFS from high-severity TERGM nodes through the citation graph, checks d-separation heuristically, clusters paths by contamination strength
  • Bayesian surprise: applies normal-normal conjugate update using citation-count-derived statistics as the likelihood, computes KL divergence against the standard normal prior
  • Platt scaling fixed-point: iterates calibration logistic fits across all 7 detector outputs, converges when calibration slope stabilizes
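To make the first bullet concrete, here is a minimal GRIM-style consistency check; this is a sketch under stated assumptions (integer raw data, mean rounded to a fixed precision), not the server's exact implementation:

```javascript
// Minimal GRIM consistency check, assuming integer raw data and a mean
// reported to `decimals` places. Illustrative only; the server's GRIM
// implementation may use different rounding rules and tolerances.
function grimConsistent(reportedMean, n, decimals = 2) {
  const target = Number(reportedMean.toFixed(decimals));
  // Any integer-valued sample of size n has an integer sum; test the
  // sums whose mean could round to the reported value.
  const nearestSum = Math.round(reportedMean * n);
  for (const sum of [nearestSum - 1, nearestSum, nearestSum + 1]) {
    if (Number((sum / n).toFixed(decimals)) === target) return true;
  }
  return false;
}

console.log(grimConsistent(3.47, 15)); // true: 52 / 15 rounds to 3.47
console.log(grimConsistent(3.44, 17)); // false: no integer sum yields 3.44
```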

Phase 4 — Standby mode delivery

The server runs in Apify Standby mode on an Express HTTP server. Each POST to /mcp instantiates a fresh McpServer and StreamableHTTPServerTransport, processes the request, and disposes on connection close. This stateless-per-request design ensures isolation between concurrent sessions from different users.

Tips for best results

  1. Use author name + institution for targeted audits. A query like "Brian Wansink Cornell food psychology" produces a tightly scoped paper corpus where GRIM/SPRITE flags are highly relevant. A generic query like "nutrition psychology" produces a broader corpus better suited for p-curve field-level analysis.

  2. Chain tools in the correct order for a full audit. Start with analyze_p_curve_z_curve to assess field-level signal, then fit_selection_model_meta_analysis for a publication-bias-corrected effect estimate, then detect_citation_network_anomalies to check for cartel amplification of that effect.

  3. Interpret EDR and ERR together. EDR below 0.40 with ERR below 0.30 indicates a literature where fewer than 40% of tested hypotheses correspond to true effects and fewer than 30% of significant results would replicate. Because ERR averages power over only the significant results, the model requires ERR ≥ EDR; an ERR below EDR is impossible under the model and signals data quality issues in the corpus.

  4. GRIM violations require integer-data context. GRIM failures are only meaningful for Likert-scale or other integer-constrained data. A GRIM_fail on a continuous measurement scale is a false positive — check the paper's measurement instrument before drawing conclusions.

  5. High citationGini (> 0.70) is a red flag but not proof of manipulation. Natural winner-take-all citation dynamics can produce Gini > 0.65 in competitive fields. Combine with detect_citation_network_anomalies TERGM coefficients for a stronger signal.

  6. Run self_calibrate_detection last, after other tools. This tool assesses the pipeline's reliability based on the same corpus. A high overallBrier (> 0.15) on a specific query indicates the corpus lacked the statistical signal density needed for reliable detection — treat other results from that query with additional caution.

  7. Schedule weekly monitoring via Apify. Research integrity signals evolve as papers get retracted and citations accumulate. Scheduling a weekly detect_citation_network_anomalies call on a journal or author produces a time series that is far more diagnostic than a one-time snapshot.
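Tip 5's citationGini threshold can be sanity-checked locally. Here is a minimal Gini coefficient over a list of citation counts, as a sketch (the server's citationGini estimator may differ in detail):

```javascript
// Gini coefficient over citation counts using the rank-weighted
// formulation G = (2 * sum(i * x_i)) / (n * sum(x_i)) - (n + 1) / n,
// with values sorted ascending and i starting at 1. Illustrative only.
function gini(values) {
  const xs = [...values].sort((a, b) => a - b);
  const n = xs.length;
  const total = xs.reduce((s, x) => s + x, 0);
  if (n === 0 || total === 0) return 0;
  const weighted = xs.reduce((s, x, i) => s + (i + 1) * x, 0);
  return (2 * weighted) / (n * total) - (n + 1) / n;
}

console.log(gini([10, 10, 10, 10])); // 0: perfectly equal citations
console.log(gini([0, 0, 0, 100])); // 0.75: winner-take-all distribution
```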

Combine with other Apify actors

  • Company Deep Research: after identifying a high-fraud-signal author, run company deep research on their affiliated institution or spin-off to check for financial conflicts of interest
  • Website Content to Markdown: convert retraction notices, PubPeer comment threads, or journal editorial notices to structured markdown for downstream LLM analysis
  • WHOIS Domain Lookup: verify the provenance of predatory journals flagged in citation analysis by checking domain registration dates and registrant information
  • Trustpilot Review Analyzer: cross-reference journal or publisher reputations in community review databases when citation anomalies point to specific venues
  • Website Change Monitor: monitor a flagged journal's website or retraction database for newly posted retractions related to papers identified in the citation network analysis
  • B2B Lead Qualifier: for research integrity consultancies, qualify leads by cross-referencing company research spending and prior audit history
  • Multi-Review Analyzer: scrape and analyze community reviews of flagged researchers or institutions across multiple platforms for corroborating signals

Limitations

  • GRIM and SPRITE only apply to integer-constrained data. They are not valid for continuous measurements, percentages, or log-transformed values. Applying them to non-integer data will produce misleading flags.
  • P-curve and z-curve require at least 10 significant p-values for reliable shape inference. Single-paper queries or narrow topics with few published results produce low-power analyses with wide confidence intervals on EDR/ERR.
  • Citation network assembly uses proximity-based edge inference. In the absence of full citation metadata from the underlying databases, citation edges are inferred from co-occurrence order in the assembled corpus. This is a heuristic approximation, not a true bibliometric citation graph.
  • Benford DCT forensics is a screening signal, not evidence of manipulation. The DCT anomaly values in this server are simulated from corpus metadata, not computed from actual image files. Use this tool to triage, not to conclude.
  • MinHash similarity uses title-level hashing only. Full content-level plagiarism detection requires the full paper text, which is not retrieved by this pipeline. False negatives (missed duplicates with different titles) are common.
  • Do-calculus d-separation is approximate. The identifiability check uses a heuristic BFS-based assessment, not a full Pearl do-calculus solver. Non-identifiable paths may be incorrectly marked identifiable.
  • The server cannot access paywalled full text. All 16 data sources are public APIs and open access repositories. Papers available only behind institutional journal subscriptions are not included in the analysis corpus.
  • Self-calibration is circular on small corpora. With fewer than ~30 papers in the assembled network, the fixed-point calibration loop converges to trivial solutions. Gödelian self-reference depth will report low values (1-2) rather than meaningful convergence diagnostics.

Integrations

  • Zapier — trigger a fraud screen automatically when a new paper is flagged in a journal watch list
  • Make — build multi-step workflows that route high-severity GRIM flags to a review queue in Notion or Airtable
  • Google Sheets — pipe citation anomaly scores and HARKing probabilities into a running spreadsheet tracker for a watchlisted field
  • Apify API — trigger any of the 8 tools from Python, JavaScript, or any HTTP client as part of a larger research pipeline
  • Webhooks — receive a POST notification when a tool call completes or when a run produces anomaly scores above a configurable threshold
  • LangChain / LlamaIndex — connect this MCP server to an LLM agent that evaluates scientific claims in real time before generating answers

Troubleshooting

  • Tool returns very few flags despite expecting anomalies — the assembled corpus may be small. Try a broader query (e.g., the research field rather than a specific paper) to pull more papers across the 16 sources. Queries that return under 20 papers produce statistically underpowered GRIM and p-curve results.

  • selectionSeverity is very high (> 5) for every query — this can occur when the corpus contains very few studies with diverse effect sizes, causing the DerSimonian-Laird estimator to produce an unstable tau-squared. This is a data-scarcity artifact, not a real publication bias signal.

  • Spending limit reached error — the eventChargeLimitReached flag means your configured per-run spending limit was hit before the tool completed. Increase the limit in your Apify run settings or split your analysis across separate sessions.

  • Server returns 405 Method Not Allowed on GET /mcp — the MCP endpoint only accepts POST requests. GET is blocked by design per the MCP protocol specification. Use POST with a JSON-RPC body.

  • z-curve EDR greater than 1.0 or negative — this indicates a degenerate EM solution, usually because the corpus p-values are concentrated at a single value or the query returned no significant p-values. Add a broader topic term to diversify the paper corpus.

Responsible use

  • This server queries only publicly available academic databases, government research portals, and open-access repositories.
  • Statistical anomalies identified by these tools are screening signals, not proof of fraud or misconduct. Always apply professional judgment before acting on results.
  • Do not use output from this tool to make public accusations without independent verification by a qualified statistician or research integrity professional.
  • Comply with the terms of service of each underlying data source (OpenAlex CC0, PubMed public API, Crossref public API, etc.).
  • For guidance on responsible use of web scraping and data aggregation, see Apify's guide on web scraping legality.

FAQ

How does scientific fraud detection work in this MCP server? Each tool call assembles a research network from 16 academic data sources queried in parallel, then applies a specific forensic algorithm — GRIM/SPRITE for statistical consistency, p-curve/z-curve EM for p-hacking detection, Vevea-Hedges selection models for publication bias, TERGM for citation manipulation, Benford DCT + MinHash for data forensics, do-calculus for contamination tracing, and Bayesian surprise for HARKing detection.

How accurate is the GRIM test at detecting fabricated data? GRIM has a very low false-positive rate when the underlying data is truly integer-valued (e.g., Likert scales). Brown and Heathers (2017) found GRIM failures in roughly 50% of the papers they sampled from the social psychology literature. However, GRIM only flags mathematical impossibilities: a fabricator who knows about GRIM can produce GRIM-consistent fake data.

How many papers does each tool call analyze? Each tool call pulls up to 781 raw records across 16 sources (OpenAlex ≤80, PubMed ≤80, Semantic Scholar ≤80, Crossref ≤60, CORE ≤60, DBLP ≤60, arXiv ≤60, Europe PMC ≤60, ORCID ≤40, NIH Grants ≤40, ClinicalTrials.gov ≤40, Wayback Machine ≤30, GitHub ≤30, USPTO ≤30, Hacker News ≤30, website scrape ≤1). After deduplication and network assembly, the analysis typically runs on 50-200 unique paper nodes.

What is the difference between EDR and ERR in the z-curve analysis? Expected Discovery Rate (EDR) estimates the proportion of all tested hypotheses (significant or not) that correspond to true effects. Expected Replication Rate (ERR) estimates the probability that a randomly selected significant result would replicate in a direct replication study with the same sample size. ERR is typically higher than EDR because significant results are selected disproportionately from the better-powered tests, so their average power exceeds the average across all tests.

Can this tool definitively prove scientific fraud? No. These are statistical screening tools that identify patterns consistent with questionable research practices. High GRIM failure rates, flat p-curves, and extreme self-citation scores are red flags that warrant investigation — not proof of misconduct. Final determination requires independent verification by a qualified statistician and institutional investigation.

How is this different from StatCheck or the GRIM test calculator? StatCheck and web-based GRIM calculators require you to paste individual statistics manually or upload a single paper. This MCP server assembles a full paper corpus from 16 live databases, applies GRIM/SPRITE/Benford analysis across the entire corpus simultaneously, and integrates with the 7 other forensic tools — all in a single tool call from your AI assistant or API client.

Is it legal to use this tool to audit published research? All 16 data sources are publicly available academic databases and government portals. Analyzing publicly available bibliometric data for research integrity purposes is generally considered lawful, though you should still comply with each source's terms of service. For guidance on data access and scraping legality see Apify's guide.

How long does a typical tool call take? Each tool call fires 16 actors in parallel groups, with a 180-second timeout per actor. Typical wall-clock time is 30-90 seconds, depending on database response times and corpus size. The outer Promise.all groups bound total time to the slowest parallel group rather than the sum of all actors.

Can I schedule scientific fraud detection to run automatically? Yes. Use Apify's built-in scheduler to trigger tool calls at any interval — daily journal monitoring, weekly author tracking, or monthly field-level assessments. Combine with webhooks to route results to Slack, email, or a Notion database automatically.

What happens if one of the 16 data sources is unavailable? Each actor call is wrapped in a try/catch that returns an empty array on failure. The network assembly proceeds with whatever data was successfully retrieved. A single source going down does not crash the analysis — it just reduces the paper corpus size and may lower the statistical power of the forensic algorithms.

How does the self-calibration tool work? self_calibrate_detection runs all 7 detectors against the same corpus, then applies Platt scaling (logistic calibration) to each detector's scores using the other detectors as a pseudo-ground-truth reference. The fixed-point loop iterates until calibration slopes stabilize. The Gödel self-reference depth metric reports the recursion level at which convergence occurred; higher values indicate a more complex calibration landscape.

Can I use this with any MCP-compatible AI assistant? Yes. This server implements the MCP protocol over HTTP (Streamable HTTP transport) and is compatible with Claude Desktop, Cursor, Windsurf, and any other client that supports the MCP tools/call method. The endpoint is https://scientific-fraud-detection-mcp.apify.actor/mcp.

Help us improve

If you encounter issues, you can help us debug faster by enabling run sharing in your Apify account:

  1. Go to Account Settings > Privacy
  2. Enable Share runs with public Actor creators

This lets us see your run details when something goes wrong, so we can fix issues faster. Your data is only visible to the actor developer, not publicly.

Support

Found a bug or have a feature request? Open an issue in the Issues tab on this actor's page. For custom research integrity pipelines, extended data source integrations, or enterprise use cases, reach out through the Apify platform.

How it works

01

Configure

Set your parameters in the Apify Console or pass them via API.

02

Run

Click Start, trigger via API, webhook, or set up a schedule.

03

Get results

Download as JSON, CSV, or Excel. Integrate with 1,000+ apps.

Use cases

Research Integrity Officers

Screen authors, journals, and fields for statistical anomalies before opening formal investigations.

Meta-Analysts

Correct pooled effect estimates for publication bias before evidence synthesis.

AI Assistants

Evaluate scientific claims in real time through the MCP endpoint before generating answers.

Developers

Integrate via REST API or use as an MCP tool in AI workflows.

Ready to try Scientific Fraud Detection MCP?

Start for free on Apify. No credit card required.

Open on Apify Store