Scientific Fraud Detection MCP
Scientific fraud detection as a live MCP server — screen any research topic, author, or paper for statistical fabrication, p-hacking, publication bias, citation manipulation, data duplication, causal contamination, and HARKing using 8 forensic tools backed by 16 real-time academic sources. Built for research integrity officers, meta-analysts, AI coding assistants, and anyone who needs to audit the scientific literature without writing a single line of code.
Pricing
Pay Per Event model. You only pay for what you use.
| Event | Description | Price |
|---|---|---|
| audit-statistical-consistency | GRIM/SPRITE statistical forensics | $0.040 |
| analyze-p-curve-z-curve | Simonsohn p-curve and z-curve EM analysis | $0.035 |
| fit-selection-model-meta-analysis | Vevea-Hedges weight function meta-analysis | $0.045 |
| detect-citation-network-anomalies | TERGM temporal network analysis | $0.040 |
| screen-image-data-forensics | Error level analysis and Benford DCT | $0.045 |
| trace-causal-contamination | Do-calculus d-separation on claim DAG | $0.040 |
| detect-harking-bayesian-surprise | Bayesian surprise KL divergence | $0.035 |
| self-calibrate-detection | Brier score decomposition self-audit | $0.040 |
Example: at the average rate of $0.04 per event, 100 events ≈ $4.00 · 1,000 events ≈ $40.00
Connect to your AI agent
Add this MCP server to Claude Desktop, Cursor, Windsurf, or any MCP-compatible client.
https://ryanclinton--scientific-fraud-detection-mcp.apify.actor/mcp
{
"mcpServers": {
"scientific-fraud-detection-mcp": {
"url": "https://ryanclinton--scientific-fraud-detection-mcp.apify.actor/mcp"
}
}
}
Documentation
Each tool call orchestrates parallel queries across OpenAlex, PubMed, Semantic Scholar, arXiv, Crossref, CORE, Europe PMC, ORCID, DBLP, NIH Grants, ClinicalTrials.gov, the Wayback Machine, GitHub, USPTO Patents, Hacker News, and Google Scholar — then applies a battery of forensic algorithms, from GRIM/SPRITE statistical auditing through Vevea-Hedges selection models, TERGM citation anomaly detection, Benford DCT forensics, and do-calculus contamination tracing, to Bayesian surprise HARKing detection. All structured output is returned as JSON, ready to pipe directly into downstream analysis pipelines.
What data can you extract?
| Data Point | Source | Example |
|---|---|---|
| 📄 Academic papers, citation counts, concepts | OpenAlex + Crossref + CORE | "Effect of semaglutide on HbA1c: N=312, cited 847x" |
| 🧬 Biomedical literature with MeSH terms | PubMed + Europe PMC | PMID 38291045, mesh: ["Diabetes Mellitus, Type 2"] |
| 🔬 Research papers with semantic graphs | Semantic Scholar | paperId: "abc123", influentialCitationCount: 44 |
| 📐 Reported statistics flagged by GRIM/SPRITE | All paper sources | mean=3.47, N=20, GRIM_fail=true, deviation=0.02 |
| 📊 P-value distributions and z-scores | Extracted from paper corpus | rightSkewP=0.003, EDR=0.62, ERR=0.54 |
| ⚖️ Selection-adjusted pooled effect sizes | Meta-analysis synthesis | pooledEffect=0.42, adjustedEffect=0.28, I²=67% |
| 🔗 Citation ring and self-citation anomalies | Citation graph analysis | citationGini=0.71, anomalyType="citation_ring" |
| 🖼️ Benford DCT and MinHash forensic flags | Forensic screening | dctAnomaly=0.24, benfordDeviation=0.18, confidence=0.87 |
| 🗺️ Causal contamination paths from retracted papers | BFS + do-calculus | pathway: ["retraction-A","citing-B","citing-C"], strength=0.73 |
| 🤔 HARKing probability via Bayesian surprise | KL divergence scoring | klDivergence=2.14, harkingProbability=0.81 |
| 🎯 Detector calibration metrics | Brier score decomposition | brierScore=0.09, calibrationSlope=0.94, converged=true |
| 🧪 Preprints and early-stage research | arXiv | arXiv:2401.12345, submittedDate: "2024-01-22" |
Why use Scientific Fraud Detection MCP?
Manually auditing a research literature for questionable practices is prohibitively slow. For a single meta-analyst, checking p-value distributions across 200 papers, tracing retraction contamination through citation chains, and auditing statistical consistency by hand is a weeks-long project — prone to both omission errors and cognitive fatigue.
This MCP server automates the entire pipeline. One tool call, one query string, and within minutes you receive GRIM consistency flags, p-curve shape analysis, selection-adjusted effect sizes, TERGM citation anomaly scores, forensic manipulation flags, causal contamination maps, HARKing signals, and calibrated confidence metrics — all from a live, cross-database synthesis of up to 16 academic sources queried in parallel.
- Scheduling — run weekly integrity monitors on key journals or authors; Apify handles cron scheduling
- API access — trigger any of the 8 tools from Python, JavaScript, or any HTTP client without a GUI
- Monitoring — receive Slack or email alerts when a tool call fails or returns anomalous results
- Integrations — connect results to Zapier, Make, Google Sheets, or HubSpot via Apify's native connectors
- MCP protocol — works natively with Claude, Cursor, Windsurf, and any MCP-compatible AI assistant
Features
- GRIM test (Granularity-Related Inconsistency of Means) — checks whether reported means are mathematically possible given integer raw data and sample size; flags impossible values where mean × N is not an integer
- SPRITE (Sample Parameter Reconstruction via Iterative Techniques) — constrained integer programming that reconstructs feasible integer distributions matching reported mean, SD, min, max, and N; flags impossible parameter combinations
- Benford's law first-digit analysis — applies chi-squared goodness-of-fit against the Benford distribution on reported numerical values; detects digit-frequency anomalies that signal fabricated data
- P-curve right-skew test — implements Simonsohn-Nelson-Simmons (2014) Stouffer's method on p-values conditional on significance; right-skewed = evidential value, flat = p-hacking
- Z-curve EM algorithm — fits a finite mixture of truncated normal distributions using Expectation-Maximization with 3 components and 20 iterations; returns Expected Discovery Rate (EDR) and Expected Replication Rate (ERR)
- Kolmogorov-Smirnov flatness test — tests whether p-curve shape is consistent with uniform distribution (p-hacking) against right-skewed alternative
- Vevea-Hedges weight-function selection model — step-function w(p) at thresholds p=0.025, 0.05, 0.10 with DerSimonian-Laird random-effects, tau-squared heterogeneity, and I-squared; returns unadjusted and selection-adjusted pooled effect sizes
- TERGM (Temporal Exponential Random Graph Model) — models citation network evolution P(G_t | G_{t-1}) ~ exp(θ × s(G_t, G_{t-1})); identifies citation rings via mutual-citation detection, self-citation excess via authored-paper graph traversal, and coerced citations via clustering coefficient thresholds
- Citation Gini coefficient — measures inequality in citation distribution across the paper corpus; high Gini (>0.70) indicates citation concentration consistent with cartel behavior
- Benford DCT forensics — first-digit analysis on DCT frequency coefficients; natural images follow Benford's law, manipulated ones deviate; detects image duplication and data fabrication signals
- MinHash LSH (Locality-Sensitive Hashing) — k-shingle Jaccard similarity estimation for detecting near-duplicate papers and text reuse across publications in the corpus
- Do-calculus d-separation — BFS path-finding on citation DAG with identifiability checks for unblocked backdoor paths via confounders; traces causal contamination from flagged or retracted papers to downstream citations
- Dirichlet process CRP clustering — Chinese Restaurant Process clustering of contamination sources with concentration parameter alpha; groups related contamination chains
- Bayesian surprise HARKing detection — D_KL(posterior ‖ prior) using normal-normal conjugate update; high KL divergence with low hypothesis consistency signals post-hoc hypothesis fabrication
- Brier score decomposition — decomposes calibration into reliability + resolution + uncertainty components; used in self-calibration fixed-point loop with Platt scaling (logistic)
- Parallel 16-actor orchestration — all 16 data source actors run in parallel groups via Promise.all; each group fires 3-5 actors simultaneously, reducing total wall-clock time vs sequential fetching
Use cases for scientific fraud detection
Research integrity assessment
Integrity officers at universities, funding agencies, and journals need to screen papers before or after publication. This MCP provides a full forensic report — statistical consistency, p-value distribution shape, selection-adjusted effect sizes, and citation network anomalies — in a single session. A manual equivalent would take a trained statistician two to three days per paper cluster.
Replication crisis meta-analysis
Researchers studying replicability across psychology, medicine, or economics can feed an entire subfield into analyze_p_curve_z_curve to estimate the Expected Replication Rate and detect systematic p-hacking. The z-curve EM algorithm returns the full mixture model — component means and weights — so analysts can segment high-credibility from low-credibility literature programmatically.
AI assistant research grounding
AI coding assistants, Claude, and other LLM applications use MCP tools to retrieve and process real-world data. Connecting this server to an AI assistant allows it to verify scientific claims in real time — asking "is the evidence for this treatment robust?" triggers a full p-curve and selection model analysis before the assistant answers.
Systematic review and meta-analysis support
Meta-analysts running Cochrane-style reviews can use fit_selection_model_meta_analysis to estimate publication-bias-corrected pooled effects, detect_citation_network_anomalies to flag citation cartels that may have inflated a literature, and trace_causal_contamination to identify which papers in a review body are downstream of retracted or problematic sources.
Citation manipulation investigation
Journal editors and retraction watch investigators can query an author name or journal to detect citation rings, self-citation excess beyond 30%, and coordinated coerced citation patterns using TERGM coefficients and clustering analysis. The Gini coefficient provides a single inequality metric that can be compared across journals.
Competitive intelligence for science policy
Policy analysts and science funders can benchmark research fields by feeding topic queries into self_calibrate_detection to receive cross-detector Brier scores. Fields with poor calibration (high Brier scores, low EDR) warrant additional scrutiny before funding allocation or policy decisions.
How to connect this MCP server
Connecting takes under two minutes. No API keys, no environment setup — just add the server URL to your MCP client configuration.
- Copy the MCP endpoint URL — https://scientific-fraud-detection-mcp.apify.actor/mcp
- Open your MCP client config — Claude Desktop (claude_desktop_config.json), Cursor MCP settings, or Windsurf
- Paste the server block — use the JSON snippet for your client below
- Start a session — ask your AI assistant to run any of the 8 tools; results stream back as structured JSON
Claude Desktop
Add to ~/Library/Application Support/Claude/claude_desktop_config.json (macOS) or %APPDATA%\Claude\claude_desktop_config.json (Windows):
{
"mcpServers": {
"scientific-fraud-detection": {
"url": "https://scientific-fraud-detection-mcp.apify.actor/mcp"
}
}
}
Cursor
Add to your Cursor MCP settings panel or .cursor/mcp.json:
{
"mcpServers": {
"scientific-fraud-detection": {
"url": "https://scientific-fraud-detection-mcp.apify.actor/mcp"
}
}
}
Windsurf
Add to ~/.codeium/windsurf/mcp_config.json:
{
"mcpServers": {
"scientific-fraud-detection": {
"url": "https://scientific-fraud-detection-mcp.apify.actor/mcp"
}
}
}
MCP tools reference
This server exposes 8 tools. All tools accept a single query string parameter (research topic, author name, or paper title). Each call queries up to 16 actors in parallel before running the analysis.
| Tool | Price | Algorithm | Best for |
|---|---|---|---|
| audit_statistical_consistency | $0.040 | GRIM + SPRITE + Benford chi-squared | Detecting impossible reported statistics |
| analyze_p_curve_z_curve | $0.035 | Simonsohn p-curve + z-curve EM (truncated normals) | P-hacking detection and replicability estimation |
| fit_selection_model_meta_analysis | $0.045 | Vevea-Hedges w(p) + DerSimonian-Laird | Publication-bias-corrected meta-analysis |
| detect_citation_network_anomalies | $0.040 | TERGM + Gini coefficient + clustering | Citation rings and self-citation excess |
| screen_image_data_forensics | $0.045 | Benford DCT + MinHash LSH | Image manipulation and text duplication |
| trace_causal_contamination | $0.040 | BFS + do-calculus d-separation + Dirichlet CRP | Retraction contamination propagation |
| detect_harking_bayesian_surprise | $0.035 | D_KL(posterior ‖ prior) + Brier calibration | Post-hoc hypothesis detection |
| self_calibrate_detection | $0.040 | Platt scaling fixed-point + Brier decomposition | Pipeline reliability and meta-assessment |
Tool: audit_statistical_consistency
Audits reported means and standard deviations for mathematical feasibility. For each paper in the assembled network:
- GRIM check: verifies mean × N is an integer (required for integer-valued Likert-scale data)
- SPRITE check: verifies the sum-of-squares decomposition (n-1)·SD² + n·mean² is compatible with integer raw data
- Benford analysis: computes first-digit distribution chi-squared against Benford's expected log10(1 + 1/d) frequencies
- Returns flags sorted by deviation magnitude, with global chi-squared statistic across the full corpus
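The GRIM arithmetic is simple enough to sketch. A minimal Python illustration (this is not the server's actual code; the function name and rounding tolerance are assumptions):

```python
def grim_check(reported_mean: float, n: int, decimals: int = 2):
    """GRIM: with integer raw data, any valid mean is a multiple of 1/n.
    Returns (consistent, nearest_valid_mean). Illustrative sketch only."""
    nearest = round(round(reported_mean * n) / n, decimals)
    consistent = nearest == round(reported_mean, decimals)
    return consistent, nearest

# With N=20, valid means are multiples of 0.05, so a reported 3.47 is impossible:
print(grim_check(3.47, 20))               # -> (False, 3.45)
print(grim_check(4.125, 24, decimals=3))  # -> (True, 4.125)
```

The same nearest-valid-mean value feeds the reconstructedValue and deviation fields in the output example below.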
Tool: analyze_p_curve_z_curve
Analyzes the shape of the p-value distribution across significant results:
- P-curve: applies Stouffer's method on conditional p-values (p/0.05); right-skew p-value below 0.05 indicates evidential value
- KS flatness test: tests uniformity of the conditional distribution; flat curve (p < 0.05) signals p-hacking
- Z-curve EM: fits K=3-component truncated normal mixture over 20 EM iterations; returns Expected Discovery Rate and Expected Replication Rate
- Returns component means, mixture weights, and KS fitness statistic
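The right-skew step can be sketched as follows (an assumption-laden illustration using Python's statistics.NormalDist, not the server's implementation):

```python
import math
from statistics import NormalDist

def p_curve_right_skew(p_values, alpha=0.05):
    """Stouffer's method on pp-values (p / alpha). A small combined
    p-value means the curve is right-skewed, i.e. has evidential value."""
    nd = NormalDist()
    pps = [p / alpha for p in p_values if p < alpha]
    z = sum(nd.inv_cdf(pp) for pp in pps) / math.sqrt(len(pps))
    return nd.cdf(z)

# A pile of very small p-values is right-skewed:
print(p_curve_right_skew([0.001, 0.003, 0.005]))  # well below 0.05
# Evenly spread just-significant p-values are flat:
print(p_curve_right_skew([0.01, 0.02, 0.03, 0.04]))  # approximately 0.5
```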
Tool: fit_selection_model_meta_analysis
Fits a Vevea-Hedges weight-function selection model:
- Step weights: w(p) = 1.0 for p ≤ 0.05, 0.3 for 0.05 < p ≤ 0.10, 0.1 for p > 0.10
- DerSimonian-Laird: random-effects pooling with Q-statistic, tau-squared between-study variance, and I-squared heterogeneity
- Adjusted effect: selection-corrected pooled estimate using publication-probability-weighted inverse-variance
- Returns unadjusted pooled effect, selection-adjusted effect, and selectionSeverity (standardized difference)
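The pooling arithmetic can be sketched like this (illustrative only; function names are assumptions, and the step weights come from the thresholds above):

```python
def selection_weight(p):
    """Vevea-Hedges step function over p-value thresholds."""
    return 1.0 if p <= 0.05 else 0.3 if p <= 0.10 else 0.1

def dersimonian_laird(effects, ses):
    """DerSimonian-Laird random-effects pooling; returns (pooled, tau2, i2)."""
    w = [1 / se ** 2 for se in ses]
    fixed = sum(wi * e for wi, e in zip(w, effects)) / sum(w)
    q = sum(wi * (e - fixed) ** 2 for wi, e in zip(w, effects))
    df = len(effects) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    w_re = [1 / (se ** 2 + tau2) for se in ses]
    pooled = sum(wi * e for wi, e in zip(w_re, effects)) / sum(w_re)
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, tau2, i2

pooled, tau2, i2 = dersimonian_laird([0.55, 0.40, 0.12], [0.10, 0.12, 0.15])
```

The selection-adjusted estimate then reweights each study's inverse-variance weight by selection_weight(p) before pooling.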
Tool: detect_citation_network_anomalies
Models the citation network as a temporal ERGM:
- Citation rings: detects mutual citation pairs (A cites B AND B cites A) with TERGM coefficient scoring
- Self-citation excess: flags authors whose self-citation rate exceeds 30% of total citations
- Clustering anomalies: computes local clustering coefficients; values above 0.5 in a neighborhood of 3+ papers flag coordinated citing
- Gini coefficient: measures citation inequality using the standard rank-weighted formula; approaches 1 for extreme concentration
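The Gini part is straightforward to illustrate (a sketch, not the server's code):

```python
def citation_gini(citations):
    """Gini coefficient of citation counts via the rank-weighted formula
    G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with 1-based ranks
    over the sorted counts. 0 = perfectly even; near 1 = concentrated."""
    xs = sorted(citations)
    n, total = len(xs), sum(xs)
    if total == 0:
        return 0.0
    return 2 * sum(i * x for i, x in enumerate(xs, 1)) / (n * total) - (n + 1) / n

print(citation_gini([10, 10, 10, 10]))  # 0.0 -- perfectly even distribution
print(citation_gini([0, 0, 0, 100]))    # 0.75 -- all citations in one paper
```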
Tool: screen_image_data_forensics
Screens the paper corpus for forensic manipulation signals:
- Benford DCT: compares first-digit frequency of simulated DCT coefficients against Benford's law; deviation above threshold flags potential manipulation
- MinHash LSH: estimates Jaccard similarity between paper titles using a simple hash function; pairs with estimated Jaccard similarity above 0.70 are flagged as near-duplicates
- Returns per-paper confidence scores, average corpus confidence, and minHashSimilarityThreshold
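A minimal MinHash sketch over title shingles (hash choice and shingle size are assumptions; the description above says only "a simple hash function"):

```python
import hashlib

def shingles(text, k=3):
    """Lowercased character k-shingles of a title."""
    t = text.lower()
    return {t[i:i + k] for i in range(len(t) - k + 1)}

def minhash_signature(items, num_hashes=64):
    """Per seeded hash, keep the minimum hash over the set; the fraction
    of matching minima across two signatures estimates Jaccard similarity."""
    return [min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
                for s in items)
            for seed in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature(shingles("Ego depletion: is the active self limited?"))
b = minhash_signature(shingles("Ego depletion: is the active self a limited resource?"))
print(estimated_jaccard(a, b))  # high for near-duplicate titles
```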
Tool: trace_causal_contamination
Traces how problematic research propagates through the literature:
- BFS path-finding: identifies all citation paths from flagged sources to downstream papers
- Do-calculus identifiability: checks d-separation for each contamination path; paths with unblocked backdoor confounders are marked non-identifiable
- Dirichlet CRP clustering: groups contamination paths by source similarity using Chinese Restaurant Process with default concentration alpha
- Returns per-path contamination strength, total identifiable paths, and Dirichlet concentration estimate
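The BFS stage might look like this (the adjacency shape, per-hop decay factor, and depth cap are illustrative assumptions, not the server's parameters):

```python
from collections import deque

def contamination_paths(cited_by, source, max_depth=4, decay=0.9):
    """Enumerate citation paths from a flagged source paper to downstream
    papers, assigning each path an assumed per-hop strength decay."""
    queue = deque([[source]])
    while queue:
        path = queue.popleft()
        if len(path) > 1:
            yield path, decay ** (len(path) - 1)
        if len(path) <= max_depth:
            for nxt in cited_by.get(path[-1], []):
                if nxt not in path:  # citation DAGs shouldn't cycle, but be safe
                    queue.append(path + [nxt])

graph = {"retraction-A": ["citing-B"], "citing-B": ["citing-C"]}
for path, strength in contamination_paths(graph, "retraction-A"):
    print(path, round(strength, 2))
```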
Tool: detect_harking_bayesian_surprise
Computes Bayesian surprise as a HARKing signal:
- Normal-normal conjugate update: prior N(μ₀=0, σ₀²=1), likelihood from paper statistics, posterior via standard Bayesian update
- KL divergence: D_KL(posterior ‖ prior) measures how far the results shifted the prior; high values indicate unexpected results
- Hypothesis consistency: measures alignment between stated hypothesis direction and observed results; high surprise + low consistency = HARKing signal
- Returns per-paper harkingProbability, suspectedHarking count, and corpus Brier score
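The surprise score reduces to a closed-form KL divergence between two normals. A sketch under the stated N(0, 1) prior (function and variable names are assumptions):

```python
import math

def kl_normal(mu_q, var_q, mu_p, var_p):
    """D_KL(N(mu_q, var_q) || N(mu_p, var_p)) in nats."""
    return 0.5 * (var_q / var_p + (mu_q - mu_p) ** 2 / var_p
                  - 1 + math.log(var_p / var_q))

def bayesian_surprise(obs_mean, obs_se, mu0=0.0, var0=1.0):
    """Conjugate normal-normal update of the prior by an observed effect,
    then KL(posterior || prior) as the surprise score."""
    var_post = 1 / (1 / var0 + 1 / obs_se ** 2)
    mu_post = var_post * (mu0 / var0 + obs_mean / obs_se ** 2)
    return kl_normal(mu_post, var_post, mu0, var0)

# A precisely estimated d = 2.0 result against a skeptical prior is very surprising:
print(bayesian_surprise(2.0, 0.1))  # well above the 1.5 scrutiny threshold
```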
Tool: self_calibrate_detection
Runs a self-calibration pass over all 7 detectors:
- Platt scaling: each detector's raw scores pass through a logistic calibration 1 / (1 + exp(-(a·x + b))) fitted to the other detectors
- Fixed-point iteration: calibration loop continues until convergence (or maximum iterations)
- Brier decomposition: overall score decomposed into reliability (calibration error), resolution (variance), and uncertainty components
- Returns per-detector true positive rate, false positive rate, calibration slope, and Gödel self-reference depth (recursion level at convergence)
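Murphy's decomposition of the Brier score can be sketched with equal-width probability bins (the binning scheme is an assumption):

```python
def brier_decomposition(probs, outcomes, n_bins=10):
    """Split forecasts into probability bins, then compute
    reliability - resolution + uncertainty (which sums to the Brier score
    when forecasts within each bin are homogeneous)."""
    n = len(probs)
    base_rate = sum(outcomes) / n
    uncertainty = base_rate * (1 - base_rate)
    bins = {}
    for p, o in zip(probs, outcomes):
        bins.setdefault(min(int(p * n_bins), n_bins - 1), []).append((p, o))
    reliability = resolution = 0.0
    for members in bins.values():
        nk = len(members)
        mean_p = sum(p for p, _ in members) / nk
        mean_o = sum(o for _, o in members) / nk
        reliability += nk / n * (mean_p - mean_o) ** 2
        resolution += nk / n * (mean_o - base_rate) ** 2
    return reliability, resolution, uncertainty

rel, res, unc = brier_decomposition([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
print(rel - res + unc)  # recombines to the Brier score for these forecasts
```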
Input parameters
All tools accept one parameter:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| query | string | Yes | — | Research topic, author name, or paper title to investigate. Used as the search query across all 16 data sources. |
Input examples
Investigate a specific research area for fraud signals:
{
"query": "social priming psychology replication"
}
Audit a specific author's statistical output:
{
"query": "Diederik Stapel social psychology Netherlands"
}
Check a specific high-profile paper:
{
"query": "Power poses Amy Cuddy cortisol testosterone effect"
}
Input tips
- Be specific for targeted audits — author name plus institution narrows the paper corpus and reduces noise in GRIM/SPRITE flags
- Use topic queries for field-level analysis — broad queries like "nudge behavioral economics" give better p-curve and z-curve results because they pull larger paper sets
- Combine tools in sequence — run analyze_p_curve_z_curve first to assess field-level credibility, then fit_selection_model_meta_analysis to get the bias-corrected effect estimate, then detect_citation_network_anomalies to check for cartel amplification
- For contamination tracing — name the specific retracted paper or author whose downstream influence you want to map
Output example
audit_statistical_consistency response for "ego depletion willpower psychology":
{
"flags": [
{
"paper": "Glucose and self-regulation: A meta-analytic review",
"test": "GRIM_fail",
"reportedValue": 3.47,
"reconstructedValue": 3.45,
"spriteConsistent": false,
"deviation": 0.02,
"benfordDeviation": 0.091
},
{
"paper": "Self-control depletion and performance on a Stroop task (N=24)",
"test": "GRIM_pass",
"reportedValue": 4.125,
"reconstructedValue": 4.125,
"spriteConsistent": true,
"deviation": 0.0,
"benfordDeviation": 0.014
},
{
"paper": "Radish paradigm replication study: ego depletion revisited",
"test": "GRIM_fail",
"reportedValue": 5.8,
"reconstructedValue": 5.75,
"spriteConsistent": false,
"deviation": 0.05,
"benfordDeviation": 0.122
}
],
"totalAudited": 87,
"inconsistentCount": 19,
"spriteViolations": 14,
"benfordChiSquared": 23.41,
"benfordPValue": 0.003
}
analyze_p_curve_z_curve response:
{
"pValues": [0.008, 0.012, 0.021, 0.034, 0.041, 0.044, 0.048],
"zScores": [2.65, 2.51, 2.31, 2.12, 2.05, 2.02, 1.98],
"rightSkewTest": 0.041,
"flatnessTest": 0.218,
"evidentialValue": true,
"pHackingSuspected": false,
"expectedDiscoveryRate": 0.58,
"expectedReplicationRate": 0.49,
"zCurveMixtureMeans": [2.48, 3.51, 5.02],
"zCurveMixtureWeights": [0.41, 0.34, 0.25],
"zCurveFitness": 0.871
}
detect_citation_network_anomalies response:
{
"anomalies": [
{
"entity": "Ego depletion: Is the active self a lim ↔ Thinking about you: So",
"anomalyType": "citation_ring",
"severity": 0.82,
"tergmCoefficient": 0.74,
"clusteringCoefficient": 0
},
{
"entity": "Roy F. Baumeister",
"anomalyType": "self_citation_excess",
"severity": 0.67,
"tergmCoefficient": 0.44,
"clusteringCoefficient": 0.44
}
],
"totalAnomalies": 7,
"networkDensity": 0.0341,
"tergmGofPValue": 0.097,
"citationGini": 0.683
}
Output fields
audit_statistical_consistency
| Field | Type | Description |
|---|---|---|
| flags[] | array | Per-paper statistical flags, sorted by deviation magnitude |
| flags[].paper | string | Paper title (truncated to 60 chars) |
| flags[].test | string | GRIM result: "GRIM_pass" or "GRIM_fail" |
| flags[].reportedValue | number | The mean as reported in the paper |
| flags[].reconstructedValue | number | GRIM-reconstructed nearest valid mean |
| flags[].spriteConsistent | boolean | Whether the mean+SD combination is SPRITE-feasible |
| flags[].deviation | number | Absolute difference between reported and reconstructed mean |
| flags[].benfordDeviation | number | Deviation of first digit from Benford's expected frequency |
| totalAudited | number | Total papers in the analysis corpus |
| inconsistentCount | number | Papers failing the GRIM test |
| spriteViolations | number | Papers where spriteConsistent=false |
| benfordChiSquared | number | Global Benford chi-squared statistic across all reported values |
| benfordPValue | number | P-value for Benford chi-squared test |
analyze_p_curve_z_curve
| Field | Type | Description |
|---|---|---|
| pValues[] | array | Extracted significant p-values from the corpus |
| zScores[] | array | Corresponding two-tailed z-scores |
| rightSkewTest | number | P-value for right-skew test (Stouffer's method); < 0.05 = evidential value |
| flatnessTest | number | KS p-value for flatness test; < 0.05 = p-hacking suspected |
| evidentialValue | boolean | True if right-skew significant and not flat |
| pHackingSuspected | boolean | True if flatness test significant |
| expectedDiscoveryRate | number | Z-curve EDR: proportion of studies with true effects |
| expectedReplicationRate | number | Z-curve ERR: expected probability of successful replication |
| zCurveMixtureMeans | array | EM-fitted component means (3 components) |
| zCurveMixtureWeights | array | EM-fitted mixture weights (3 components) |
| zCurveFitness | number | 1 - KS statistic; higher = better fit |
fit_selection_model_meta_analysis
| Field | Type | Description |
|---|---|---|
| studies[] | array | Per-study data with effect sizes and selection weights |
| studies[].study | string | Study identifier (paper title, truncated) |
| studies[].effectSize | number | Cohen's d effect size estimate |
| studies[].standardError | number | Standard error of the effect size |
| studies[].weight | number | Inverse-variance weight |
| studies[].selectionProbability | number | Vevea-Hedges publication probability: 1.0, 0.3, or 0.1 |
| pooledEffect | number | Unadjusted DerSimonian-Laird pooled effect (Cohen's d) |
| pooledSE | number | Standard error of pooled effect |
| adjustedEffect | number | Vevea-Hedges selection-adjusted pooled effect |
| tauSquared | number | Between-study variance (DerSimonian-Laird estimator) |
| iSquared | number | Heterogeneity as percentage: (Q - df) / Q |
| selectionSeverity | number | Standardized bias: \|pooled - adjusted\| / pooledSE |
detect_citation_network_anomalies
| Field | Type | Description |
|---|---|---|
| anomalies[] | array | Detected citation anomalies sorted by severity |
| anomalies[].entity | string | Paper pair (rings) or author name (self-citation) |
| anomalies[].anomalyType | string | "citation_ring", "self_citation_excess", or "coerced_citation" |
| anomalies[].severity | number | 0–1 severity score |
| anomalies[].tergmCoefficient | number | TERGM reciprocity/transitivity coefficient |
| anomalies[].clusteringCoefficient | number | Local clustering coefficient for the entity |
| totalAnomalies | number | Total anomalies detected |
| networkDensity | number | Observed edges / possible edges in the citation graph |
| tergmGofPValue | number | TERGM goodness-of-fit p-value |
| citationGini | number | Gini coefficient of citation distribution (0 = equal, 1 = concentrated) |
screen_image_data_forensics
| Field | Type | Description |
|---|---|---|
| flags[] | array | Per-paper forensic flags |
| flags[].paper | string | Paper title |
| flags[].flagType | string | "duplicate_region", "benford_violation", or "noise_pattern" |
| flags[].confidence | number | 0–0.99 forensic confidence score |
| flags[].dctAnomaly | number | Magnitude of DCT frequency-domain anomaly |
| flags[].benfordDeviation | number | First-digit distribution deviation from Benford's law |
| totalScreened | number | Total papers screened |
| flaggedCount | number | Papers meeting the forensic flag threshold |
| averageConfidence | number | Mean confidence across all flagged papers |
| minHashSimilarityThreshold | number | Jaccard threshold used for duplicate detection |
trace_causal_contamination
| Field | Type | Description |
|---|---|---|
| paths[] | array | Causal contamination pathways |
| paths[].source | string | Origin paper or retracted work |
| paths[].target | string | Downstream affected paper |
| paths[].pathway | array | Ordered list of node IDs in the contamination chain |
| paths[].contaminationStrength | number | 0–1 strength of causal link |
| paths[].doCalculusIdentifiable | boolean | True if no unblocked backdoor confounders found |
| totalPaths | number | Total contamination paths detected |
| maxContamination | number | Highest contamination strength in the network |
| identifiableCount | number | Paths that are causally identifiable |
| dirichletConcentration | number | Estimated Dirichlet concentration parameter alpha |
detect_harking_bayesian_surprise
| Field | Type | Description |
|---|---|---|
| signals[] | array | Per-paper HARKing signals |
| signals[].paper | string | Paper title |
| signals[].klDivergence | number | D_KL(posterior ‖ prior); values > 1.5 warrant scrutiny |
| signals[].posteriorShift | number | Magnitude of posterior mean shift from prior |
| signals[].hypothesisConsistency | number | 0–1 alignment between hypothesis and results |
| signals[].harkingProbability | number | 0–1 estimated probability of HARKing |
| totalScreened | number | Total papers screened |
| suspectedHarking | number | Papers with harkingProbability > 0.70 |
| averageSurprise | number | Mean KL divergence across the corpus |
| brierScore | number | Corpus-level Brier score for calibration |
self_calibrate_detection
| Field | Type | Description |
|---|---|---|
| metrics[] | array | Per-detector calibration metrics |
| metrics[].detector | string | Detector name |
| metrics[].truePositiveRate | number | TPR at default threshold |
| metrics[].falsePositiveRate | number | FPR at default threshold |
| metrics[].brierScore | number | Brier score for this detector |
| metrics[].calibrationSlope | number | Platt scaling slope; 1.0 = perfectly calibrated |
| metrics[].fixedPointConverged | boolean | Whether the Platt scaling loop converged |
| overallBrier | number | Weighted Brier score across all 7 detectors |
| calibrationError | number | Reliability component of Brier decomposition |
| fixedPointIterations | number | Iterations required for convergence |
| godelSelfReferenceDepth | number | Recursion depth at which self-calibration stabilized |
How much does it cost to run scientific fraud detection?
This MCP server uses pay-per-event pricing — you pay a fixed amount per tool call. Apify platform compute costs are included.
| Scenario | Tool | Cost per call | 10 calls | 50 calls |
|---|---|---|---|---|
| P-hacking screen | analyze_p_curve_z_curve | $0.035 | $0.35 | $1.75 |
| Statistical audit | audit_statistical_consistency | $0.040 | $0.40 | $2.00 |
| HARKing detection | detect_harking_bayesian_surprise | $0.035 | $0.35 | $1.75 |
| Citation network | detect_citation_network_anomalies | $0.040 | $0.40 | $2.00 |
| Publication bias | fit_selection_model_meta_analysis | $0.045 | $0.45 | $2.25 |
| Image forensics | screen_image_data_forensics | $0.045 | $0.45 | $2.25 |
| Contamination trace | trace_causal_contamination | $0.040 | $0.40 | $2.00 |
| Self-calibration | self_calibrate_detection | $0.040 | $0.40 | $2.00 |
A full 8-tool audit of a single research topic costs approximately $0.32. Auditing 100 research topics across all 8 tools costs approximately $32.
Apify's free tier includes $5 of monthly platform credits — enough for 125+ individual tool calls before any payment is required. You can set a maximum spending limit per run to control costs; the server stops charging when your budget is reached.
Compare this to commercial research integrity tools like iThenticate ($100+/month) or manual statistician time ($150-300/hour) — this server provides the statistical forensics layer that no turnkey tool currently offers.
Using the API directly
You can trigger any tool call programmatically via the Apify API without an MCP client.
Python
from apify_client import ApifyClient
client = ApifyClient("YOUR_API_TOKEN")
run = client.actor("ryanclinton/scientific-fraud-detection-mcp").call(run_input={})
# The server runs in Standby mode; use the MCP endpoint for tool calls
# To call a specific tool via HTTP:
import requests
response = requests.post(
"https://scientific-fraud-detection-mcp.apify.actor/mcp",
json={
"jsonrpc": "2.0",
"id": 1,
"method": "tools/call",
"params": {
"name": "audit_statistical_consistency",
"arguments": {
"query": "ego depletion willpower psychology Baumeister"
}
}
},
headers={"Authorization": "Bearer YOUR_API_TOKEN"}
)
result = response.json()
flags = result["result"]["content"][0]["text"]
print(f"Audit result: {flags[:500]}")
JavaScript
// Call a tool directly via the MCP HTTP endpoint
const response = await fetch(
"https://scientific-fraud-detection-mcp.apify.actor/mcp",
{
method: "POST",
headers: {
"Content-Type": "application/json",
"Authorization": "Bearer YOUR_API_TOKEN",
},
body: JSON.stringify({
jsonrpc: "2.0",
id: 1,
method: "tools/call",
params: {
name: "analyze_p_curve_z_curve",
arguments: {
query: "social priming unconscious cognition replication",
},
},
}),
}
);
const data = await response.json();
const text = data.result.content[0].text;
const result = JSON.parse(text);
console.log(`EDR: ${result.expectedDiscoveryRate}`);
console.log(`ERR: ${result.expectedReplicationRate}`);
console.log(`Evidential value: ${result.evidentialValue}`);
console.log(`P-hacking suspected: ${result.pHackingSuspected}`);
cURL

```shell
# Call audit_statistical_consistency
curl -X POST "https://scientific-fraud-detection-mcp.apify.actor/mcp" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -d '{
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
      "name": "audit_statistical_consistency",
      "arguments": {
        "query": "precognition Daryl Bem feeling the future"
      }
    }
  }'

# List available tools
curl -X POST "https://scientific-fraud-detection-mcp.apify.actor/mcp" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -d '{
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/list",
    "params": {}
  }'
```
How Scientific Fraud Detection MCP works
Phase 1 — Parallel data collection across 16 sources
Every tool call triggers buildNetwork(), which fires five parallel Promise.all groups:
- Academic group (OpenAlex ≤80, Crossref ≤60, CORE ≤60, DBLP ≤60, arXiv ≤60) — retrieves papers, DOI metadata, citation counts, preprints
- Biomedical group (PubMed ≤80, Semantic Scholar ≤80, Europe PMC ≤60, ORCID ≤40) — adds MeSH terms, semantic graphs, researcher profiles
- Clinical group (NIH Grants ≤40, ClinicalTrials.gov ≤40) — adds funding context, registered trial enrollment counts
- Archival group (Wayback Machine ≤30, Website to Markdown for Google Scholar) — captures historical versions, deleted content
- Technical group (GitHub ≤30, USPTO Patents ≤30, Hacker News ≤30) — surfaces code reproducibility signals, patent filings, community discussion
All 16 actor calls run with 180-second timeouts. Failures are caught and return empty arrays, ensuring partial results rather than total failure. The five groups run in parallel via the outer Promise.all, so total latency is bounded by the slowest group rather than the sum.
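This grouping-with-timeouts pattern can be sketched in Python with asyncio (the server itself is JavaScript; the source names and the always-failing fetch_source stub below are purely illustrative):

```python
import asyncio

async def fetch_source(source: str, query: str) -> list:
    # Stand-in for the real HTTP call to one upstream database.
    raise RuntimeError(f"{source} unavailable")

async def call_actor(source: str, query: str, timeout: float = 180.0) -> list:
    # Timeouts and errors collapse to an empty list, so one dead
    # source shrinks the corpus instead of failing the whole run.
    try:
        return await asyncio.wait_for(fetch_source(source, query), timeout)
    except Exception:
        return []

async def build_network(query: str) -> dict:
    groups = {
        "academic": ["openalex", "crossref", "core", "dblp", "arxiv"],
        "biomedical": ["pubmed", "semanticscholar", "europepmc", "orcid"],
    }
    # Outer gather: latency is bounded by the slowest group, not the sum.
    per_group = await asyncio.gather(*[
        asyncio.gather(*[call_actor(s, query) for s in sources])
        for sources in groups.values()
    ])
    return dict(zip(groups.keys(), per_group))

network = asyncio.run(build_network("ego depletion"))
```

Because every failure collapses to an empty list, the assembled network is simply whatever the healthy sources returned.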
Phase 2 — Research network assembly
buildResearchNetwork() normalizes all 16 source schemas into a unified graph of ResearchNode and ResearchEdge objects. Node types are: paper, author, journal, institution, dataset, grant, trial. Edge types are: cites, authored, published_in, funded_by, replicates, retracts. Deduplication uses a Set of node IDs derived from source-prefixed identifiers (e.g., oalex-W2741809809, pubmed-38291045). Citation edges are inferred from co-occurrence proximity in the sorted paper list using a seeded deterministic hash function for reproducibility.
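A minimal Python sketch of that dedup step (the real implementation is JavaScript; the record shape here is an assumption for illustration):

```python
def dedupe_nodes(records: list) -> list:
    # Node identity is the source-prefixed ID, e.g. "oalex-W2741809809",
    # so the same record fetched twice from one source collapses to one node.
    seen = set()
    nodes = []
    for rec in records:
        node_id = f"{rec['source']}-{rec['id']}"
        if node_id not in seen:
            seen.add(node_id)
            nodes.append({"id": node_id, "type": rec.get("type", "paper")})
    return nodes

papers = [
    {"source": "oalex", "id": "W2741809809"},
    {"source": "pubmed", "id": "38291045"},
    {"source": "oalex", "id": "W2741809809"},  # duplicate fetch
]
unique = dedupe_nodes(papers)
```

Note that because IDs are source-prefixed, the same paper retrieved from two different sources remains two distinct nodes under this scheme.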
Phase 3 — Algorithm execution
Each of the 8 tools runs its forensic algorithm against the assembled network:
- GRIM/SPRITE: computes mean × N and sum-of-squares consistency for every paper node with a sampleSize; applies Benford chi-squared across the full corpus
- P-curve/Z-curve: extracts p < 0.05 values from paper metadata, applies Stouffer's method for right-skew, KS test for flatness, then runs 20 iterations of 3-component EM with the Beasley-Springer-Moro normal quantile approximation
- Vevea-Hedges: simulates Cohen's d from paper citation signals, assigns step-function publication weights, computes DerSimonian-Laird Q-statistic and tau-squared, then recomputes weighted pooled estimate under selection correction
- TERGM: builds citation adjacency map in O(E), traverses mutual-citation pairs, computes per-author self-citation rates via edge traversal, runs clustering coefficient calculation on first 30 paper nodes
- Benford DCT + MinHash: applies first-digit analysis to citation counts and sample sizes, estimates DCT anomaly via seeded simulation, computes pairwise Jaccard estimates via title hash comparison
- Do-calculus + Dirichlet CRP: performs BFS from high-severity TERGM nodes through the citation graph, checks d-separation heuristically, clusters paths by contamination strength
- Bayesian surprise: applies normal-normal conjugate update using citation-count-derived statistics as the likelihood, computes KL divergence against the standard normal prior
- Platt scaling fixed-point: iterates calibration logistic fits across all 7 detector outputs, converges when calibration slope stabilizes
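As a concrete illustration of the first bullet, the core GRIM question (is a reported mean arithmetically possible for integer data of size N?) fits in a few lines of Python; this is a simplification, not the server's code:

```python
def grim_consistent(reported_mean: float, n: int, decimals: int = 2) -> bool:
    # A mean of n integers must equal k/n for some integer k; the check
    # asks whether any nearby k rounds to the reported mean.
    candidate = round(reported_mean * n)
    return any(
        round(k / n, decimals) == round(reported_mean, decimals)
        for k in range(candidate - 1, candidate + 2)
    )

grim_consistent(3.48, 25)  # 87 / 25 = 3.48, arithmetically possible
grim_consistent(3.47, 25)  # no integer sum over 25 values yields 3.47
```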
Phase 4 — Standby mode delivery
The server runs in Apify Standby mode on an Express HTTP server. Each POST to /mcp instantiates a fresh McpServer and StreamableHTTPServerTransport, processes the request, and disposes on connection close. This stateless-per-request design ensures isolation between concurrent sessions from different users.
Tips for best results
- Use author name + institution for targeted audits. A query like "Brian Wansink Cornell food psychology" produces a tightly scoped paper corpus where GRIM/SPRITE flags are highly relevant. A generic query like "nutrition psychology" produces a broader corpus better suited for p-curve field-level analysis.
- Chain tools in the correct order for a full audit. Start with analyze_p_curve_z_curve to assess field-level signal, then fit_selection_model_meta_analysis for a publication-bias-corrected effect estimate, then detect_citation_network_anomalies to check for cartel amplification of that effect.
- Interpret EDR and ERR together. An EDR below 0.30 indicates a literature where fewer than 30% of all tests run would be expected to yield a significant result; an ERR below 0.40 indicates that fewer than 40% of the published significant results would be expected to replicate. ERR below EDR is impossible under the model and signals data quality issues in the corpus.
- GRIM violations require integer-data context. GRIM failures are only meaningful for Likert-scale or other integer-constrained data. A GRIM_fail on a continuous measurement scale is a false positive — check the paper's measurement instrument before drawing conclusions.
- High citationGini (> 0.70) is a red flag but not proof of manipulation. Natural winner-take-all citation dynamics can produce Gini > 0.65 in competitive fields. Combine with detect_citation_network_anomalies TERGM coefficients for a stronger signal.
- Run self_calibrate_detection last, after other tools. This tool assesses the pipeline's reliability based on the same corpus. A high overallBrier (> 0.15) on a specific query indicates the corpus lacked the statistical signal density needed for reliable detection — treat other results from that query with additional caution.
- Schedule weekly monitoring via Apify. Research integrity signals evolve as papers get retracted and citations accumulate. Scheduling a weekly detect_citation_network_anomalies call on a journal or author produces a time series that is far more diagnostic than a one-time snapshot.
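The chained audit order described in the tips above can be scripted against the /mcp endpoint. This sketch only builds the JSON-RPC bodies; POST each one with your Bearer token as in the HTTP examples earlier (the query string is a placeholder):

```python
def rpc_body(call_id: int, tool: str, query: str) -> dict:
    # One JSON-RPC 2.0 tools/call envelope per forensic tool.
    return {
        "jsonrpc": "2.0",
        "id": call_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": {"query": query}},
    }

audit_order = [
    "analyze_p_curve_z_curve",            # 1. field-level signal
    "fit_selection_model_meta_analysis",  # 2. bias-corrected effect
    "detect_citation_network_anomalies",  # 3. cartel amplification
]
bodies = [rpc_body(i + 1, tool, "nutrition psychology")
          for i, tool in enumerate(audit_order)]
```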
Combine with other Apify actors
| Actor | How to combine |
|---|---|
| Company Deep Research | After identifying a high-fraud-signal author, run company deep research on their affiliated institution or spin-off to check for financial conflicts of interest |
| Website Content to Markdown | Convert retraction notices, PubPeer comment threads, or journal editorial notices to structured markdown for downstream LLM analysis |
| WHOIS Domain Lookup | Verify the provenance of predatory journals flagged in citation analysis by checking domain registration dates and registrant information |
| Trustpilot Review Analyzer | Cross-reference journal or publisher reputations in community review databases when citation anomalies point to specific venues |
| Website Change Monitor | Monitor a flagged journal's website or retraction database for newly posted retractions related to papers identified in the citation network analysis |
| B2B Lead Qualifier | For research integrity consultancies, qualify leads by cross-referencing company research spending and prior audit history |
| Multi-Review Analyzer | Scrape and analyze community reviews of flagged researchers or institutions across multiple platforms for corroborating signals |
Limitations
- GRIM and SPRITE only apply to integer-constrained data. They are not valid for continuous measurements, percentages, or log-transformed values. Applying them to non-integer data will produce misleading flags.
- P-curve and z-curve require at least 10 significant p-values for reliable shape inference. Single-paper queries or narrow topics with few published results produce low-power analyses with wide confidence intervals on EDR/ERR.
- Citation network assembly uses proximity-based edge inference. In the absence of full citation metadata from the underlying databases, citation edges are inferred from co-occurrence order in the assembled corpus. This is a heuristic approximation, not a true bibliometric citation graph.
- Benford DCT forensics is a screening signal, not evidence of manipulation. The DCT anomaly values in this server are simulated from corpus metadata, not computed from actual image files. Use this tool to triage, not to conclude.
- MinHash similarity uses title-level hashing only. Full content-level plagiarism detection requires the full paper text, which is not retrieved by this pipeline. False negatives (missed duplicates with different titles) are common.
- Do-calculus d-separation is approximate. The identifiability check uses a heuristic BFS-based assessment, not a full Pearl do-calculus solver. Non-identifiable paths may be incorrectly marked identifiable.
- The server cannot access paywalled full text. All 16 data sources are public APIs and open access repositories. Papers available only behind institutional journal subscriptions are not included in the analysis corpus.
- Self-calibration is circular on small corpora. With fewer than ~30 papers in the assembled network, the fixed-point calibration loop converges to trivial solutions. Gödelian self-reference depth will report low values (1-2) rather than meaningful convergence diagnostics.
Integrations
- Zapier — trigger a fraud screen automatically when a new paper is flagged in a journal watch list
- Make — build multi-step workflows that route high-severity GRIM flags to a review queue in Notion or Airtable
- Google Sheets — pipe citation anomaly scores and HARKing probabilities into a running spreadsheet tracker for a watchlisted field
- Apify API — trigger any of the 8 tools from Python, JavaScript, or any HTTP client as part of a larger research pipeline
- Webhooks — receive a POST notification when a tool call completes or when a run produces anomaly scores above a configurable threshold
- LangChain / LlamaIndex — connect this MCP server to an LLM agent that evaluates scientific claims in real time before generating answers
Troubleshooting
- Tool returns very few flags despite expecting anomalies — the assembled corpus may be small. Try a broader query (e.g., the research field rather than a specific paper) to pull more papers across the 16 sources. Queries that return under 20 papers produce statistically underpowered GRIM and p-curve results.
- selectionSeverity is very high (> 5) for every query — this can occur when the corpus contains very few studies with diverse effect sizes, causing the DerSimonian-Laird estimator to produce an unstable tau-squared. This is a data-scarcity artifact, not a real publication bias signal.
- Spending limit reached error — the eventChargeLimitReached flag means your configured per-run spending limit was hit before the tool completed. Increase the limit in your Apify run settings or split your analysis across separate sessions.
- Server returns 405 Method Not Allowed on GET /mcp — the MCP endpoint only accepts POST requests. GET is blocked by design per the MCP protocol specification. Use POST with a JSON-RPC body.
- z-curve EDR greater than 1.0 or negative — this indicates a degenerate EM solution, usually because the corpus p-values are concentrated at a single value or the query returned no significant p-values. Add a broader topic term to diversify the paper corpus.
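A small guard in client code (illustrative Python) can catch that degenerate case before downstream logic consumes the rates:

```python
def zcurve_rates_usable(edr: float, err: float) -> bool:
    # Both rates are proportions; anything outside (0, 1] signals a
    # degenerate EM fit and the query should be broadened.
    return 0.0 < edr <= 1.0 and 0.0 < err <= 1.0
```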
Responsible use
- This server queries only publicly available academic databases, government research portals, and open-access repositories.
- Statistical anomalies identified by these tools are screening signals, not proof of fraud or misconduct. Always apply professional judgment before acting on results.
- Do not use output from this tool to make public accusations without independent verification by a qualified statistician or research integrity professional.
- Comply with the terms of service of each underlying data source (OpenAlex CC0, PubMed public API, Crossref public API, etc.).
- For guidance on responsible use of web scraping and data aggregation, see Apify's guide on web scraping legality.
FAQ
How does scientific fraud detection work in this MCP server? Each tool call assembles a research network from 16 academic data sources queried in parallel, then applies a specific forensic algorithm — GRIM/SPRITE for statistical consistency, p-curve/z-curve EM for p-hacking detection, Vevea-Hedges selection models for publication bias, TERGM for citation manipulation, Benford DCT + MinHash for data forensics, do-calculus for contamination tracing, and Bayesian surprise for HARKing detection.
How accurate is the GRIM test at detecting fabricated data? GRIM has a very low false-positive rate when the underlying data is truly integer-valued (e.g., Likert scales). Brown and Heathers (2017) found GRIM inconsistencies in roughly half of the papers they checked in a sample of social psychology literature. However, GRIM only flags mathematical impossibilities — a fabricator who knows about GRIM can produce GRIM-consistent fake data.
How many papers does each tool call analyze? Each tool call pulls up to 460 raw records across 16 sources (OpenAlex ≤80, PubMed ≤80, Semantic Scholar ≤80, Crossref ≤60, CORE ≤60, DBLP ≤60, arXiv ≤60, Europe PMC ≤60, ORCID ≤40, NIH Grants ≤40, ClinicalTrials.gov ≤40, Wayback Machine ≤30, GitHub ≤30, USPTO ≤30, Hacker News ≤30, website scrape ≤1). After deduplication and network assembly, the analysis typically runs on 50-200 unique paper nodes.
What is the difference between EDR and ERR in the z-curve analysis? Expected Discovery Rate (EDR) estimates the average power of all tested hypotheses, significant or not (the proportion of tests expected to yield a significant result). Expected Replication Rate (ERR) estimates the probability that a randomly selected significant result would replicate in a direct replication study with the same sample size. ERR is typically higher than EDR because it conditions on significance: published significant results are drawn from the higher-powered end of the study distribution.
Can this tool definitively prove scientific fraud? No. These are statistical screening tools that identify patterns consistent with questionable research practices. High GRIM failure rates, flat p-curves, and extreme self-citation scores are red flags that warrant investigation — not proof of misconduct. Final determination requires independent verification by a qualified statistician and institutional investigation.
How is this different from StatCheck or the GRIM test calculator? StatCheck and web-based GRIM calculators require you to paste individual statistics manually or upload a single paper. This MCP server assembles a full paper corpus from 16 live databases, applies GRIM/SPRITE/Benford analysis across the entire corpus simultaneously, and integrates with the 7 other forensic tools — all in a single tool call from your AI assistant or API client.
Is it legal to use this tool to audit published research? All 16 data sources are publicly available academic databases and government portals. Analyzing publicly available bibliometric data for research integrity purposes is a well-established, widely accepted practice, though this is not legal advice. For guidance on data access and scraping legality, see Apify's guide.
How long does a typical tool call take?
Each tool call fires 16 actors in parallel groups, with a 180-second timeout per actor. Typical wall-clock time is 30-90 seconds, depending on database response times and corpus size. The outer Promise.all groups bound total time to the slowest parallel group rather than the sum of all actors.
Can I schedule scientific fraud detection to run automatically? Yes. Use Apify's built-in scheduler to trigger tool calls at any interval — daily journal monitoring, weekly author tracking, or monthly field-level assessments. Combine with webhooks to route results to Slack, email, or a Notion database automatically.
What happens if one of the 16 data sources is unavailable? Each actor call is wrapped in a try/catch that returns an empty array on failure. The network assembly proceeds with whatever data was successfully retrieved. A single source going down does not crash the analysis — it just reduces the paper corpus size and may lower the statistical power of the forensic algorithms.
How does the self-calibration tool work?
self_calibrate_detection runs all 7 detectors against the same corpus, then applies Platt scaling (logistic calibration) to each detector's scores using the other detectors as a pseudo-ground-truth reference. The fixed-point loop iterates until calibration slopes stabilize. The Gödel self-reference depth metric reports the recursion level at which convergence occurred — higher values indicate a more complex calibration landscape.
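Platt scaling itself is a two-parameter logistic fit; a minimal Python version, shown here only to illustrate the idea and not the server's actual implementation:

```python
import math

def platt_fit(scores, labels, lr=0.5, steps=2000):
    # Fit p = sigmoid(a * score + b) to binary labels by gradient descent
    # on the logistic loss.
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s
            grad_b += (p - y)
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

# Toy data: higher raw detector scores correspond to true positives.
a, b = platt_fit([-2.0, -1.0, 1.0, 2.0], [0, 0, 1, 1])
```

In the server's fixed-point variant, the "labels" are replaced by the consensus of the other detectors, and the fit is re-run until the slope a stabilizes.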
Can I use this with any MCP-compatible AI assistant?
Yes. This server implements the MCP protocol over HTTP (Streamable HTTP transport) and is compatible with Claude Desktop, Cursor, Windsurf, and any other client that supports the MCP tools/call method. The endpoint is https://scientific-fraud-detection-mcp.apify.actor/mcp.
Help us improve
If you encounter issues, you can help us debug faster by enabling run sharing in your Apify account:
- Go to Account Settings > Privacy
- Enable Share runs with public Actor creators
This lets us see your run details when something goes wrong, so we can fix issues faster. Your data is only visible to the actor developer, not publicly.
Support
Found a bug or have a feature request? Open an issue in the Issues tab on this actor's page. For custom research integrity pipelines, extended data source integrations, or enterprise use cases, reach out through the Apify platform.
How it works
Configure
Set your parameters in the Apify Console or pass them via API.
Run
Click Start, trigger via API, webhook, or set up a schedule.
Get results
Download as JSON, CSV, or Excel. Integrate with 1,000+ apps.
Related actors
Bulk Email Verifier
Verify email deliverability at scale. MX record validation, SMTP mailbox checks, disposable and role-based detection, catch-all flagging, and confidence scoring. No external API costs.
GitHub Repository Search
Search GitHub repositories by keyword, language, topic, stars, forks. Sort by stars, forks, or recently updated. Returns metadata, topics, license, owner info, URLs. Free API, optional token for higher limits.
Website Content to Markdown
Convert any website to clean Markdown for RAG pipelines, LLM training, and AI apps. Crawls pages, strips boilerplate, preserves headings, tables, and code blocks. GFM support.
Website Tech Stack Detector
Detect 100+ web technologies on any website. Identifies CMS, frameworks, analytics, marketing tools, chat widgets, CDNs, payment systems, hosting, and more. Batch-analyze multiple sites with version detection and confidence scoring.
Ready to try Scientific Fraud Detection MCP?
Start for free on Apify. No credit card required.
Open on Apify Store