ApifyForge A/B Tester

Choose the right Apify actor with real data

ApifyForge A/B Tester runs two Apify actors on identical input, N iterations each, and returns a production decision — not a vibe. Branch on decisionPosture (switch_now / canary_recommended / monitor_only / no_call) in your CI pipeline or agent loop. $0.30 per run.

Single-run comparisons swing 30%+ on network variance alone. A/B Tester aggregates median and p90 across 3–10 runs, detects flakiness and output-shape divergence, and collapses winner + confidence + stability into one routable enum.

What ApifyForge A/B Tester produces

decisionPosture enum

One routable field collapses winner, confidence, stability, and materiality: switch_now (commit), canary_recommended (10% traffic), monitor_only (do not switch), no_call (insufficient evidence). Branch on this — never parse prose.
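A minimal routing sketch of that branch, in Python. The enum values match the documentation above; the action strings and the `route` helper are hypothetical, not part of the actor's API:

```python
# Hypothetical CI/agent routing on the decisionPosture enum.
# Enum values are from the docs; the action strings are illustrative only.
POSTURE_ACTIONS = {
    "switch_now": "promote winner to 100% of traffic",
    "canary_recommended": "route 10% of traffic to winner",
    "monitor_only": "keep current actor, log the result",
    "no_call": "keep current actor, schedule a larger test",
}

def route(comparison: dict) -> str:
    """Branch on the enum, never on verdictHuman prose."""
    posture = comparison["decisionPosture"]
    if posture not in POSTURE_ACTIONS:
        # Unknown enum value: fail closed rather than guessing.
        raise ValueError(f"unknown decisionPosture: {posture!r}")
    return POSTURE_ACTIONS[posture]

print(route({"decisionPosture": "canary_recommended"}))
```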

Multi-run aggregation

Median, p90, and variance across 3–10 runs per actor. Excludes failed runs from rate calculations and surfaces them as advisory warnings. More runs tighten the confidence band.
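The aggregation can be sketched like this. The nearest-rank percentile convention is an assumption; the actor's exact percentile method is not documented:

```python
import math
import statistics

def p90(values):
    """Nearest-rank 90th percentile; one common convention (the actor's
    exact method is not documented, so treat this as a sketch)."""
    ordered = sorted(values)
    rank = math.ceil(0.9 * len(ordered)) - 1
    return ordered[rank]

# Five hypothetical per-run durations (seconds) for one actor
durations = [17.8, 18.2, 16.9, 21.4, 18.0]
print(statistics.median(durations))  # 18.0
print(p90(durations))                # 21.4: one slow run dominates the tail
```

The gap between median and p90 is exactly what a single run hides.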

Cost-per-result comparison

Per-run and per-result cost deltas — not just raw run cost. An actor that costs more per run but produces 3x the results may still win on cost-per-result.
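A worked example of that claim, with hypothetical numbers (the function name is illustrative, not the actor's API):

```python
def cost_per_result(median_cost_usd: float, median_results: int) -> float:
    return median_cost_usd / median_results

# Hypothetical: actor A costs more per run but yields 3x the results
a = cost_per_result(0.24, 60)  # $0.004 per result
b = cost_per_result(0.10, 20)  # $0.005 per result
winner = "actorA" if a < b else "actorB"
print(winner)  # actorA wins on cost-per-result despite the higher run cost
```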

Output shape divergence detection

Jaccard similarity on output field sets. Low overlap raises a RESULT_SHAPE_DIVERGENCE blocking warning so you don't treat two actors as interchangeable when their output shapes disagree.
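Jaccard similarity is simply intersection over union of the two field sets. A sketch with hypothetical field names; the 0.5 cutoff is an assumption, as the actor's real threshold is not documented:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two field sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

fields_a = {"url", "title", "price", "currency"}
fields_b = {"url", "title", "cost"}

similarity = jaccard(fields_a, fields_b)  # 2 shared / 5 total = 0.4
THRESHOLD = 0.5  # hypothetical cutoff; the actor's real threshold is undocumented
if similarity < THRESHOLD:
    print("RESULT_SHAPE_DIVERGENCE")
```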

Decision stability score

Pairwise matchups across run subsets check whether the winner holds under different samples. Low winnerConsistency flags UNSTABLE_WINNER — the aggregate verdict may flip with different runs.
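One way to picture winnerConsistency, using all-pairs matchups on cost. The all-pairs scheme and the numbers are assumptions; the actor's actual subset construction is not documented:

```python
from itertools import product

def winner_consistency(costs_a, costs_b):
    """Fraction of per-run matchups in which actorB (the aggregate winner
    here) is also cheaper in the individual pairing. All-pairs matchups
    are an assumption, not the actor's documented scheme."""
    matchups = list(product(costs_a, costs_b))
    b_wins = sum(1 for a, b in matchups if b < a)
    return b_wins / len(matchups)

costs_a = [0.30, 0.32, 0.29, 0.31, 0.33]
costs_b = [0.12, 0.11, 0.14, 0.35, 0.12]  # one outlier run at $0.35
print(winner_consistency(costs_a, costs_b))  # 0.8: B holds in 20 of 25 pairings
```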

Fairness guarantees

Parallel launch and clock-skew detection (childRunStartSpreadSec). An unfair launch raises FAIRNESS_VIOLATION and forces the verdict to monitor_only regardless of confidence.
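A sketch of that guard. The 5-second limit, the offsets, and the helper are hypothetical; only the field and flag names come from the docs:

```python
def fairness_check(start_offsets_sec, max_spread_sec=5.0):
    """Clock-skew check: if child runs did not start close together, flag a
    fairness violation and force monitor_only. The 5-second limit is an
    assumption; the actor's real threshold is not documented."""
    spread = max(start_offsets_sec) - min(start_offsets_sec)
    if spread > max_spread_sec:
        return {"childRunStartSpreadSec": spread,
                "warning": "FAIRNESS_VIOLATION",
                "decisionPosture": "monitor_only"}
    return {"childRunStartSpreadSec": spread,
            "warning": None,
            "decisionPosture": None}

# One child run launched 12 seconds late (hypothetical offsets)
print(fairness_check([0.0, 0.4, 0.7, 12.1])["decisionPosture"])  # monitor_only
```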

Actor comparison methods compared

| Method | Reliability | Machine-readable verdict | Cost per comparison |
| --- | --- | --- | --- |
| ApifyForge A/B Tester (5 runs) | High — median + p90 + stability | decisionPosture + confidenceLevel | $1.50 (5 × $0.30) |
| Single-run eyeball | Low — 30%+ variance | None | Actor run cost only |
| Custom benchmark script | Medium — depends on rigor | Whatever you build | 4–12 hours engineering |
| Apify Store star ratings | Very low — reputation signal | None | Free |

Example ApifyForge A/B Tester output

{
  "comparison": {
    "winner": "actorB",
    "decisionPosture": "canary_recommended",
    "confidenceLevel": "medium",
    "recommendationLevel": "directional",
    "runsPerActor": 5,
    "materiality": "material",
    "verdictCode": "B_WINS_COST_EFFICIENCY",
    "verdictHuman": "actorB wins on cost-per-result ($0.12 vs $0.31) with medium confidence — canary rollout recommended.",
    "decisionStability": { "flipRisk": "low", "winnerConsistency": 0.9, "pairwiseTotal": 10 }
  },
  "actorA": { "successfulRuns": 5, "medianResults": 42, "medianCostUsd": 0.31, "p90DurationSec": 18.2 },
  "actorB": { "successfulRuns": 5, "medianResults": 38, "medianCostUsd": 0.12, "p90DurationSec": 22.1 },
  "warnings": []
}

Branch on decisionPosture. Never parse verdictHuman — the wording may change between versions.

How ApifyForge A/B Tester works

1. Pick two Apify actors and a shared input payload (3–10 runs per actor)

2. A/B Tester launches both in parallel, aggregates median + p90 + stability, and runs fairness checks

3. Returns decisionPosture + verdictCode + warnings — ready for CI or agent routing

Limitations

  1. Two actors only. A/B Tester compares exactly two actors — no multi-actor ranking or portfolio analysis. For fleet-wide scoring use Quality Monitor.
  2. Shared input required. Both actors must accept the same input JSON. Fundamentally different input contracts produce misleading verdicts.
  3. No semantic output validation. A/B Tester counts results and measures cost — it does not judge whether the content of each actor's output is correct. Pair with Schema Validator for output quality checks.
  4. Single-run mode is blocked from actionable verdicts. Fewer than 3 runs returns no_call with SMOKE_TEST_ONLY. Design choice — single-run evidence is not statistically safe.
  5. Requires Apify account. Both target actors run on your own Apify account at their native run cost, plus $0.30 per A/B Tester invocation.

What ApifyForge A/B Tester costs

$0.30 per A/B test run charged to your own Apify account via pay-per-event. Plus the native run cost of whichever two actors you're comparing (3–10 runs each). A typical production-grade test: 5 runs × 2 actors × $0.05 average = $0.50 actor compute + $1.50 A/B Tester = ~$2.00 total for a statistically sound verdict.

Frequently asked questions

Why not just run both actors once and compare?

Single-run comparisons are statistically unreliable. Network variance, rate-limit state, and target-site volatility can swing a single run by 30% or more. ApifyForge A/B Tester runs each actor at least 3 times (ideally 5+) and aggregates median and p90 performance, so the returned decisionPosture reflects real relative performance — not noise.

What is decisionPosture and why should I branch on it?

decisionPosture is the routable control signal for automation: switch_now (commit to the winner), canary_recommended (partial rollout — 10% traffic), monitor_only (directional result, do not switch), no_call (insufficient or unreliable evidence). It collapses winner, confidence, stability, and materiality into one enum your CI pipeline or agent can act on directly, without parsing the verdictHuman prose. Prose copy changes between versions; decisionPosture is a stable contract.

How does A/B Tester handle actor failures during the test?

Failed runs are recorded as stats.status = FAILED_TO_START, TIMED_OUT, or ERROR. Results from failed runs are excluded from the median/p90 calculations. If one actor fails every run while the other succeeds, a warning flag (ONE_SIDE_FAILED, advisory severity) is attached to the comparison — the verdict stands but signals that the losing actor may just be misconfigured for this input.
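The exclusion logic can be sketched like this. The run dicts and the helper are hypothetical; only the failure status names come from the documentation above:

```python
import statistics

FAILED_STATUSES = {"FAILED_TO_START", "TIMED_OUT", "ERROR"}

def aggregate_costs(runs):
    """Median cost over successful runs only; failed runs are surfaced as
    advisory warnings instead of skewing the rate calculations."""
    ok = [r["costUsd"] for r in runs if r["status"] not in FAILED_STATUSES]
    warnings = [r["status"] for r in runs if r["status"] in FAILED_STATUSES]
    median = statistics.median(ok) if ok else None
    return median, warnings

runs = [
    {"status": "SUCCEEDED", "costUsd": 0.30},
    {"status": "SUCCEEDED", "costUsd": 0.31},
    {"status": "TIMED_OUT", "costUsd": 0.05},  # excluded from the median
    {"status": "SUCCEEDED", "costUsd": 0.29},
]
print(aggregate_costs(runs))  # (0.3, ['TIMED_OUT'])
```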

What happens when the two actors produce different output shapes?

ApifyForge A/B Tester computes Jaccard similarity on the output field sets. When shape overlap falls below a threshold, a RESULT_SHAPE_DIVERGENCE blocking warning is attached — because comparing results/run or cost/result between fundamentally different output shapes is misleading. Review sample records side by side before trusting the verdict in that case.

How many runs should I use?

3 runs minimum for a smoke test, 5+ for a production decision. Below 3 the tool returns decisionPosture: no_call with a SMOKE_TEST_ONLY flag — you cannot get an actionable verdict. More runs tighten the confidence band and detect flakiness. Each additional run is one additional $0.30 charge.

Can I A/B test my own actor against a competitor's?

Yes, with caveats. Both actors must accept the same input JSON (shared input schema). If they accept different inputs, you are comparing apples and oranges — the decisionPosture may be statistically sound but practically useless. A fairness check (parallelLaunch, childRunStartSpreadSec) guards against clock skew between the two actor launches.