Choose the right Apify actor with real data
ApifyForge A/B Tester runs two Apify actors on identical input, N iterations each, and returns a production decision — not a vibe. Branch on decisionPosture (switch_now / canary_recommended / monitor_only / no_call) in your CI pipeline or agent loop. $0.30 per run.
Single-run comparisons swing 30%+ on network variance alone. A/B Tester aggregates median and p90 across 3–10 runs, detects flakiness and output-shape divergence, and collapses winner + confidence + stability into one routable enum.
One routable field collapses winner, confidence, stability, and materiality: switch_now (commit), canary_recommended (10% traffic), monitor_only (do not switch), no_call (insufficient evidence). Branch on this — never parse prose.
Median, p90, and variance across 3–10 runs per actor. Excludes failed runs from rate calculations and surfaces them as advisory warnings. More runs tighten the confidence band.
Per-run and per-result cost deltas — not just raw run cost. An actor that costs more per run but produces 3x the results may still win on cost-per-result.
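The cost-per-result arithmetic behind that claim is simple division. A minimal sketch (function and figures are illustrative, not the tool's internals):

```python
def cost_per_result(median_cost_usd: float, median_results: int) -> float:
    """Cost efficiency: dollars spent per result produced."""
    return median_cost_usd / median_results

# Actor A: cheaper per run but fewer results.
# Actor B: pricier per run but 3x the output, so it wins per result.
a = cost_per_result(0.10, 20)   # 0.005 USD per result
b = cost_per_result(0.24, 60)   # 0.004 USD per result
```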
Jaccard similarity on output field sets. Low overlap raises a RESULT_SHAPE_DIVERGENCE blocking warning so you don't treat two actors as interchangeable when their output shapes disagree.
Pairwise matchups across run subsets check whether the winner holds under different samples. Low winnerConsistency flags UNSTABLE_WINNER — the aggregate verdict may flip with different runs.
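One way to picture the pairwise-matchup idea (this is a simplified sketch of the concept, not the tool's actual algorithm): match every run of A against every run of B and measure what fraction of matchups the aggregate winner takes.

```python
from itertools import product

def winner_consistency(results_a, results_b, overall_winner="B"):
    """Fraction of pairwise run matchups won by the aggregate winner.
    Each run of A is matched against each run of B on result count."""
    matchups = list(product(results_a, results_b))
    wins = sum(1 for a, b in matchups
               if (b > a) == (overall_winner == "B"))
    return wins / len(matchups), len(matchups)

# B wins on aggregate, but one outlier run (30 results) loses every
# matchup it appears in, pulling consistency down to 6/9.
consistency, total = winner_consistency([40, 42, 41], [55, 58, 30])
```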
Parallel launch and clock-skew detection (childRunStartSpreadSec). Unfair launches raise FAIRNESS_VIOLATION and force the verdict to monitor_only regardless of confidence.
| Method | Reliability | Machine-readable verdict | Cost per comparison |
|---|---|---|---|
| ApifyForge A/B Tester (5 runs) | High — median + p90 + stability | decisionPosture + confidenceLevel | $1.50 (5 × $0.30) |
| Single-run eyeball | Low — 30%+ variance | None | Actor run cost only |
| Custom benchmark script | Medium — depends on rigor | Whatever you build | 4–12 hours engineering |
| Apify Store star ratings | Very low — reputation signal | None | Free |
```json
{
  "comparison": {
    "winner": "actorB",
    "decisionPosture": "canary_recommended",
    "confidenceLevel": "medium",
    "recommendationLevel": "directional",
    "runsPerActor": 5,
    "materiality": "material",
    "verdictCode": "B_WINS_COST_EFFICIENCY",
    "verdictHuman": "actorB wins on cost-per-result ($0.12 vs $0.31) with medium confidence — canary rollout recommended.",
    "decisionStability": { "flipRisk": "low", "winnerConsistency": 0.9, "pairwiseTotal": 10 }
  },
  "actorA": { "successfulRuns": 5, "medianResults": 42, "medianCostUsd": 0.31, "p90DurationSec": 18.2 },
  "actorB": { "successfulRuns": 5, "medianResults": 38, "medianCostUsd": 0.12, "p90DurationSec": 22.1 },
  "warnings": []
}
```

Branch on decisionPosture. Never parse verdictHuman — the wording may change between versions.
1. Pick two Apify actors and a shared input payload (3–10 runs per actor).
2. A/B Tester launches both in parallel, aggregates median + p90 + stability, and runs fairness checks.
3. Returns decisionPosture + verdictCode + warnings — ready for CI or agent routing.
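A minimal calling sketch using the official Apify Python client. The A/B Tester's actor ID and input field names below are placeholders, not the real schema; check the actor's input schema before wiring this up.

```python
def build_ab_input(actor_a: str, actor_b: str, shared_input: dict, runs: int = 5) -> dict:
    """Assemble an A/B test payload. Field names are illustrative;
    consult the actor's published input schema for the real ones."""
    assert 3 <= runs <= 10, "3-10 runs per actor"
    return {
        "actorA": actor_a,
        "actorB": actor_b,
        "sharedInput": shared_input,
        "runsPerActor": runs,
    }

run_input = build_ab_input(
    "vendor/scraper-v1", "vendor/scraper-v2",
    {"startUrls": [{"url": "https://example.com"}]},
)

# Actual invocation (requires the apify-client package and a token):
# from apify_client import ApifyClient
# client = ApifyClient("<APIFY_TOKEN>")
# run = client.actor("<ab-tester-actor-id>").call(run_input=run_input)
# verdict = client.dataset(run["defaultDatasetId"]).list_items().items[0]
```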
Below the 3-run minimum, the tool returns no_call with the SMOKE_TEST_ONLY flag. This is a design choice: single-run evidence is not statistically safe.

$0.30 per A/B test run is charged to your own Apify account via pay-per-event, plus the native run cost of whichever two actors you're comparing (3–10 runs each). A typical production-grade test: 5 runs × 2 actors × $0.05 average = $0.50 actor compute + $1.50 A/B Tester = ~$2.00 total for a statistically sound verdict.
Single-run comparisons are statistically unreliable. Network variance, rate-limit state, and target-site volatility can swing a single run by 30% or more. ApifyForge A/B Tester runs each actor at least 3 times (ideally 5+) and aggregates median and p90 performance, so the returned decisionPosture reflects real relative performance — not noise.
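Median and p90 can be sketched in a few lines. The exact percentile method the tool uses is not documented here; nearest-rank is one common choice:

```python
import math
import statistics

def p90(values):
    """90th percentile via the nearest-rank method on a sorted sample."""
    s = sorted(values)
    return s[math.ceil(0.9 * len(s)) - 1]

durations = [14.1, 15.0, 15.2, 16.8, 18.2]  # seconds, one actor's 5 runs
typical = statistics.median(durations)  # 15.2: the run you usually get
tail = p90(durations)                   # 18.2: the slow-tail run
```

The median resists the single-run outliers that make one-off comparisons swing; p90 keeps the slow tail visible instead of averaging it away.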
decisionPosture is the routable control signal for automation: switch_now (commit to the winner), canary_recommended (partial rollout — 10% traffic), monitor_only (directional result, do not switch), no_call (insufficient or unreliable evidence). It collapses winner, confidence, stability, and materiality into one enum your CI pipeline or agent can act on directly, without parsing the verdictHuman prose. Prose copy changes between versions; decisionPosture is stable contract.
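A minimal routing sketch: the four enum values are from the output contract above, while the actions mapped to them are illustrative and would be whatever your pipeline does.

```python
def route(posture: str) -> str:
    """Map decisionPosture to a deployment action. The enum values come
    from the A/B Tester output; the actions here are illustrative."""
    actions = {
        "switch_now": "promote winner to 100% of traffic",
        "canary_recommended": "route 10% of traffic to winner",
        "monitor_only": "keep current actor, log the comparison",
        "no_call": "keep current actor, schedule a larger test",
    }
    if posture not in actions:
        raise ValueError(f"unknown decisionPosture: {posture}")
    return actions[posture]

route("canary_recommended")
```

Raising on an unknown value is deliberate: if a future version adds a posture, you want a loud failure, not a silent default switch.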
Failed runs are recorded as stats.status = FAILED_TO_START, TIMED_OUT, or ERROR. Results from failed runs are excluded from the median/p90 calculations. If one actor fails every run while the other succeeds, a warning flag (ONE_SIDE_FAILED, advisory severity) is attached to the comparison — the verdict stands but signals that the losing actor may just be misconfigured for this input.
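The exclusion logic is a simple filter. A sketch of the idea (the FAILED_TO_START / TIMED_OUT / ERROR statuses are from the description above; "SUCCEEDED" and the field names are assumed):

```python
EXCLUDED = {"FAILED_TO_START", "TIMED_OUT", "ERROR"}

def successful_costs(runs):
    """Keep only succeeded runs' costs for median/p90 aggregation;
    failed runs are counted separately, never averaged in."""
    kept = [r["costUsd"] for r in runs if r["status"] not in EXCLUDED]
    failed = len(runs) - len(kept)
    return kept, failed

runs = [
    {"status": "SUCCEEDED", "costUsd": 0.11},
    {"status": "TIMED_OUT", "costUsd": 0.02},
    {"status": "SUCCEEDED", "costUsd": 0.13},
]
kept, failed = successful_costs(runs)  # kept=[0.11, 0.13], failed=1
```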
ApifyForge A/B Tester computes Jaccard similarity on the output field sets. When shape overlap falls below a threshold, a RESULT_SHAPE_DIVERGENCE blocking warning is attached — because comparing results/run or cost/result between fundamentally different output shapes is misleading. Review sample records side by side before trusting the verdict in that case.
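Jaccard similarity on field sets is intersection over union. A sketch (the 0.5 threshold below is illustrative; the tool's real cutoff is not documented here):

```python
def jaccard(fields_a: set, fields_b: set) -> float:
    """Jaccard similarity of two output field sets: |A ∩ B| / |A ∪ B|."""
    if not fields_a and not fields_b:
        return 1.0
    return len(fields_a & fields_b) / len(fields_a | fields_b)

a = {"url", "title", "price", "currency"}
b = {"url", "title", "description"}
similarity = jaccard(a, b)  # 2 shared of 5 distinct fields = 0.4

SHAPE_THRESHOLD = 0.5  # illustrative cutoff
shape_warnings = ["RESULT_SHAPE_DIVERGENCE"] if similarity < SHAPE_THRESHOLD else []
```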
3 runs minimum for a smoke test, 5+ for a production decision. Below 3 the tool returns decisionPosture: no_call with SMOKE_TEST_ONLY flag — you cannot get an actionable verdict. More runs tighten the confidence band and detect flakiness. Each additional run is one additional $0.30 charge.
Yes, with caveats. Both actors must accept the same input JSON (shared input schema). If they accept different inputs, you are comparing apples and oranges — the decisionPosture may be statistically sound but practically useless. A fairness check (parallelLaunch, childRunStartSpreadSec) guards against clock skew between the two actor launches.
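The clock-skew check can be pictured as the spread between the earliest and latest child-run start across both actors. A sketch of the concept (the 30-second limit and the forced-posture rule's exact mechanics are assumptions; only childRunStartSpreadSec and the monitor_only override come from the description above):

```python
def start_spread_sec(start_times_a, start_times_b):
    """childRunStartSpreadSec-style check: spread between earliest and
    latest child-run start across both actors (epoch seconds)."""
    all_starts = list(start_times_a) + list(start_times_b)
    return max(all_starts) - min(all_starts)

SPREAD_LIMIT_SEC = 30  # illustrative threshold

# One of B's runs started 46 s after the first launch: unfair comparison.
spread = start_spread_sec([100.0, 101.2], [100.4, 146.0])
posture = "monitor_only" if spread > SPREAD_LIMIT_SEC else None
```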