ApifyForge A/B Tester

Choose the right Apify actor with real data

ApifyForge A/B Tester runs two Apify actors on identical input, N iterations each, and returns a production decision — not a vibe. Branch on decisionPosture (switch_now / canary_recommended / monitor_only / no_call) in your CI pipeline or agent loop. $0.30 per run.

Single-run comparisons swing 30%+ on network variance alone. A/B Tester aggregates median and p90 across 3–10 runs, detects flakiness and output-shape divergence, and collapses winner + confidence + stability into one routable enum.

What ApifyForge A/B Tester produces

decisionPosture enum

One routable field collapses winner, confidence, stability, and materiality: switch_now (commit), canary_recommended (10% traffic), monitor_only (do not switch), no_call (insufficient evidence). Branch on this — never parse prose.
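A minimal routing sketch of that branch, in Python. The enum values match the documentation above; the action strings and the `route` helper are hypothetical, not part of the actor's API:

```python
# Hypothetical CI/agent routing on the decisionPosture enum.
# Enum values are from the docs; the action strings are illustrative only.
POSTURE_ACTIONS = {
    "switch_now": "promote winner to 100% of traffic",
    "canary_recommended": "route 10% of traffic to winner",
    "monitor_only": "keep current actor, log the result",
    "no_call": "keep current actor, schedule a larger test",
}

def route(comparison: dict) -> str:
    """Branch on the enum, never on verdictHuman prose."""
    posture = comparison["decisionPosture"]
    if posture not in POSTURE_ACTIONS:
        # Unknown enum value: fail closed rather than guessing.
        raise ValueError(f"unknown decisionPosture: {posture!r}")
    return POSTURE_ACTIONS[posture]

print(route({"decisionPosture": "canary_recommended"}))
```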

Multi-run aggregation

Median, p90, and variance across 3–10 runs per actor. Excludes failed runs from rate calculations and surfaces them as advisory warnings. More runs tighten the confidence band.
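The aggregation can be sketched like this. The nearest-rank percentile convention is an assumption; the actor's exact percentile method is not documented:

```python
import math
import statistics

def p90(values):
    """Nearest-rank 90th percentile; one common convention (the actor's
    exact method is not documented, so treat this as a sketch)."""
    ordered = sorted(values)
    rank = math.ceil(0.9 * len(ordered)) - 1
    return ordered[rank]

# Five hypothetical per-run durations (seconds) for one actor
durations = [17.8, 18.2, 16.9, 21.4, 18.0]
print(statistics.median(durations))  # 18.0
print(p90(durations))                # 21.4: one slow run dominates the tail
```

The gap between median and p90 is exactly what a single run hides.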

Cost-per-result comparison

Per-run and per-result cost deltas — not just raw run cost. An actor that costs more per run but produces 3x the results may still win on cost-per-result.
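A worked example of that claim, with hypothetical numbers (the function name is illustrative, not the actor's API):

```python
def cost_per_result(median_cost_usd: float, median_results: int) -> float:
    return median_cost_usd / median_results

# Hypothetical: actor A costs more per run but yields 3x the results
a = cost_per_result(0.24, 60)  # $0.004 per result
b = cost_per_result(0.10, 20)  # $0.005 per result
winner = "actorA" if a < b else "actorB"
print(winner)  # actorA wins on cost-per-result despite the higher run cost
```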

Output shape divergence detection

Jaccard similarity on output field sets. Low overlap raises a RESULT_SHAPE_DIVERGENCE blocking warning so you don't treat two actors as interchangeable when their output shapes disagree.
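Jaccard similarity is simply intersection over union of the two field sets. A sketch with hypothetical field names; the 0.5 cutoff is an assumption, as the actor's real threshold is not documented:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two field sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

fields_a = {"url", "title", "price", "currency"}
fields_b = {"url", "title", "cost"}

similarity = jaccard(fields_a, fields_b)  # 2 shared / 5 total = 0.4
THRESHOLD = 0.5  # hypothetical cutoff; the actor's real threshold is undocumented
if similarity < THRESHOLD:
    print("RESULT_SHAPE_DIVERGENCE")
```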

Decision stability score

Pairwise matchups across run subsets check whether the winner holds under different samples. Low winnerConsistency flags UNSTABLE_WINNER — the aggregate verdict may flip with different runs.
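One way to picture winnerConsistency, using all-pairs matchups on cost. The all-pairs scheme and the numbers are assumptions; the actor's actual subset construction is not documented:

```python
from itertools import product

def winner_consistency(costs_a, costs_b):
    """Fraction of per-run matchups in which actorB (the aggregate winner
    here) is also cheaper in the individual pairing. All-pairs matchups
    are an assumption, not the actor's documented scheme."""
    matchups = list(product(costs_a, costs_b))
    b_wins = sum(1 for a, b in matchups if b < a)
    return b_wins / len(matchups)

costs_a = [0.30, 0.32, 0.29, 0.31, 0.33]
costs_b = [0.12, 0.11, 0.14, 0.35, 0.12]  # one outlier run at $0.35
print(winner_consistency(costs_a, costs_b))  # 0.8: B holds in 20 of 25 pairings
```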

Fairness guarantees

Parallel launch and clock-skew detection (childRunStartSpreadSec). An unfair launch raises FAIRNESS_VIOLATION and forces the verdict to monitor_only regardless of confidence.
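A sketch of that guard. The 5-second limit, the offsets, and the helper are hypothetical; only the field and flag names come from the docs:

```python
def fairness_check(start_offsets_sec, max_spread_sec=5.0):
    """Clock-skew check: if child runs did not start close together, flag a
    fairness violation and force monitor_only. The 5-second limit is an
    assumption; the actor's real threshold is not documented."""
    spread = max(start_offsets_sec) - min(start_offsets_sec)
    if spread > max_spread_sec:
        return {"childRunStartSpreadSec": spread,
                "warning": "FAIRNESS_VIOLATION",
                "decisionPosture": "monitor_only"}
    return {"childRunStartSpreadSec": spread,
            "warning": None,
            "decisionPosture": None}

# One child run launched 12 seconds late (hypothetical offsets)
print(fairness_check([0.0, 0.4, 0.7, 12.1])["decisionPosture"])  # monitor_only
```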

Actor comparison methods compared

| Method | Reliability | Machine-readable verdict | Cost per comparison |
| --- | --- | --- | --- |
| ApifyForge A/B Tester (5 runs) | High — median + p90 + stability | decisionPosture + confidenceLevel | $1.50 (5 × $0.30) |
| Single-run eyeball | Low — 30%+ variance | None | Actor run cost only |
| Custom benchmark script | Medium — depends on rigor | Whatever you build | 4–12 hours engineering |
| Apify Store star ratings | Very low — reputation signal | None | Free |

Example ApifyForge A/B Tester output

{
  "comparison": {
    "winner": "actorB",
    "decisionPosture": "canary_recommended",
    "confidenceLevel": "medium",
    "recommendationLevel": "directional",
    "runsPerActor": 5,
    "materiality": "material",
    "verdictCode": "B_WINS_COST_EFFICIENCY",
    "verdictHuman": "actorB wins on cost-per-result ($0.12 vs $0.31) with medium confidence — canary rollout recommended.",
    "decisionStability": { "flipRisk": "low", "winnerConsistency": 0.9, "pairwiseTotal": 10 }
  },
  "actorA": { "successfulRuns": 5, "medianResults": 42, "medianCostUsd": 0.31, "p90DurationSec": 18.2 },
  "actorB": { "successfulRuns": 5, "medianResults": 38, "medianCostUsd": 0.12, "p90DurationSec": 22.1 },
  "warnings": []
}

Branch on decisionPosture. Never parse verdictHuman — the wording may change between versions.

How ApifyForge A/B Tester works

1. Pick two Apify actors and a shared input payload (3–10 runs per actor)

2. A/B Tester launches both in parallel, aggregates median + p90 + stability, and runs fairness checks

3. Returns decisionPosture + verdictCode + warnings — ready for CI or agent routing

Limitations

  1. Two actors only. A/B Tester compares exactly two actors — no multi-actor ranking or portfolio analysis. For fleet-wide scoring use Quality Monitor.
  2. Shared input required. Both actors must accept the same input JSON. Fundamentally different input contracts produce misleading verdicts.
  3. No semantic output validation. A/B Tester counts results and measures cost — it does not judge whether the content of each actor's output is correct. Pair with Schema Validator for output quality checks.
  4. Single-run mode is blocked from actionable verdicts. Fewer than 3 runs returns no_call with SMOKE_TEST_ONLY. Design choice — single-run evidence is not statistically safe.
  5. Requires Apify account. Both target actors run on your own Apify account at their native run cost, plus $0.30 per A/B Tester invocation.

What ApifyForge A/B Tester costs

$0.30 per A/B test run charged to your own Apify account via pay-per-event. Plus the native run cost of whichever two actors you're comparing (3–10 runs each). A typical production-grade test: 5 runs × 2 actors × $0.05 average = $0.50 actor compute + $1.50 A/B Tester = ~$2.00 total for a statistically sound verdict.

Frequently asked questions

Why not just run both actors once and compare?

Single-run comparisons are statistically unreliable. Network variance, rate-limit state, and target-site volatility can swing a single run by 30% or more. ApifyForge A/B Tester runs each actor at least 3 times (ideally 5+) and aggregates median and p90 performance, so the returned decisionPosture reflects real relative performance — not noise.

What is decisionPosture and why should I branch on it?

decisionPosture is the routable control signal for automation: switch_now (commit to the winner), canary_recommended (partial rollout — 10% traffic), monitor_only (directional result, do not switch), no_call (insufficient or unreliable evidence). It collapses winner, confidence, stability, and materiality into one enum your CI pipeline or agent can act on directly, without parsing the verdictHuman prose. Prose copy changes between versions; decisionPosture is a stable contract.

How does A/B Tester handle actor failures during the test?

Failed runs are recorded as stats.status = FAILED_TO_START, TIMED_OUT, or ERROR. Results from failed runs are excluded from the median/p90 calculations. If one actor fails every run while the other succeeds, a warning flag (ONE_SIDE_FAILED, advisory severity) is attached to the comparison — the verdict stands but signals that the losing actor may just be misconfigured for this input.
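The exclusion logic can be sketched like this. The run dicts and the helper are hypothetical; only the failure status names come from the documentation above:

```python
import statistics

FAILED_STATUSES = {"FAILED_TO_START", "TIMED_OUT", "ERROR"}

def aggregate_costs(runs):
    """Median cost over successful runs only; failed runs are surfaced as
    advisory warnings instead of skewing the rate calculations."""
    ok = [r["costUsd"] for r in runs if r["status"] not in FAILED_STATUSES]
    warnings = [r["status"] for r in runs if r["status"] in FAILED_STATUSES]
    median = statistics.median(ok) if ok else None
    return median, warnings

runs = [
    {"status": "SUCCEEDED", "costUsd": 0.30},
    {"status": "SUCCEEDED", "costUsd": 0.31},
    {"status": "TIMED_OUT", "costUsd": 0.05},  # excluded from the median
    {"status": "SUCCEEDED", "costUsd": 0.29},
]
print(aggregate_costs(runs))  # (0.3, ['TIMED_OUT'])
```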

What happens when the two actors produce different output shapes?

ApifyForge A/B Tester computes Jaccard similarity on the output field sets. When shape overlap falls below a threshold, a RESULT_SHAPE_DIVERGENCE blocking warning is attached — because comparing results/run or cost/result between fundamentally different output shapes is misleading. Review sample records side by side before trusting the verdict in that case.

How many runs should I use?

3 runs minimum for a smoke test, 5+ for a production decision. Below 3 the tool returns decisionPosture: no_call with a SMOKE_TEST_ONLY flag — you cannot get an actionable verdict. More runs tighten the confidence band and detect flakiness. Each additional run is one additional $0.30 charge.

Can I A/B test my own actor against a competitor's?

Yes, with caveats. Both actors must accept the same input JSON (shared input schema). If they accept different inputs, you are comparing apples and oranges — the decisionPosture may be statistically sound but practically useless. A fairness check (parallelLaunch, childRunStartSpreadSec) guards against clock skew between the two actor launches.