Tags: reliability, monitoring, apify-store, fleet-management

How We Monitor 250+ Apify Actors Without Losing Sleep

A user reported a broken actor at 3pm. We had it fixed by 3:15. Here's why automated monitoring changes everything when you're running actors at scale.

ApifyForge Team


Today we got an email from a user: "All runs failing since yesterday's release." Our GitHub Repository Search actor — one of our most popular — was broken.

The fix took 15 minutes. But the real question was: why did a user have to tell us?

If you publish actors on the Apify Store, you know the feeling. You push a quick update, everything looks fine in testing, and then a day later someone reports a failure you never saw coming. When you manage a handful of actors, these incidents are annoying. When you manage 250+, they become existential threats to your revenue.

The Problem With Scale

When you have 5 actors on the Apify Store, you can check them manually. Open the Console, look at recent runs, spot any failures. Takes a few minutes.

When you have 250+ actors, that approach falls apart completely. Here is the math: if each actor takes 2 minutes to check (open page, review recent runs, scan for errors), that is over 8 hours of manual checking. Every single day. Nobody is doing that.
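That back-of-the-envelope math, spelled out:

```javascript
// The manual-checking arithmetic from the paragraph above.
const actorCount = 250;
const minutesPerCheck = 2;  // open page, review recent runs, scan for errors
const hoursPerDay = (actorCount * minutesPerCheck) / 60;
console.log(hoursPerDay.toFixed(1)); // "8.3" hours of checking, every day
```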

So issues slip through — and the first person to notice is a paying user who is now having a bad experience with your product.

That is what happened today. A schema change broke runs for a specific edge case. The actor worked fine for most queries but failed on others. We would not have caught it by glancing at the dashboard.

The Cost of Silent Failures

Every failed run has a cost, even if it is not immediately obvious:

  • Direct revenue loss: If you use Pay Per Event pricing, failed runs produce no events to charge for
  • User churn: A user whose first run fails almost never comes back — they switch to a competitor
  • Maintenance flags: Consistent failures trigger Apify's maintenance flag system, which tanks your search visibility
  • Reputation damage: Bad reviews and word-of-mouth spread faster than good ones

Across our 250+ actor portfolio, we estimate that each day of undetected failure costs between $5 and $50 in lost revenue per actor, depending on its traffic. Multiply that by even 5 broken actors and you are looking at real money.

What Automated Monitoring Looks Like

After fixing the immediate issue, we ran our fleet health monitor across all 250+ actors. The results were eye-opening:

  • 78% fleet health score — lower than expected
  • 62 actors flagged — a mix of real failures and actors that had never been run by external users
  • 5 actors with genuine issues — external APIs returning errors, edge cases in input handling
  • 47 stale actors — never run or not run in 30+ days

We would not have found most of these manually. The health monitor checks every actor's success rate, identifies failure patterns, and generates specific recommendations — all in one run.

Building a Health Check Script

The foundation of our monitoring is a script that queries the Apify API for every actor we own and evaluates its health. Here is the core logic:

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

async function checkFleetHealth() {
    const actors = await client.actors().list({ my: true });
    const report = { healthy: [], warning: [], critical: [], stale: [] };

    for (const actor of actors.items) {
        // desc: true returns newest items first (the API defaults to oldest first)
        const runs = await client.actor(actor.id).runs().list({ limit: 10, desc: true });
        const builds = await client.actor(actor.id).builds().list({ limit: 1, desc: true });

        const recentRuns = runs.items || [];
        const failures = recentRuns.filter(r => r.status === 'FAILED').length;
        const successRate = recentRuns.length > 0
            ? ((recentRuns.length - failures) / recentRuns.length) * 100
            : null;

        const lastBuild = builds.items?.[0];
        const daysSinceLastBuild = lastBuild?.finishedAt
            ? Math.floor((Date.now() - new Date(lastBuild.finishedAt)) / 86400000)
            : Infinity;

        // Classify actor health
        if (recentRuns.length === 0 || daysSinceLastBuild > 90) {
            report.stale.push({ name: actor.name, daysSinceLastBuild });
        } else if (failures >= 5) {
            report.critical.push({ name: actor.name, successRate, failures });
        } else if (failures >= 2 || successRate < 90) {
            report.warning.push({ name: actor.name, successRate, failures });
        } else {
            report.healthy.push({ name: actor.name, successRate });
        }
    }

    return report;
}

We run this daily. It catches problems before Apify's maintenance system does, giving us time to fix things proactively. The key insight: Apify gives you a grace period before flagging an actor for maintenance. If you catch failures in that window, you can fix them before any user-facing consequences.
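When the daily run happens in a CI job, the report can be reduced to a single number and an exit signal. A minimal sketch, assuming the report shape produced by checkFleetHealth above; the healthy-over-total formula is our simple illustration of a "fleet health score", not an official Apify metric:

```javascript
// Reduce a fleet report to a 0-100 score. A scheduler can treat a low score
// or any critical entries as a failed job and notify on it.
function fleetScore(report) {
    const total = report.healthy.length + report.warning.length
        + report.critical.length + report.stale.length;
    return total > 0 ? Math.round((report.healthy.length / total) * 100) : 100;
}
```

Setting process.exitCode to non-zero when report.critical is non-empty turns any scheduler's failure notifications into free paging.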

Tracking Failure Patterns Over Time

A single failed run is not necessarily a problem — websites go down, APIs have transient errors, users pass invalid inputs. What matters is the trend. We track failure rates over rolling 7-day and 30-day windows:

async function getFailureTrend(actorId) {
    // desc: true so the 100 runs are the most recent, not the oldest
    const runs = await client.actor(actorId).runs().list({ limit: 100, desc: true });
    const now = Date.now();
    const sevenDaysAgo = now - 7 * 86400000;
    const thirtyDaysAgo = now - 30 * 86400000;

    const last7d = runs.items.filter(r =>
        new Date(r.startedAt).getTime() > sevenDaysAgo
    );
    const last30d = runs.items.filter(r =>
        new Date(r.startedAt).getTime() > thirtyDaysAgo
    );

    const rate7d = last7d.length > 0
        ? last7d.filter(r => r.status === 'FAILED').length / last7d.length
        : 0;
    const rate30d = last30d.length > 0
        ? last30d.filter(r => r.status === 'FAILED').length / last30d.length
        : 0;

    return {
        rate7d: (rate7d * 100).toFixed(1) + '%',
        rate30d: (rate30d * 100).toFixed(1) + '%',
        trending: rate7d > rate30d ? 'WORSENING' : 'STABLE'
    };
}

An actor with a 2% failure rate over 30 days but 10% over the last 7 days is trending in the wrong direction. That is the kind of signal you need to catch early.
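In practice, a strict greater-than comparison flips to WORSENING on tiny noise. One way to add a margin, sketched below; the 1.5x factor is an arbitrary choice, not a number from our production setup:

```javascript
// Classify a failure-rate trend with a tolerance band instead of a strict
// comparison, so a 2.0% -> 2.1% wiggle stays STABLE while 2% -> 10% flags.
function classifyTrend(rate7d, rate30d, factor = 1.5) {
    if (rate7d > rate30d * factor) return 'WORSENING';
    if (rate30d > rate7d * factor) return 'IMPROVING';
    return 'STABLE';
}
```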

What We Fixed Today

The GitHub Repository Search issue was a schema validation problem — an edge case where the data did not match the expected format. We shipped 3 builds in quick succession to resolve it completely.

But the health monitor also flagged:

  • An actor that crashed when external APIs returned temporary errors — needed retry logic with exponential backoff
  • Several actors that failed when users ran them with empty input — needed graceful handling with proper defaults
  • An actor that hung when processing large XML files — needed a timeout and chunked processing
  • Two actors with outdated dependencies — a breaking change in a library update caused silent data corruption

All fixed and deployed within hours. Without the automated scan, these would have been ticking time bombs waiting for a user to trigger them.
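The empty-input fix from that list boils down to normalizing whatever the user sends before the actor touches it. A minimal sketch; the field names and defaults here are illustrative, not taken from our real actors:

```javascript
// Normalize possibly-missing actor input, filling safe defaults.
// "query" and "maxResults" are hypothetical fields for illustration.
function normalizeInput(raw) {
    const input = raw ?? {};
    return {
        query: typeof input.query === 'string' && input.query.trim()
            ? input.query.trim()
            : null,                 // null => caller decides how to handle
        maxResults: Number.isInteger(input.maxResults) && input.maxResults > 0
            ? input.maxResults
            : 100,                  // sensible default instead of a crash
    };
}
```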

The Retry Logic Pattern

One of the most common fixes we apply is proper retry logic for external API calls. Here is the pattern we use across all our actors:

async function fetchWithRetry(url, options = {}, maxRetries = 3) {
    for (let attempt = 1; attempt <= maxRetries; attempt++) {
        try {
            const response = await fetch(url, {
                ...options,
                signal: AbortSignal.timeout(30000) // 30s timeout
            });

            if (response.status === 429 && attempt < maxRetries) {
                // Rate limited: honor the server's Retry-After, then retry.
                // On the final attempt we fall through and return the 429
                // response instead of returning undefined.
                const retryAfter = parseInt(response.headers.get('retry-after') || '5', 10);
                await new Promise(r => setTimeout(r, retryAfter * 1000));
                continue;
            }

            if (!response.ok && attempt < maxRetries) {
                await new Promise(r => setTimeout(r, Math.pow(2, attempt) * 1000));
                continue;
            }

            return response;
        } catch (error) {
            if (attempt === maxRetries) throw error;
            await new Promise(r => setTimeout(r, Math.pow(2, attempt) * 1000));
        }
    }
}

This pattern alone eliminated roughly 40% of our transient failures across the fleet.

The Five Metrics We Track Across Our Fleet

After a year of managing 250+ actors, we have narrowed our monitoring down to five essential metrics:

  1. Success rate per actor — anything below 99% gets investigated. We pull this from the Apify API daily and flag actors that drop below threshold.

  2. Failure trends — is an actor getting worse over time? A sudden spike usually means an external dependency changed. A gradual decline means something is slowly rotting.

  3. Stale actors — actors nobody is using might have broken quietly. We define "stale" as no external runs in 30+ days. These get a manual health check once a month.

  4. External API health — third-party APIs go down, change their response formats, or introduce rate limits. We track which actors depend on which external services so we can quickly assess blast radius when a service has an incident.

  5. Edge case coverage — does the actor handle empty input, null values, Unicode characters, and rate limits? We maintain a standard set of edge case tests that every actor should pass. See our testing guide for the full checklist.
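Metric 4 can start as nothing more than a hand-maintained map. A sketch; the actor and service names below are made up for illustration:

```javascript
// Hypothetical dependency map: actor name -> external services it calls.
const dependencies = {
    'github-repo-search': ['api.github.com'],
    'weather-aggregator': ['api.openweathermap.org', 'api.weatherapi.com'],
    'repo-stats-export': ['api.github.com'],
};

// Blast radius: which actors are affected when a service has an incident?
function blastRadius(service) {
    return Object.entries(dependencies)
        .filter(([, services]) => services.includes(service))
        .map(([actorName]) => actorName);
}
```

When a provider posts an incident, one lookup tells you which actors to watch.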

Setting Up Alerts

Metrics are useless if nobody looks at them. We pipe our health check results into notifications:

async function sendAlert(report) {
    const critical = report.critical;
    if (critical.length === 0) return;

    const message = critical.map(a =>
        `CRITICAL: ${a.name} — ${a.failures} failures, ${a.successRate.toFixed(1)}% success rate`
    ).join('\n');

    // Send to your preferred channel: email, Slack, Discord, etc.
    await fetch(process.env.WEBHOOK_URL, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ text: message })
    });
}

The rule is simple: critical alerts go out immediately. Warnings get batched into a daily digest. Stale actor reports go out weekly.

Why This Matters for Revenue

Every failed run is a user who might not come back. If your actor fails on someone's first try, they will switch to a competitor. They will not file a bug report — they will just leave.

When you are earning revenue through PPE (Pay Per Event), reliability directly impacts your income. Here are real numbers from our portfolio:

  • 99.5% success rate: approximately $0.12 revenue per run (users trust the actor, run it repeatedly)
  • 95% success rate: approximately $0.07 revenue per run (some users churn after failures)
  • 90% success rate: approximately $0.03 revenue per run (maintenance flag risk, poor reviews, high churn)

A 99% success rate across 30,000 monthly runs means 300 failures. That is 300 potentially lost users. At our average revenue per user, that translates to roughly $200/month in lost revenue — from just a 1% failure rate.
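The arithmetic behind that estimate, as a sketch. The per-user figure is back-calculated from the $200 and 300-failure numbers above; it is not a disclosed metric:

```javascript
// Rough monthly revenue at risk from a given failure rate.
// ~$0.67 per lost user is inferred from $200 / 300 failures, an assumption.
function monthlyRevenueAtRisk(monthlyRuns, failureRate, revenuePerLostUser) {
    const lostUsers = monthlyRuns * failureRate;    // failed runs ~ lost users
    return lostUsers * revenuePerLostUser;
}
// monthlyRevenueAtRisk(30000, 0.01, 0.67) => ~$201/month
```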

Practical Tips for Monitoring at Any Scale

You do not need 250 actors to benefit from monitoring. Here is what to implement at different portfolio sizes:

5-10 Actors

  • Run a weekly manual check of recent runs in the Apify Console
  • Set up email notifications for failed runs (Apify supports this natively)
  • Keep a spreadsheet tracking each actor's last successful run date

10-50 Actors

  • Automate health checks with a script that runs daily
  • Track success rates over time to catch trends
  • Set up webhook alerts for critical failures
  • Use the Schema Validator to catch input issues before they cause failures

50+ Actors

  • Full automated monitoring with daily health reports
  • Failure trend analysis with 7-day and 30-day windows
  • Dependency mapping (which actors share external APIs)
  • Automated test runs after every deployment
  • Fleet-wide dashboards showing health, revenue, and staleness

The Takeaway

If you are managing more than a handful of actors on the Apify Store, you need automated monitoring. Checking manually does not scale, and waiting for user reports means you have already lost trust.

We built ApifyForge specifically for this — a dashboard that monitors your entire actor fleet and flags issues before your users find them. The Fleet Analytics tool gives you the bird's-eye view of actor health, failure rates, revenue, and staleness across your entire portfolio in one place.

Today's incident was a good reminder: the actor that makes you money today can break tomorrow. The question is whether you find out from your monitoring system or from an angry email.

Start with one health check script. Run it daily. Expand from there. Your future self — and your revenue — will thank you.
