
How to Detect Abandoned GitHub Repositories at Scale

Detect abandoned GitHub repos at scale. Catch zombies, COLLAPSING trajectories, and bus-factor risk across 10,000 repos at $0.15 per repo.

Ryan Clinton

The problem: Naive abandonment checks miss roughly 40% of dead repos because they trust the last-commit date. A Kubernetes operator with daily Dependabot pushes looks alive but hasn't seen a human commit in 18 months. A popular Terraform provider looks healthy until you notice the sole maintainer stopped reviewing PRs four months ago. An AI framework with 45,000 stars hasn't tagged a release since 2023, and the README still says "production-ready". Multiply that by a dependency tree of 500 packages or a category audit of 10,000 repos and the manual approach falls apart — both because GitHub's Search API caps results at 1,000 per query and because nobody on the team has time to read 10,000 commit graphs.

What is abandoned-repo detection at scale? It's the process of classifying public GitHub repositories as ACTIVE, AT_RISK, or ABANDONED across hundreds or thousands of repos in a single run, using a stack of signals — days since last push, commit trajectory, decay velocity, bot-vs-human activity, contributor concentration, release cadence, and bus-factor impact — instead of a single date field. The output is a verdict per repo plus a diff against the previous run.

Why it matters: Most teams discover an abandoned dependency after a production incident, not before. Two in three open-source projects that application teams depend on have become dormant, per Sonatype's 2024 State of the Software Supply Chain report. Synopsys' 2024 OSSRA audit of 1,067 commercial codebases found 84% contained at least one open-source component with a known vulnerability. Abandoned repos don't patch CVEs and don't fix bugs — yesterday's clean dependency becomes tomorrow's unreported risk, and the gap between "the maintainer left" and "the SRE channel pages at 3am" is usually weeks, not months.

Use it when: You need to audit a dependency tree, monitor a category for newly abandoned projects, screen a VC's open-source portfolio for COLLAPSING trajectories, or run scheduled re-evaluation on hundreds-to-thousands of repos.

Quick answer:

  • What it is: A signal-stack approach to repo abandonment that combines daysSinceLastPush, decay score, trajectory, zombie detection, bus factor, and release cadence into a maintenance verdict per repo.
  • When to use it: Dependency audits over 50+ repos, weekly category monitoring, supply-chain due diligence, and one-off market scans up to 10,000 repos.
  • When NOT to use it: A single repo you can eyeball in a minute, private repos (Search API doesn't return them), or quality reviews that need code-level static analysis.
  • Typical steps: Define scope → pick mode → enable auto-partition → enable cross-run diff → schedule weekly → read the diff.
  • Main tradeoff: At-scale detection costs per-repo API time and PPE charges. Worth it once the dependency or category list is bigger than a human can manually re-read every week.

Decision shortcut — flag a GitHub repo as abandoned in 5 signals:

  1. daysSinceLastPush > 365 — formal abandonment threshold.
  2. maintenance.trajectory: COLLAPSING — abandonment in progress, not yet over the line.
  3. isZombie: true — recent pushes are bot-only; no human activity in 90 days.
  4. contributors.topContributorShare > 0.8 + dropoff — bus factor concentrated, departure already happened.
  5. No tagged release in 180+ days for a production library — release contract broken.

Any one signal is suspicious. Two together is decisive.
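The shortcut translates directly into code. This is a hedged sketch, not the actor's implementation: the field names mirror the per-repo output schema shown later in this article, and the thresholds are the ones listed above.

```python
# Sketch: the five-signal decision shortcut as a triage function.
# `repo` is assumed to be one record of the actor's per-repo dataset.

def abandonment_signals(repo: dict) -> list:
    """Return the abandonment signals a repo record trips."""
    m = repo.get("maintenance", {})
    signals = []
    if repo.get("daysSinceLastPush", 0) > 365:
        signals.append("past_365_day_threshold")
    if m.get("trajectory") == "COLLAPSING":
        signals.append("collapsing_trajectory")
    if m.get("isZombie"):
        signals.append("zombie_bot_only_activity")
    if repo.get("contributors", {}).get("topContributorShare", 0) > 0.8:
        signals.append("bus_factor_concentrated")
    if repo.get("latestRelease", {}).get("daysSinceRelease", 0) > 180:
        signals.append("release_cadence_broken")
    return signals

def verdict(repo: dict) -> str:
    """One signal is suspicious; two together is decisive."""
    n = len(abandonment_signals(repo))
    return "ABANDONED" if n >= 2 else ("SUSPICIOUS" if n == 1 else "OK")
```

Run against the worked JSON example later in this article, the function trips three of the five signals and returns ABANDONED even though the formal 365-day threshold hasn't been crossed.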

If you only remember 5 things:

  1. pushedAt lies. Bot commits, license edits, and Dependabot bumps all tick the timestamp without anyone maintaining the repo.
  2. Zombie repos are common. A 12k-star library can have daily commits and zero human contributions in 90 days — that's not maintained, that's running on autopilot.
  3. Trajectory beats stars. A 45k-star COLLAPSING repo is more dangerous than a 2k-star GROWING one. Direction is the leading indicator; star count is a lagging vanity metric.
  4. Bus factor predicts collapse. One person doing 90%+ of recent commits is the single most reliable abandonment predictor. Concentration kills more dependencies than CVEs.
  5. Read the diff, not the dataset. A 200-row weekly audit is unreadable. The 3–5 row "what changed since last week" diff is the only artefact a team will actually open.

In this article: Definition · Why naive checks miss ~40% · The signal stack · The 1,000-result cap · Scheduled diff monitoring · KV-summary outputs · Worked example · Cost math · Limitations · FAQ

Key takeaways:

  • Last-commit date is gameable — Dependabot, GitHub Actions bots, and license-file edits all bump pushedAt without anyone maintaining the project.
  • Real abandonment detection needs a signal stack: daysSinceLastPush, maintenance.status, maintenance.trajectory, decayScore, isZombie, bus factor concentration, and release cadence — read together, not separately.
  • GitHub's Search API hard-caps results at 1,000 per query. Auditing a 10,000-repo category needs partitioning across star ranges with deduplication, which the GitHub Repo Intelligence actor handles via autoPartitionResults.
  • Scheduled monitoring with compareToPreviousRun flags NEW, SCORE_CHANGE, STATUS_CHANGE, and NEWLY_ABANDONED repos so a team only reads what changed — not the full 10,000-row dataset every week.
  • At $0.15 per repo, weekly monitoring of a 200-repo dependency tree costs roughly $30/week — a few minutes of one engineer's loaded time.

Problems this solves:

  • How to detect abandoned GitHub repos across a 500-package dependency tree
  • How to flag zombie repos that look maintained but aren't
  • How to scan more than 1,000 repos in a single category audit
  • How to surface NEWLY_ABANDONED projects between weekly scheduled runs
  • How to score bus-factor risk across hundreds of dependencies at once
  • How to separate feature-complete projects from dying ones

Examples table — input scenarios mapped to verdicts:

| Repo profile | Last push | Naive check | Real verdict |
| --- | --- | --- | --- |
| AI inference framework, 45k stars, 1 maintainer doing 94% of commits in last year, no release in 412 days | 127 days ago | "Active, looks fine" | COLLAPSING, bus factor HIGH, NEWLY_ABANDONED triggered |
| Kubernetes operator, 12k stars, daily commits, 100% from dependabot[bot] and github-actions[bot] | 2 days ago | "Active, recent commits" | ZOMBIE, AT_RISK, no human activity 90d |
| Cryptography helper library, 800 stars, no commits for 720 days, archived flag not set | 720 days ago | "Probably dead, skip" | ABANDONED, isAbandoned: true |
| Small CLI utility, 2k stars, no commits for 540 days, but stable test suite, clean license, zero open issues | 540 days ago | "Abandoned, avoid" | FEATURE_COMPLETE, MEDIUM adoption readiness |
| Terraform provider, 6k stars, dormant 2 years, suddenly 80 commits this month, 14 contributors | 4 days ago | "Wait, what?" | REVIVING, isRevived: true, MEDIUM risk |

These archetypes are what the signal stack disambiguates. A single field like pushedAt can't tell COLLAPSING from FEATURE_COMPLETE from REVIVING — but together with decayScore, isZombie, topContributorShare, and latestRelease.daysSinceRelease, the verdicts separate cleanly.

What is an abandoned GitHub repository at scale?

Definition (short version): An abandoned GitHub repository at scale is any public repository in a batch (typically dependency trees of 50–500 or category audits of 1,000–10,000) that crosses the abandonment threshold across multiple lifecycle signals — not just one — and is detected automatically rather than manually.

The "at scale" part is doing real work in that definition. Detecting abandonment in a single repo is a five-second eyeball check. Detecting abandonment across 500 dependencies, 30 forks of a popular framework, or 10,000 repos in a vector-database category scan is a different problem entirely. The bottleneck stops being "what does abandonment mean" and starts being pagination, rate limits, query partitioning, deduplication, scheduled re-evaluation, and diff detection.

There are roughly four levels of abandonment-detection maturity:

  1. Eyeball one repo at a time — fine for browsing, useless for fleets.
  2. Filter by pushedAt date in a CSV export — catches the obvious carcasses, misses zombies and COLLAPSING projects.
  3. Multi-signal classification per repo — combines daysSinceLastPush, lifecycle stage, decay score, bus factor, and release cadence into a single status enum.
  4. Scheduled multi-signal classification with cross-run diff — runs the classification on a schedule and flags only what changed since the last run.

Also known as: dependency abandonment audit, open-source dormancy detection, repository lifecycle monitoring, supply-chain dropout detection, GitHub fleet maintenance scan, OSS abandonment surveillance.

Why naive abandonment checks fail across a fleet

Most teams build their first abandonment detector in an afternoon. They pull the GitHub Search API, sort by pushedAt, set a 365-day threshold, and call it done. Three weeks later they realise the false-positive rate is somewhere around 40% — because pushedAt lies.

pushedAt is gameable. Anything that touches the repo bumps it: a Dependabot version bump, a GitHub Actions workflow update, a typo fix in a docs PR merged by a bot. None of those are maintenance. They look like maintenance to a date filter.

Stars are independent of health. A repo can have 45,000 lifetime stars and be one maintainer away from collapse. Stars don't decay when activity stops. Filtering by stars catches popular-but-stale projects badly.

A single ABANDONED flag is too late. By the time daysSinceLastPush crosses 365, the project has been effectively dead for a year. The signal an audit actually wants is COLLAPSING — decay velocity is high, bus factor is concentrated, no release in months — before the formal threshold. That's a forward-looking classification, not a date filter.

Forks, archives, and migrations all confuse simple checks. A project might have moved to a successor repo (the original is technically dead, the work is alive). It might be archived (intentionally feature-complete). It might be a dormant fork of an active upstream. None of these read correctly from a pushedAt filter alone.

Category-wide scans hit the 1,000-result cap. GitHub's Search API hard-limits each query to 1,000 results. If you want to audit "all vector database repos" or "all React component libraries" or "all Python ML frameworks", a single query will return the top 1,000 by your sort order — and the abandoned ones tend to live below that line, exactly where the audit is most valuable.

The honest read: at one repo a date filter is fine; at one thousand repos a date filter ships wrong verdicts.
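To make the pushedAt problem concrete: one way to see past it is to filter bot authors out of the commit list returned by GitHub's public `/repos/{owner}/{repo}/commits` endpoint and measure recency against the most recent human commit. The bot-name pattern below is illustrative, not the actor's actual heuristic.

```python
# Sketch: approximate "days since last HUMAN push" from the GitHub
# commits API response (newest commits first). The "[bot]" suffix
# convention covers dependabot[bot], github-actions[bot], renovate[bot].
from datetime import datetime, timezone
from typing import Optional

BOT_SUFFIX = "[bot]"

def last_human_commit_date(commits) -> Optional[datetime]:
    """commits: commit objects as returned by GET /repos/{o}/{r}/commits."""
    for c in commits:
        login = (c.get("author") or {}).get("login", "")
        if login and not login.endswith(BOT_SUFFIX):
            iso = c["commit"]["author"]["date"]  # e.g. "2025-01-01T00:00:00Z"
            return datetime.fromisoformat(iso.replace("Z", "+00:00"))
    return None

def looks_like_zombie(commits, now=None, window_days=90) -> bool:
    """True if no human-authored commit landed inside the window."""
    now = now or datetime.now(timezone.utc)
    last = last_human_commit_date(commits)
    return last is None or (now - last).days > window_days
```

A repo with daily Dependabot pushes sails through a pushedAt filter but fails this check immediately — which is the whole point of the zombie signal.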

The signal stack that actually detects abandonment

Multi-signal classification means reading several fields together so each signal can correct the others. The GitHub Repo Intelligence actor computes the following fields per repo, and the abandonment verdict comes from how they line up.

        daysSinceLastPush
                 +
        maintenance.trajectory
                 +
        contributors.topContributorShare
                 +
        latestRelease.daysSinceRelease
                 +
        isZombie + zombieSignals
                 +
        forecast.abandonmentRisk90d
                          ↓
            maintenance verdict
        (ACTIVE / SLOWING / AT_RISK / ABANDONED
         × GROWING / STABLE / COLLAPSING / REVIVING
         × isZombie / isFeatureComplete)

No single field carries enough information. Read together they correct each other — that's the difference between a date filter and a maintenance verdict.

Date-based signals:

  • daysSinceLastPush — exact days since the last push to any branch. Computed at extraction time, not stored stale.
  • isAbandoned — boolean shorthand: true if daysSinceLastPush exceeds 365.

Lifecycle classification:

  • maintenance.status — one of ACTIVE / STABLE / SLOWING / AT_RISK / ABANDONED. The five-stage enum that handles edge cases the date filter misses.
  • maintenance.trajectory — GROWING / STABLE / DECLINING / COLLAPSING / REVIVING. Direction, not state. A SLOWING + REVIVING repo is recovering. An ACTIVE + COLLAPSING repo is on the way down.
  • maintenance.decayScore (0-100) and decayVelocity (NONE/SLOW/FAST) — how fast the repo is losing activity, not just whether it has lost activity.
  • maintenance.timeToCriticalRisk — projected window until a slowing project crosses into AT_RISK ("60-120 days"). Lets a team plan a replacement before a dependency breaks production.

Activity-vs-noise signals:

  • maintenance.isZombie — true if recent commits are bot-only or mechanical. Catches the Dependabot-keepalive pattern.
  • maintenance.zombieSignals — array of reasons (bot_only_commits, no_human_activity_90d, no_releases_180d).
  • maintenance.isRevived — true if a previously dormant project has come back. Stops the audit treating REVIVING repos as dead.
  • maintenance.isFeatureComplete — true if low activity is intentional: stable test suite, clean license, zero open issues. Distinguishes "done" from "dying".

Concentration / governance signals:

  • contributors.topContributorShare — share of recent commits by the top one contributor. Above 0.8 is a single-point-of-failure flag.
  • maintenance.busFactorRisk — LOW / MEDIUM / HIGH / CRITICAL, derived from contributor concentration.
  • maintenance.ifMaintainerLeaves — MINIMAL_IMPACT / MODERATE_IMPACT / PROJECT_LIKELY_STALLS. Predicts what happens if the top contributor disappears.
  • latestRelease.daysSinceRelease — release cadence proxy. Production libraries with no release in 180+ days are caution flags.
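For intuition, the concentration share can be approximated from the public `/repos/{owner}/{repo}/contributors` endpoint, which returns per-contributor commit counts. The risk buckets below are illustrative guesses anchored on the 0.8 single-point-of-failure line cited above — the actor's exact cut-offs aren't published in this article.

```python
# Sketch: topContributorShare and a bus-factor bucket from the
# `contributions` counts in GitHub's contributors API response.

def top_contributor_share(contributions) -> float:
    """contributions: per-contributor commit counts, highest first or not."""
    total = sum(contributions)
    return max(contributions) / total if total else 0.0

def bus_factor_risk(share: float) -> str:
    # Illustrative thresholds, not the actor's published ones.
    if share > 0.95:
        return "CRITICAL"
    if share > 0.8:
        return "HIGH"
    if share > 0.6:
        return "MEDIUM"
    return "LOW"
```

With counts of [94, 3, 2, 1] the share comes out at 0.94 — the HIGH bucket, matching the worked example later in this article.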

Forecast layer:

  • forecast.abandonmentRisk90d — LOW / MEDIUM / HIGH / CRITICAL projection for whether the repo will be abandoned within 90 days, with a confidence rating.

Read together, these fields produce a verdict like:

vue-some-library — maintenance.status: AT_RISK, trajectory: COLLAPSING, decayScore: 78, decayVelocity: FAST, isZombie: false, topContributorShare: 0.91, latestRelease.daysSinceRelease: 287, ifMaintainerLeaves: PROJECT_LIKELY_STALLS, forecast.abandonmentRisk90d: HIGH. Verdict: NEWLY_ABANDONED candidate.

That's the difference between a date filter and a maintenance verdict.

JSON output example — abandonment fields

Here's what the maintenance and forecast layers look like for a single COLLAPSING repo flagged in a weekly run:

{
    "fullName": "example-org/dormant-utility",
    "stars": 45200,
    "daysSinceLastPush": 287,
    "isAbandoned": false,
    "isArchived": false,
    "latestRelease": {
        "tag": "v3.1.0",
        "daysSinceRelease": 412
    },
    "contributors": {
        "count": 124,
        "topContributorShare": 0.94,
        "signedCommitRatio": 0.31
    },
    "maintenance": {
        "status": "AT_RISK",
        "trajectory": "COLLAPSING",
        "decayScore": 78,
        "decayVelocity": "FAST",
        "timeToCriticalRisk": "60-90 days",
        "isZombie": false,
        "zombieSignals": [],
        "isRevived": false,
        "isFeatureComplete": false,
        "busFactorRisk": "HIGH",
        "ifMaintainerLeaves": "PROJECT_LIKELY_STALLS",
        "confidence": "HIGH"
    },
    "forecast": {
        "abandonmentRisk90d": "HIGH",
        "maintenanceRiskProjection": "INCREASING",
        "confidence": "HIGH"
    },
    "changeType": "NEWLY_ABANDONED"
}

That's a self-describing record. A reader doesn't need a dashboard — status: AT_RISK, trajectory: COLLAPSING, topContributorShare: 0.94, latestRelease.daysSinceRelease: 412, and changeType: NEWLY_ABANDONED collectively settle it. The repo isn't dead by the formal 365-day rule yet (isAbandoned: false), but every other field says it's effectively over.

Naive checks vs intelligence-layer detection

The thesis condensed into one table — every row is a real failure mode of the naive approach:

| Signal | Naive verdict | Intelligence-layer verdict |
| --- | --- | --- |
| Recent commits visible on the repo page | "Active" | isZombie: true if commits are bot-only |
| 45k stars, looks healthy | "Safe to adopt" | COLLAPSING trajectory + HIGH bus factor risk |
| No commits for 540 days | "Dead, skip" | isFeatureComplete: true if license clean + zero open issues + stable tests |
| One maintainer | "Fine, common pattern" | topContributorShare > 0.8 + ifMaintainerLeaves: PROJECT_LIKELY_STALLS |
| Last release 412 days ago | "Stable" | Release-cadence flag — production library contract broken |
| Dormant 2 years, suddenly active | "Resurrection? Ignore?" | isRevived: true, REVIVING trajectory, MEDIUM adoption readiness |
| 1,200 results returned for a category | "That's the whole category" | Query coverage report flags partition incomplete; auto-partition needed |

Each row is a verdict the date-filter approach gets wrong. Read together, the signal stack catches them all.

The 1,000-result cap problem at category scale

GitHub's Search API returns at most 1,000 results per query. This is a hard cap, not a rate limit — pagination stops at page 10, sort order doesn't extend it, and the total_count field will tell you there are 47,000 matching repos while only handing you 1,000 of them.

For a category audit this is the actual hard problem. Pulling "all React component library repos" or "all Python ML frameworks" or "all vector databases" returns the top 1,000 by your chosen sort, and the abandoned ones tend to live below that line — they don't trend in the search-by-stars view, they don't appear in the search-by-recently-updated view because they haven't been updated. They're invisible to a one-shot query.

Naive workarounds — paginate harder, switch sort order, run multiple queries with overlapping filters — fight the wrong battle. The cap is per query, not per repo. The fix is partitioning the query into sub-queries that each return fewer than 1,000 repos, then deduplicating across the partitions.

The actor handles this with autoPartitionResults: true. It detects when a query matches more than 1,000 repos and recursively splits by star ranges, fetching each partition with rate-limit-aware delays and deduplicating across partitions. Up to 10,000 repos per run. The audit gets the long tail where the abandonment lives. The team gets a query coverage report — how many partitions ran, how many duplicates were deduped, and a confidence level for whether the partitioning actually swept the full category — so the run isn't a black box.

This is the kind of complexity that's easy to underestimate from the outside. You'd own pagination state, rate-limit recovery, partition-boundary collisions, dedup memory pressure, partial-failure recovery, and confidence scoring. Each one is solvable. Together they're a maintained service, not a script.
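The recursive split is easier to see in code. This sketch illustrates the partitioning idea only — it is not the actor's implementation. `count_for` stands in for a Search API call that reads `total_count` for a candidate star range.

```python
# Sketch: split a query's star range in half until each sub-query is
# expected to return under the 1,000-result cap, then run each
# sub-query and deduplicate by fullName across partitions.

def partition(query, lo, hi, count_for, cap=1000):
    """Return GitHub search query strings, each expected to fit the cap."""
    if count_for(lo, hi) <= cap or lo >= hi:
        return [f"{query} stars:{lo}..{hi}"]
    mid = (lo + hi) // 2
    return (partition(query, lo, mid, count_for, cap)
            + partition(query, mid + 1, hi, count_for, cap))
```

Against a category with 10,000 uniformly distributed repos across stars 0–999, this yields 16 sub-queries, each safely under the cap — and the long tail of low-star repos, where abandonment lives, gets swept instead of truncated.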

How to monitor for newly abandoned repos on a schedule

Detection is half the job. The other half is re-detection on a schedule so newly abandoned repos surface as they happen, not three quarters later when something breaks production.

The operational pattern: run the same audit weekly on an Apify Schedule with compareToPreviousRun: true enabled. The actor stores state in a named key-value store between runs and tags each repo with one of:

  • NEW — repo wasn't in the previous run, is in this one.
  • SCORE_CHANGE — repo was here last week, but at least one composite score moved meaningfully.
  • STATUS_CHANGE — maintenance.status flipped (e.g., SLOWING → AT_RISK).
  • NEWLY_ABANDONED — the high-priority flag. Repo crossed the abandonment threshold since the last run.

The diff is what the team reads. A 200-repo dependency audit produces a 200-row dataset every week. Nobody re-reads it. The KV-summary diff produces a 5-row "what changed since last week" list — usually three SCORE_CHANGE entries, one STATUS_CHANGE, one NEWLY_ABANDONED — and that's the actual Monday-morning input.

A week with no NEWLY_ABANDONED entries is also information. It means the dependency tree is stable for another week. The team confirms it in 30 seconds and moves on.
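If you were building the diff yourself, the core of it is a keyed comparison between two snapshots. The field names and score threshold below are illustrative, not the actor's exact change-detection logic.

```python
# Sketch: tag what changed between two runs, keyed by fullName.
# Priority order: abandonment crossing beats status flips beats
# score drift; repos absent from the previous run are NEW.

def diff_runs(prev, curr, score_delta=10):
    tags = {}
    for name, repo in curr.items():
        old = prev.get(name)
        if old is None:
            tags[name] = "NEW"
        elif repo["isAbandoned"] and not old["isAbandoned"]:
            tags[name] = "NEWLY_ABANDONED"
        elif repo["status"] != old["status"]:
            tags[name] = "STATUS_CHANGE"
        elif abs(repo["decayScore"] - old["decayScore"]) >= score_delta:
            tags[name] = "SCORE_CHANGE"
    return tags
```

The output is the 3–5 row Monday-morning list; everything unchanged stays out of it.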

This rolling-window diff pattern is the same thinking that drives Apify actor reliability monitoring — the fleet-wide signal isn't "what's the current state", it's "what changed since last time".

What the team actually reads on Monday morning

Per-repo scoring is one half of the output. The other half is a run-level summary written to the actor's key-value store, designed to be paste-ready for Slack, weekly digests, or audit reports without further processing. For a category audit run weekly, this is the artefact that gets opened on Monday morning.

Leaderboards. Top 10 by each of the five composite scores (project health, adoption readiness, community, supply-chain risk, outreach). A category scan returns five ranked lists — most adoptable, lowest risk, most reachable maintainers — instead of one undifferentiated table.

Category market intelligence. Score distributions across the category, count of declining and abandoned projects, breakout detection (repos with star velocity in the top 1% of category), so a market-map run tells you the shape of the category rather than just the leaders.

Narrative summary. A short paragraph synthesising the run: "Of 847 vector-database repos scanned, 62 are STRONGLY_RECOMMENDED, 14 are NEWLY_ABANDONED since last run, 3 are breakout candidates this week." Drops directly into a stand-up note or weekly digest.

Diff summary (when compareToPreviousRun is on). Counts of NEW, SCORE_CHANGE, STATUS_CHANGE, and NEWLY_ABANDONED repos since the previous scheduled run, with the affected repo names listed. This is the field the audit team actually reads.

Query coverage report. How many partitions were scanned, how many duplicates were deduplicated, and a confidence level for whether the auto-partition strategy actually swept the full category. Avoids the "we scanned 1,000 repos and assumed that was the whole market" failure mode.

The per-repo dataset is the audit trail you keep for compliance. The KV summary is the executive read.

What are the alternatives to fleet-wide abandonment detection?

There are roughly five approaches. Each makes different tradeoffs between effort, cost, and accuracy.

  1. Manual review per repo — Open the GitHub page, eyeball the commit graph, scroll the contributors. Fine for a handful of repos. At a 200-package dependency tree this is roughly two engineer-days a week, and it doesn't scale to weekly cadence.
  2. Custom GitHub Search API integration — Write a service that paginates Search results, applies a date filter, deduplicates across queries, handles 403/429 rate-limit recovery, partitions broad queries past the 1,000 cap, classifies multi-signal status, persists state across runs, and ships a diff. Possible. You also own all of it — schema drift in GitHub's API, rate-limit policy changes, partition boundary collisions, dedup memory pressure. That's a maintained service, not an afternoon script.
  3. OSS Insight or GitHub Insights dashboards — Free, dashboard-style category browsing. No abandonment verdict per repo, no zombie detection, no scheduled diff, no auto-partition past the 1,000 cap. Useful for trend visualisation, not for fleet-scale abandonment surveillance.
  4. Snyk, Sonatype, or similar enterprise supply-chain platforms — Strong on CVE scanning and license posture. Designed primarily for security-first views of already-adopted dependencies. Abandonment surfaces as a derived signal, not a primary classification, and pricing is typically tens of thousands per year.
  5. A decision-intelligence actor — Turnkey multi-signal abandonment classification with auto-partitioning and cross-run diff. One of the best fits when the goal is fleet-scale "what's newly dead this week" rather than ad-hoc one-repo checks. The developer and open-source tools comparison breaks down the feature differences.

Each approach has trade-offs in effort, cost, depth, and operational maturity. The right choice depends on how many repos you're auditing, whether weekly cadence matters, and whether you need a single classification verdict or a metrics dashboard.

Comparison table

| Approach | Cost per repo | Output | Scale | Cross-run diff | Auto-partition | Best for |
| --- | --- | --- | --- | --- | --- | --- |
| Manual review | Engineer time | Subjective | 5-20/hour | No | No | One-off deep dives |
| Custom API integration | Engineering build + run | Whatever you ship | 100s–1000s | You build it | You build it | Bespoke needs |
| OSS Insight / GitHub Insights | Free | Dashboards | 10s/session | No | No | Category browsing |
| Snyk / Sonatype | $10k–100k/yr seat | CVE + license | 1000s/run | Continuous | Partial | Post-adoption security monitoring |
| Decision-intelligence actor | $0.15/repo | Verdict + diff | Up to 10,000/run | Yes | Yes (up to 10,000) | Fleet-scale abandonment surveillance |

Pricing and features based on publicly available information as of May 2026 and may change.

Best practices for fleet-wide abandonment detection

  1. Use a signal stack, not a single field. daysSinceLastPush alone is a 40%-false-positive filter. Pair it with maintenance.trajectory, isZombie, topContributorShare, and latestRelease.daysSinceRelease to drop the false-positive rate.
  2. Separate FEATURE_COMPLETE from ABANDONED. A small utility with a clean license, zero open issues, and a stable test suite is done. Treating it as abandoned will block adoption of perfectly fine software.
  3. Run weekly, not quarterly. Abandonment is a weekly-cadence signal. Quarterly audits miss the four weeks where a NEWLY_ABANDONED dependency could break production.
  4. Filter on COLLAPSING trajectory, not just ABANDONED status. COLLAPSING is the early warning. ABANDONED is the post-mortem. The audit you actually want catches projects on the way down, not after they're cold.
  5. Use auto-partitioning for any category over 500 repos. The 1,000-result cap is where naive scans silently lose data. Partitioned scans return a query coverage confidence rating so you know what was actually swept.
  6. Read the diff, not the dataset. A 200-row weekly dataset is unreadable. The cross-run diff is 3-5 rows: the only ones that changed.
  7. Set excludeArchived: true for dependency audits. GitHub's archive flag is intentional abandonment by the maintainer — it's a positive signal, not a verdict. Filtering it out keeps the audit focused on uncertain cases.
  8. Cross-reference NEWLY_ABANDONED with CVE scans. Abandoned repos don't patch CVEs. The combination of NEWLY_ABANDONED status plus an open CVE is the highest-priority signal in a supply-chain audit.
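The CVE cross-reference in practice is a simple join. Here `audit_rows` is assumed to be the actor's per-repo dataset, and `cve_index` is a hypothetical external mapping from repo name to open CVE IDs — built from, say, an OSV or vendor export; the actor itself doesn't produce it.

```python
# Hypothetical join: NEWLY_ABANDONED repos that also carry an open CVE
# are the highest-priority slice of a supply-chain audit.

def priority_findings(audit_rows, cve_index):
    return [
        (r["fullName"], cve_index[r["fullName"]])
        for r in audit_rows
        if r.get("changeType") == "NEWLY_ABANDONED"
        and cve_index.get(r["fullName"])
    ]
```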

Common mistakes in abandonment detection

  • Mistake 1: Trusting pushedAt on its own. Dependabot bumps, Actions workflow updates, and bot-merged PRs all tick pushedAt. Date filters miss zombies entirely.
  • Mistake 2: Assuming the 1,000-result cap is per-page. It's per-query. Pagination stops at page 10. Sort-order changes give you a different 1,000, not a 2,000.
  • Mistake 3: Treating archived repos as a finding. isArchived: true is the maintainer explicitly saying "this is done". Filtering them out is the right default for adoption audits.
  • Mistake 4: Reading the full dataset weekly. A 1,000-repo audit produces a 1,000-row dataset every run. Nobody re-reads it. The diff is what matters.
  • Mistake 5: Running the audit once and forgetting. A green audit at adoption time decays in 90 days. Without scheduled re-evaluation the verdict is technical debt.
  • Mistake 6: Conflating "no activity" with "no maintenance". A feature-complete utility can run for years without commits. The audit needs to distinguish "stable and done" from "stalled and dying".

Worked example — dependency audit across 500 deps

A hypothetical platform engineering team at a mid-size SaaS company has a dependency tree of roughly 500 direct and transitive open-source packages. They want a weekly read on which ones are dying before they break production.

Before: One engineer maintained a Google Sheet listing every dependency, manually opened the GitHub page for each one quarterly, eyeballed the commit graph, and updated a "last reviewed" column. Total effort: roughly two engineer-days per quarter. Zombies and COLLAPSING projects routinely slipped through.

After: They scheduled the GitHub Repo Intelligence actor on Apify Schedules to run weekly. Mode: dependency-audit. Input: the 500 repo owner/name pairs fed into the compareRepos array (or a query-based scan with excludeArchived: true). compareToPreviousRun: true enabled. Cost: 500 repos × $0.15 = $75 per run. Weekly cadence = roughly $300/month.

Result: The Monday-morning artefact is a 5-bullet diff summary: usually a couple of SCORE_CHANGE entries, one STATUS_CHANGE, occasionally one NEWLY_ABANDONED. Total reading time: under five minutes. The two engineer-days per quarter became five minutes per week.

In the first month, the team caught one COLLAPSING flag on a logging library that had silently lost its sole maintainer. They migrated off it before any production impact. The single avoided incident — even at conservative cost — covered roughly two years of weekly run charges.

These numbers reflect one internal scenario. Results vary depending on dependency tree size, team rates, and how rigorous the manual baseline is.

How much does it cost to detect abandoned repos at scale?

The actor is priced at $0.15 per repository fetched (pay-per-event). Platform compute costs are included.

| Scenario | Repos | Cadence | Cost per run | Annual cost |
| --- | --- | --- | --- | --- |
| Quick audit (single dep tree) | 50 | One-off | $7.50 | — |
| Weekly dep tree monitoring | 200 | Weekly | $30 | $1,560 |
| Larger dep tree weekly | 500 | Weekly | $75 | $3,900 |
| Monthly dep tree (relaxed cadence) | 500 | Monthly | $75 | $900 |
| Category market-map | 5,000 | Quarterly | $750 | $3,000 |
| Full category sweep with auto-partition | 10,000 | One-off | $1,500 | — |

For comparison, manual evaluation by an engineer at loaded rates of roughly $150–250/hour costs roughly $50–200 per repo depending on depth — meaning a single 500-repo manual audit runs $25,000–$100,000 of engineering time. A weekly scheduled scan at $75/run covers a year of monitoring for roughly the cost of an afternoon of manual review.

Apify's free tier includes $5 of monthly credits — enough for around 33 repository intelligence reports at no cost during evaluation.

Set a spending limit in the Apify Console to cap charges per run. The actor stops and saves partial results when the limit is reached.

Implementation checklist

To set up scheduled abandonment detection at scale:

  1. Define scope. Single dependency tree (use compareRepos array with explicit owner/name pairs) or category audit (use a query plus autoPartitionResults: true)?
  2. Pick a solution mode. dependency-audit for supply-chain monitoring, market-map for category sweeps, repo-due-diligence for one-off deep dives.
  3. Get a free GitHub token. No scopes required. Triples throughput from 10 to 30 requests/minute.
  4. Configure auto-partition. Set autoPartitionResults: true for any query matching more than ~500 repos so the audit doesn't silently truncate.
  5. Enable cross-run diff. Set compareToPreviousRun: true. The first run baselines; every run after surfaces only what changed.
  6. Filter sensibly. excludeArchived: true keeps the audit focused on uncertain cases. excludeForks: true removes downstream copies that don't represent independent maintenance.
  7. Schedule weekly. Apify Schedules — pick a quiet time on Monday so the diff is ready before stand-up.
  8. Wire the diff into Slack or email. Apify's webhook or Zapier integration pipes the KV summary diff into a Slack channel or weekly digest.
  9. Cross-reference NEWLY_ABANDONED with CVE scans. Abandoned + open CVE is the highest-priority combination.

Example input JSON

{
    "compareRepos": [
        "facebook/react",
        "vuejs/vue",
        "sveltejs/svelte"
    ],
    "mode": "dependency-audit",
    "compareToPreviousRun": true,
    "excludeArchived": true,
    "githubToken": "ghp_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
}

That input scores three frameworks side-by-side, classifies each by maintenance status and trajectory, and tags any repo whose status flipped since the previous run.

For category-wide auditing:

{
    "query": "topic:vector-database",
    "mode": "market-map",
    "maxResults": 5000,
    "autoPartitionResults": true,
    "excludeArchived": true,
    "excludeForks": true,
    "compareToPreviousRun": true,
    "githubToken": "ghp_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
}

That input scans up to 5,000 vector-database repos with star-range partitioning, deduplicates across partitions, scores and classifies each, and flags newly abandoned ones since the prior weekly run.
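The star-range partitioning can be approximated against GitHub's public Search API, whose repository-search responses include a `total_count` field. The function below is an illustrative sketch, not the actor's actual implementation; the split strategy and star ceiling are assumptions.

```python
# Sketch of star-range partitioning past the Search API's 1,000-result
# per-query cap: recursively split the star range until every
# sub-query fits under the cap, then page each sub-query normally.
# total_count_fn could be implemented with
# GET https://api.github.com/search/repositories?q=<query>&per_page=1
# and reading the "total_count" field from the JSON response.

def partition_by_stars(total_count_fn, base_query, lo=0, hi=500_000):
    """Return (query, count) pairs, each matching <= 1,000 repos."""
    query = f"{base_query} stars:{lo}..{hi}"
    count = total_count_fn(query)
    if count <= 1000 or lo == hi:
        # Small enough to page through, or range can't split further.
        return [(query, count)]
    mid = (lo + hi) // 2
    return (partition_by_stars(total_count_fn, base_query, lo, mid)
            + partition_by_stars(total_count_fn, base_query, mid + 1, hi))
```

Deduplication across partitions still matters because star counts can shift between the count probe and the paged fetch.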

Limitations

  • Search index lag. GitHub's Search API doesn't instantly reflect new repos or updated star counts. Brand-new projects may be invisible for minutes to hours after creation.
  • 1,000-result hard cap per query segment. Auto-partitioning breaks past this by splitting star ranges, but very broad queries (e.g., bare language:python) still need thoughtful sub-querying. The query coverage report tells you when confidence is low.
  • No private repo access. Search API returns public repos only. Private repo abandonment audits need a different approach with explicit token scopes.
  • Forecasts are probabilistic. forecast.abandonmentRisk90d: "HIGH" is a weighted classification, not a certainty. Projects can flip trajectory fast.
  • Search index doesn't include code. Abandonment classification works on metadata — commits, contributors, releases, community files. Code-quality issues that don't show up at the metadata level need static analysis.
  • Email extraction varies by repo. Many maintainers use GitHub noreply addresses. When the audit also surfaces reachable maintainers, expect hit rates of 20-60%.

Key facts about abandonment detection

  1. GitHub's Search API hard-caps at 1,000 results per query — partitioning across star ranges with deduplication is the only way to sweep past it cleanly, up to 10,000 repos per run with the actor's autoPartitionResults.
  2. A repo can have 45,000 stars and be classified COLLAPSING with HIGH bus factor risk simultaneously — stars and abandonment are independent variables.
  3. Naive pushedAt-only filters miss roughly 40% of dead repos because Dependabot, Actions, and license-bot commits all bump the timestamp.
  4. Two of three open-source projects that commercial application teams depend on have become dormant, per Sonatype's 2024 State of the Software Supply Chain report.
  5. Unauthenticated GitHub Search API access is limited to 10 requests/minute; a free token (no scopes needed) raises this to 30/minute, a 3x throughput gain at zero cost.
  6. The five-stage maintenance enum (ACTIVE → STABLE → SLOWING → AT_RISK → ABANDONED) plus the trajectory enum (GROWING / STABLE / DECLINING / COLLAPSING / REVIVING) together produce verdicts that single-field date filters can't.
  7. At $0.15 per repo, decision-intelligence detection is roughly 100-1,000x cheaper for category-scale audits than manual technical review at engineering loaded rates.
  8. Cross-run diff (NEW, SCORE_CHANGE, STATUS_CHANGE, NEWLY_ABANDONED) reduces a weekly 1,000-row dataset to a 3-5-row Monday-morning read.

Short glossary

  • Bus factor — Number of contributors who could disappear before a project stalls. Low bus factor = high single-point-of-failure risk.
  • Zombie repo — A project showing recent activity that's bot-driven (Dependabot, Actions, license bots) rather than human-authored maintenance.
  • Trajectory — Directional classification: GROWING / STABLE / DECLINING / COLLAPSING / REVIVING. Direction, not state.
  • Decay score — A 0-100 measure of how fast a project is losing activity, with an associated velocity (NONE, SLOW, FAST).
  • NEWLY_ABANDONED — A change-type flag emitted on a scheduled run when a repo crosses the abandonment threshold since the previous run.
  • Auto-partition — Strategy that splits a broad GitHub Search query into sub-queries by star range to retrieve more than the 1,000-result API cap.
  • Feature-complete — A project that's intentionally low-activity because it's done — clean license, stable test suite, zero open issues. Distinct from abandoned.

Broader applicability

The pattern here — multi-signal classification + scheduled re-evaluation + diff-based reads — applies far beyond GitHub. The same five universal principles show up across B2B lead scoring, corporate due diligence, and Apify actor reliability monitoring:

  1. Convert signals into verdicts, not dashboards. A single status enum beats five separate metrics.
  2. Read several signals together so they correct each other. No single field carries enough information.
  3. Separate direction (trajectory) from state (status). Direction is the leading indicator.
  4. Re-evaluate on a schedule. A one-time read decays fast.
  5. Read the diff, not the dataset. Cross-run change detection is what makes scheduled runs actionable.

This is the same rolling-window thinking that drives Apify actor reliability monitoring — the fleet-wide useful signal isn't the current state, it's what changed since last time.

When you need this

Use scheduled at-scale abandonment detection when:

  • Auditing a dependency tree of more than ~50 packages
  • Mapping a technology category for adoption decisions
  • Doing supply-chain due diligence across hundreds of OSS dependencies
  • Monitoring a portfolio of investment-stage open-source projects
  • Running a vendor-replacement scan when a critical dependency starts to decay
  • Replacing a quarterly manual audit with a weekly automated read

You probably don't need this if:

  • You're evaluating a single repo and can eyeball it in five minutes
  • The dependency is in a private internal repo (Search API doesn't return it)
  • You only need a yes/no CVE check — that's a separate problem (use Snyk, Dependabot, or a similar scanner)
  • The repos are throwaway prototypes where abandonment risk doesn't matter
  • You already have full internal context and a CODEOWNERS-style relationship with the maintainer

How to detect abandoned GitHub repos in a dependency tree

Set mode: "dependency-audit", pass the explicit dependency list via compareRepos, enable compareToPreviousRun: true, and schedule weekly. The actor scores and classifies each repo, flags any that crossed the abandonment threshold since the prior run as NEWLY_ABANDONED, and surfaces a diff in the KV summary.

How to scan more than 1,000 GitHub repos at once

Set autoPartitionResults: true and increase maxResults up to 10,000. The actor recursively splits the query by star range, deduplicates across partitions, and returns a query coverage report so you know whether the partitioning swept the full category. A single one-off category sweep at 10,000 repos costs $1,500.

How to monitor for newly abandoned dependencies on a schedule

Schedule the actor weekly via Apify Schedules with compareToPreviousRun: true. The first run baselines. Every run after writes a cross-run diff to the KV store containing NEW, SCORE_CHANGE, STATUS_CHANGE, and NEWLY_ABANDONED entries. Pipe the diff to Slack or email via Apify's webhook integration so the audit team only reads what changed.
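A minimal sketch of that diff logic, assuming each run's stored state is a mapping of repo name to status and score. The field names mirror the tags above; the score threshold is illustrative:

```python
# Sketch of cross-run diffing: compare this run's classifications
# against the previous run's stored state and emit only change
# entries. Thresholds and field names are assumptions for
# illustration, not the actor's internals.

def diff_runs(previous, current, score_delta=5):
    """previous/current: {repo_name: {"status": str, "score": int}}."""
    changes = []
    for name, cur in current.items():
        prev = previous.get(name)
        if prev is None:
            changes.append({"repo": name, "change": "NEW"})
            continue
        if cur["status"] != prev["status"]:
            # A flip into ABANDONED gets its own high-priority tag.
            kind = ("NEWLY_ABANDONED" if cur["status"] == "ABANDONED"
                    else "STATUS_CHANGE")
            changes.append({"repo": name, "change": kind,
                            "from": prev["status"], "to": cur["status"]})
        elif abs(cur["score"] - prev["score"]) >= score_delta:
            changes.append({"repo": name, "change": "SCORE_CHANGE",
                            "from": prev["score"], "to": cur["score"]})
    return changes
```

Repos with no change emit nothing, which is what keeps the weekly read down to a handful of rows.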

How to compare repository health over time

Use the same scheduled-monitoring pattern. The diff entries surface SCORE_CHANGE for repos whose composite scores moved, STATUS_CHANGE for lifecycle stage flips, and NEWLY_ABANDONED for the formal threshold crossings. The previousState object on each repo gives the audit trail for compliance.

Common misconceptions

"A repo with recent commits is maintained." Not if those commits are bot-only. Dependabot version bumps, GitHub Actions workflow updates, and license-bot edits all bump pushedAt without anyone maintaining the project. The signal that catches this is isZombie: true.

"Abandoned means no commits for a year." That's the formal threshold (isAbandoned: true triggers at 365+ days of no push), but the operationally useful signal is COLLAPSING trajectory — abandonment in progress, before the formal threshold. By the time isAbandoned flips, the project has been functionally dead for months.

"Star count tells me which repos to worry about." Stars and abandonment are independent. A 45,000-star repo can be COLLAPSING. A 2,000-star repo can be GROWING. Filter by trajectory and decay velocity, not by star count.

"GitHub's Search API can scan a whole category if I paginate harder." It can't. The 1,000-result cap is per query, not per page. Pagination stops at page 10, and switching sort order gives you a different 1,000, not a bigger 1,000. Auto-partitioning across star ranges is the only clean fix.

"Archived repos are abandoned." They're not — they're intentionally feature-complete, with the maintainer explicitly saying so. For adoption audits, treat isArchived: true as a positive signal of intent and filter them out of the abandonment classification.

"One audit per quarter is enough." Abandonment is a weekly-cadence signal. Quarterly audits miss the eight weeks where a NEWLY_ABANDONED dependency could break production. Weekly diff-based monitoring is the operationally correct cadence.

Frequently asked questions

How do I detect abandoned GitHub repositories at scale?

Run a multi-signal classification across the target list (dependency tree or category) with compareToPreviousRun: true on a weekly schedule. The actor classifies each repo as ACTIVE / STABLE / SLOWING / AT_RISK / ABANDONED, scores trajectory, flags zombies, measures bus factor, and emits a NEWLY_ABANDONED tag for any repo that crossed the threshold since the last run. Read the KV-summary diff, not the dataset.

What signals matter beyond the last commit date?

maintenance.trajectory (COLLAPSING is the leading indicator), isZombie (catches bot-only activity), topContributorShare (bus factor concentration), latestRelease.daysSinceRelease (release cadence), decayScore and decayVelocity (how fast activity is dropping), and forecast.abandonmentRisk90d (forward projection). Read together they catch the ~40% of zombies and COLLAPSING projects that a pushedAt-only filter misses.
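A hedged sketch of how several such signals might combine into one verdict. The thresholds and the precedence order are illustrative assumptions; only the field names mirror those listed above:

```python
# Sketch: multi-signal classification. Each rule catches a case a
# pushedAt-only filter misses; earlier rules take precedence.
# Thresholds (365 days, 0.8 contributor share) are illustrative.

def classify(repo):
    """Map a dict of signals to a maintenance-status verdict."""
    if repo.get("daysSinceLastPush", 0) >= 365:
        return "ABANDONED"          # formal threshold
    if repo.get("trajectory") == "COLLAPSING" or repo.get("isZombie"):
        return "AT_RISK"            # abandonment in progress / bot-only
    if (repo.get("topContributorShare", 0) > 0.8
            and repo.get("decayVelocity") == "FAST"):
        return "AT_RISK"            # bus-factor concentration + decay
    if repo.get("trajectory") == "DECLINING":
        return "SLOWING"
    return "ACTIVE"
```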

Can I scan more than 1,000 repos in a single category audit?

Yes, with autoPartitionResults: true. The GitHub Search API hard-caps each query at 1,000 results, but the actor recursively splits broad queries by star ranges, deduplicates across partitions, and supports up to 10,000 repos per run. A query coverage report in the KV summary tells you whether the partitioning swept the full category.

How do zombie repos differ from abandoned repos?

A zombie repo has recent pushes (daysSinceLastPush is low) but no real human maintenance — commits are entirely from dependabot[bot], github-actions[bot], or similar automation. An abandoned repo has no pushes at all in 365+ days. Both are unsafe to depend on, but zombies are harder to catch because they pass naive date filters. The actor flags them with isZombie: true and a zombieSignals array.

How often should I rerun abandonment detection?

Weekly for production dependency trees in fast-moving categories (AI tooling, cryptography, web frameworks). Monthly for slower-moving categories. Quarterly is too long — projects can go from STABLE to ABANDONED in 8-12 weeks, and a quarterly audit will surface the change after it's already burnt you. Cross-run diff makes weekly cadence cheap to read.

What's the difference between ABANDONED and FEATURE_COMPLETE?

Abandoned projects are dying — declining commit cadence, growing issue backlog, stale releases, often a single maintainer who left. Feature-complete projects are intentionally done — clean license, stable test suite, zero open issues, often a tagged final release. Both have low recent activity, but they're operationally different. The actor distinguishes them with isFeatureComplete: true.

Does this replace CVE scanning?

No. They answer different questions. Abandonment detection answers "is this project alive and being maintained?" CVE scanning answers "does this specific version contain a known vulnerability?" Use both — abandonment classification to decide adoption and re-evaluation, CVE scanning to monitor what's already adopted. The combination of NEWLY_ABANDONED + an open CVE is the highest-priority signal in a supply-chain audit.

How much does it cost to monitor a 200-package dependency tree weekly?

200 repos × $0.15 per repo = $30 per run. Weekly cadence = roughly $1,560 a year. For comparison, at engineering loaded rates, a single production incident avoided by catching a NEWLY_ABANDONED dependency one week earlier easily covers multiple years of monitoring.

Can I get a diff of what changed since last week's run?

Yes — that's exactly what compareToPreviousRun: true produces. The actor stores state in a named KV store between runs and emits NEW / SCORE_CHANGE / STATUS_CHANGE / NEWLY_ABANDONED tags on each repo, plus a top-level diff summary in the KV store with counts and affected repo names. Pipe the summary to Slack or email via Apify webhooks for hands-off monitoring.


Ryan Clinton publishes Apify actors as ryanclinton and builds developer tools at ApifyForge.


Last updated: May 2026

This guide focuses on GitHub repositories, but the same patterns — multi-signal classification, scheduled re-evaluation, and diff-based reads — apply broadly to any fleet-scale lifecycle problem, from B2B lead scoring to corporate due diligence to portfolio-wide actor reliability monitoring.