Job Queue Statistics - Spotify for Backstage

Overview

Job Queue Statistics is a scheduled monitoring task that logs aggregate statistics about Soundcheck's job queues. It provides visibility into fact collection throughput, queue backlogs, and worker processing rates.

What it Reports

For each active queue (with non-zero activity):

Waiting: Jobs queued but not yet started
Active: Jobs currently being processed by workers
Completed: Jobs successfully processed since last report (incremental count)
Failed: Jobs that failed since last report (incremental count)
Delayed: Jobs scheduled for future execution (BullMQ only)
Jobs by Type: Breakdown of queued jobs by collector/check type

Configuration

Add to app-config.yaml:

soundcheck:
  job:
    statistics:
      # Enable queue statistics reporting
      enabled: true

      # Cron expression for reporting frequency (default: every 15 minutes)
      reportingFrequencyCron: '*/15 * * * *'

Log Output

Statistics are logged to the backend logger at info level:

=== Job Queue Statistics ===
Total queues: 8 (3 active)

Queue: scm
  Waiting: 64715
  Active: 1
  Completed: 2341
  Failed: 12
  Delayed: 0
  Jobs by type:
    soundcheck/collector/scm/0/scm:default/required_files_exist: 32361
    soundcheck/collector/scm/1/scm:default/api-report-has-no-edit-warning: 32354

Queue: github
  Waiting: 156
  Active: 2
  Completed: 847
  Failed: 3
  Jobs by type:
    soundcheck/collector/github/0/github:default/branch-protection: 156

=== End Job Queue Statistics ===
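To feed these numbers into a dashboard or alert, the report block can be parsed back into structured data. The following is a minimal sketch, not part of Soundcheck: the function name `parse_queue_statistics` is hypothetical, and it assumes the report text has already been extracted from the backend log with any logger prefix (timestamp, level, plugin name) stripped, so that lines look exactly like the sample above.

```python
import re
from typing import Any, Dict

# Counter lines such as "  Waiting: 64715"; names taken from the sample report.
COUNTER_RE = re.compile(r"^\s*(Waiting|Active|Completed|Failed|Delayed):\s*(\d+)\s*$")
# Job-type lines end in ": <count>"; the type name itself may contain colons.
JOB_TYPE_RE = re.compile(r"^\s+(\S+):\s*(\d+)\s*$")

SAMPLE = """\
=== Job Queue Statistics ===
Total queues: 8 (3 active)

Queue: scm
  Waiting: 64715
  Active: 1
  Completed: 2341
  Failed: 12
  Delayed: 0
  Jobs by type:
    soundcheck/collector/scm/0/scm:default/required_files_exist: 32361
    soundcheck/collector/scm/1/scm:default/api-report-has-no-edit-warning: 32354

Queue: github
  Waiting: 156
  Active: 2
  Completed: 847
  Failed: 3
  Jobs by type:
    soundcheck/collector/github/0/github:default/branch-protection: 156

=== End Job Queue Statistics ===
"""


def parse_queue_statistics(report: str) -> Dict[str, Dict[str, Any]]:
    """Parse one report block into {queue: {counter: int, 'jobs_by_type': {...}}}."""
    queues: Dict[str, Dict[str, Any]] = {}
    current = None
    in_job_types = False
    for line in report.splitlines():
        stripped = line.strip()
        if stripped.startswith("Queue:"):
            # A new queue section starts; reset per-queue state.
            current = {"jobs_by_type": {}}
            queues[stripped[len("Queue:"):].strip()] = current
            in_job_types = False
        elif current is None or not stripped:
            continue  # header lines before the first queue, or blank lines
        elif stripped == "Jobs by type:":
            in_job_types = True
        elif in_job_types:
            match = JOB_TYPE_RE.match(line)
            if match:
                current["jobs_by_type"][match.group(1)] = int(match.group(2))
        else:
            match = COUNTER_RE.match(line)
            if match:
                current[match.group(1).lower()] = int(match.group(2))
    return queues


stats = parse_queue_statistics(SAMPLE)
print(stats["scm"]["waiting"], stats["github"].get("delayed"))
```

A queue that does not report a counter (such as Delayed on the github queue in the sample) simply has no key for it, so callers should use `.get()` rather than indexing.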

Understanding the Metrics

Incremental counts: Completed and Failed counts reset after each report, showing jobs processed in the reporting interval
Snapshot counts: Waiting, Active, and Delayed show current queue state
Idle queues filtered: Only queues with activity are shown to reduce log noise
Job type naming: Format is soundcheck/collector/{collector}/{priority}/{namespace}:{scope}/{check-id}
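The job type naming format above can be split into its components when grouping or filtering jobs. This is an illustrative sketch only: `JobType` and `parse_job_type` are hypothetical names, and the field names simply mirror the placeholders in the documented format.

```python
from typing import NamedTuple


class JobType(NamedTuple):
    collector: str
    priority: str
    namespace: str
    scope: str
    check_id: str


def parse_job_type(name: str) -> JobType:
    """Split soundcheck/collector/{collector}/{priority}/{namespace}:{scope}/{check-id}."""
    # Limit the split so a check id containing "/" stays in one piece.
    parts = name.split("/", 5)
    if len(parts) != 6 or parts[:2] != ["soundcheck", "collector"]:
        raise ValueError(f"unexpected job type name: {name!r}")
    _, _, collector, priority, qualifier, check_id = parts
    namespace, sep, scope = qualifier.partition(":")
    if not sep:
        raise ValueError(f"missing namespace:scope segment in {name!r}")
    return JobType(collector, priority, namespace, scope, check_id)


job = parse_job_type("soundcheck/collector/scm/0/scm:default/required_files_exist")
print(job.collector, job.check_id)
```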

Use Cases

Capacity planning: Identify if Soundcheck pods/workers are struggling to keep up with job volume
Bottleneck detection: High waiting counts may indicate a need to add pods/workers or reduce job frequency
Error monitoring: Failed job counts reveal systematic collection issues
Performance validation: Verify expected throughput after worker configuration changes
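For capacity planning, the incremental and snapshot counts can be combined into rough throughput, backlog, and failure figures. The sketch below uses hypothetical helper names and makes two simplifying assumptions: the completion rate stays constant, and no new jobs are enqueued while the backlog drains. Treat the result as a lower bound on drain time, not a prediction.

```python
from typing import Optional


def drain_estimate_minutes(
    waiting: int, completed: int, interval_minutes: float = 15.0
) -> Optional[float]:
    """Minutes needed to clear `waiting` jobs at the last interval's completion rate."""
    if completed <= 0:
        return None  # nothing completed in the interval, so no rate to extrapolate
    rate_per_minute = completed / interval_minutes
    return waiting / rate_per_minute


def failure_ratio(completed: int, failed: int) -> Optional[float]:
    """Share of jobs finished in the interval that failed."""
    finished = completed + failed
    return failed / finished if finished else None


# Values from the scm queue in the sample report, with the default 15-minute interval.
minutes = drain_estimate_minutes(waiting=64715, completed=2341)
print(f"scm backlog: about {minutes / 60:.1f} hours at the current rate")
print(f"scm failure ratio: {failure_ratio(2341, 12):.2%}")
```

With the sample numbers, the scm queue completes roughly 156 jobs per minute, so its backlog of 64715 jobs would take close to seven hours to clear even if nothing new were enqueued.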

Troubleshooting

No logs appearing

Verify that soundcheck.job.statistics.enabled is set to true in app-config.yaml
Check that the backend logger level allows info messages
Confirm that at least one queue has activity (the report is silent when all queues are idle)

High waiting counts

Review worker configuration at soundcheck.job.workers.{worker-name}.concurrency
Check rate limiter settings at soundcheck.job.workers.{worker-name}.limiter.max and .duration
Consider switching from local queues to Redis queues for global rate limiting across instances
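Before raising concurrency, it can help to check whether the rate limiter itself caps throughput below what the backlog needs. The sketch below assumes the limiter follows BullMQ semantics, meaning at most `max` jobs are started per `duration` milliseconds; that interpretation, the helper names, and the example values of 10 and 1000 are assumptions and should be checked against your Soundcheck version and your own configuration.

```python
def limiter_ceiling_per_minute(max_jobs: int, duration_ms: int) -> float:
    """Upper bound on jobs per minute allowed by a max/duration rate limiter."""
    if max_jobs <= 0 or duration_ms <= 0:
        raise ValueError("max_jobs and duration_ms must be positive")
    return max_jobs * 60_000 / duration_ms


def min_drain_minutes(waiting: int, max_jobs: int, duration_ms: int) -> float:
    """Fastest possible backlog drain time if the limiter is the only constraint."""
    return waiting / limiter_ceiling_per_minute(max_jobs, duration_ms)


# Hypothetical limiter: 10 jobs per 1000 ms, applied to the scm backlog from the sample.
ceiling = limiter_ceiling_per_minute(max_jobs=10, duration_ms=1000)
print(f"limiter ceiling: {ceiling:.0f} jobs/minute")
print(f"minimum drain time: {min_drain_minutes(64715, 10, 1000):.0f} minutes")
```

If the observed completion rate is already near this ceiling, adding concurrency will not help; the limiter settings or the job frequency would need to change instead.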

Persistent failed counts

Review backend logs for job failure stack traces
Common causes: API rate limits, network timeouts, invalid check configurations
By default, failed jobs are not retried automatically
