Choosing the Right Driver for Your Load Profile
Driver selection is the single highest-leverage decision you make for queue performance. Every driver
has a fundamentally different I/O model and latency floor, which determines your maximum achievable
throughput before any other tuning matters.
The database driver uses SELECT FOR UPDATE SKIP LOCKED — which is efficient, but
you're still doing a round-trip to a relational database for every single job pop. Under heavy load
this creates contention on the jobs table, particularly around index scans. Expect
roughly 200–800 jobs/second per worker on PostgreSQL with proper indexing, far less on MySQL at
high concurrency.
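Conceptually, the pop is a locked select plus a reserve update. The sketch below shows only the pattern, not Laravel's actual DatabaseQueue internals:

```php
// Simplified sketch of a SKIP LOCKED job pop (illustrative, not Laravel's internals)
$job = DB::transaction(function () {
    $job = DB::table('jobs')
        ->where('queue', 'default')
        ->where('available_at', '<=', now()->getTimestamp())
        ->orderBy('id')
        ->lockForUpdate() // emits FOR UPDATE; SKIP LOCKED where the driver supports it
        ->first();

    if ($job) {
        // Mark the row reserved so other workers skip it
        DB::table('jobs')->where('id', $job->id)->update([
            'reserved_at' => now()->getTimestamp(),
            'attempts'    => $job->attempts + 1,
        ]);
    }

    return $job;
});
```

Every pop is at least one locked select plus one update inside a transaction, which is where the per-job round-trip cost and the index contention come from.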
The Redis driver uses atomic Lua scripts for job pop operations (EVAL + LPOP + ZADD).
Redis operates entirely in memory with sub-millisecond latency. A single Redis instance can sustain
50,000–100,000 operations/second, making it appropriate for anything beyond "a few hundred jobs per
minute." Redis is almost always the correct choice for production queue backends.
SQS is a managed, distributed queue with at-least-once delivery. It shines for multi-region
deployments and workloads that need AWS-native durability, but its HTTP polling model introduces
latency you cannot tune away. Enable long polling (up to a 20-second wait time) to cut
empty-poll API costs dramatically.
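For reference, a minimal SQS connection block using Laravel's standard config keys (the prefix URL and region are placeholders):

```php
// config/queue.php – SQS connection (placeholder account and region)
'sqs' => [
    'driver' => 'sqs',
    'key' => env('AWS_ACCESS_KEY_ID'),
    'secret' => env('AWS_SECRET_ACCESS_KEY'),
    'prefix' => env('SQS_PREFIX', 'https://sqs.us-east-1.amazonaws.com/your-account-id'),
    'queue' => env('SQS_QUEUE', 'default'),
    'region' => env('AWS_DEFAULT_REGION', 'us-east-1'),
    'after_commit' => true, // dispatch only after the surrounding DB transaction commits
],
```

Note that long polling itself is configured on the queue in AWS (the ReceiveMessageWaitTimeSeconds attribute), not in this file.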
// config/queue.php – production Redis tuning
'redis' => [
'driver' => 'redis',
'connection' => 'queue', // separate Redis connection for queues
'queue' => env('REDIS_QUEUE', 'default'),
'retry_after' => 90, // seconds before a reserved job is re-attempted
'block_for' => 5, // seconds to block on BLPOP before re-checking
'after_commit' => true, // only dispatch after DB transaction commits
],
// config/database.php – dedicated queue Redis connection
'connections' => [
'queue' => [
'url' => env('REDIS_URL'),
'host' => env('REDIS_HOST', '127.0.0.1'),
'password' => env('REDIS_PASSWORD'),
'port' => env('REDIS_PORT', '6379'),
'database' => 1, // use DB 1, not DB 0 (separate keyspace)
'options' => [
'parameters' => [
'tcp_keepalive' => 60,
],
],
],
],
The block_for option is critical: it controls the BLPOP timeout. With null
(default), Laravel falls back to polling with usleep(), which wastes CPU. Setting it
to 5 means the worker will efficiently block and wake up only when a job arrives, reducing both
CPU usage and latency simultaneously.
The retry_after value must always exceed your longest expected job runtime by a safe
margin. If a job takes 60 seconds and retry_after is 90, you have a 30-second buffer
before a slow worker causes duplicate processing. Set it conservatively high — a reserved job
sitting for too long is far less costly than the same job running twice.
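The usual way to enforce that margin is to pair retry_after with a per-job timeout, so a hung job is killed well before the queue re-reserves it. A sketch with illustrative values:

```php
// App\Jobs\GenerateReportJob.php – illustrative numbers
class GenerateReportJob implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable;

    // The worker kills this job after 60s. Keep timeout comfortably below
    // the connection's retry_after (90s here) so a slow job is terminated
    // before another worker can pick it up again.
    public int $timeout = 60;
}
```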
Driver selection rule: database for <500 jobs/min, Redis for <50k jobs/min, SQS when you
need AWS-native cross-region durability or serverless workers. Never use database driver with
Horizon — Horizon is Redis-only.
Retry Interval Math – Exponential, Linear & Jitter
Retry strategy is not just about giving a job "another chance." Naive retry strategies (constant
interval) cause thundering-herd problems: every failed job retries at the same moment, spiking
load on the external service that caused the failure in the first place. This section covers the
math behind each approach.
Linear Backoff
Retry after N seconds, then 2N, then 3N... Linear backoff is simple but still clusters retries.
With 100 concurrent failures and a 30-second interval, all 100 jobs retry simultaneously at T+30s.
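In Laravel, a linear schedule is just an array returned from the job's backoff() method, one delay per retry (the last value repeats if attempts remain):

```php
// 30s before retry 1, 60s before retry 2, 90s before retry 3: N, 2N, 3N
public function backoff(): array
{
    return [30, 60, 90];
}
```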
Exponential Backoff
Retry after base^attempt seconds. With base=2: 2s, 4s, 8s, 16s, 32s... The gap between attempts
grows exponentially, but without jitter every failed job still retries at the same deterministic
instant. A common convention (used throughout this section) is
2^(attempts - 1) * 10 seconds (10s, 20s, 40s, 80s...).
Exponential Backoff with Full Jitter
This is the recommended AWS architecture for any retry loop. The formula is:
random(0, min(cap, base * 2^attempt)) where cap is a maximum wait ceiling.
The randomization ensures that even with 10,000 simultaneous failures, retries are spread
uniformly across the window rather than synchronized.
// App\Jobs\CallExternalApiJob.php
class CallExternalApiJob implements ShouldQueue
{
use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;
public int $tries = 6;
public int $maxExceptions = 3; // fail after 3 thrown exceptions; manual release() does not count toward this
/**
* Exponential backoff with full jitter.
* Attempt 1: 0–10s, Attempt 2: 0–20s, Attempt 3: 0–40s...
* Cap at 10 minutes (600s) to prevent multi-hour delays.
*/
public function backoff(): array
{
return $this->exponentialJitterBackoff(attempts: 6, base: 10, cap: 600);
}
private function exponentialJitterBackoff(int $attempts, int $base, int $cap): array
{
$delays = [];
for ($i = 1; $i <= $attempts; $i++) {
$ceiling = min($cap, $base * (2 ** ($i - 1))); // attempt 1 → base, matching the docblock
$delays[] = random_int(0, $ceiling);
}
return $delays;
}
public function handle(ExternalApiClient $client): void
{
$response = $client->post('/endpoint', $this->payload);
// Treat 429 (rate limit) differently — respect the Retry-After header
if ($response->status() === 429) {
$retryAfter = (int) ($response->header('Retry-After') ?: 60); // fall back to 60s if absent
$this->release($retryAfter); // manually release with specific delay
return;
}
if ($response->failed()) {
throw new \RuntimeException("API call failed: {$response->status()}");
}
}
}
The release() call is important for rate-limit handling: you hand the job back to the queue
with the exact delay the server requested. Note that a released job still increments the attempt
counter toward $tries, but it does not count toward $maxExceptions, which is precisely why this
job sets both limits.
// Comparing backoff strategies for 100 simultaneous failures
// After attempt 3:
// Constant (30s): all 100 retry at exactly T+30 → spike
// Exponential (2^(n-1) * 10): all 100 retry at exactly T+40 → spike, just later
// Exponential + full jitter: 100 jobs spread uniformly across [0, 40s] → no spike
// With full jitter, the expected delay is half the exponential ceiling:
// E[retry_after] = ceiling / 2 = 40 / 2 = 20s average wait
// vs the deterministic 40s → jitter also halves the average wait
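You can see the spread empirically with a standalone PHP sketch (numbers illustrative; the 40s ceiling matches attempt 3 in the job's docblock above):

```php
// Simulate 100 simultaneous failures retrying with full jitter after attempt 3
$ceiling = 40;
$retryTimes = [];
for ($i = 0; $i < 100; $i++) {
    $retryTimes[] = random_int(0, $ceiling); // full jitter: uniform over [0, ceiling]
}

// Bucket into 10-second windows to show the roughly uniform spread
$buckets = array_fill(0, 4, 0);
foreach ($retryTimes as $t) {
    $buckets[min(3, intdiv($t, 10))]++;
}

printf("10s buckets: %s\n", implode(' / ', $buckets));
printf("average wait: %.1fs (theory: %.1fs)\n", array_sum($retryTimes) / 100, $ceiling / 2);
```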
Worker Connection Pool Tuning
Every queue worker process maintains its own set of long-lived connections — to the database,
Redis, and any other services it uses. Understanding how these connections are managed is
essential for preventing both connection exhaustion and idle resource waste.
A worker running with --sleep=0 and processing 10 jobs per second will hold
its database connection open for the entire lifetime of the process. If you run 20 workers
on a single server, you're consuming 20 persistent database connections just for the queue.
At 10 servers, that's 200 connections before your application servers even connect.
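You can check that budget against the server's ceiling from artisan tinker; SHOW STATUS and SHOW VARIABLES are standard MySQL commands:

```php
// Compare live connection usage with the server limit
$used = (int) DB::select("SHOW STATUS LIKE 'Threads_connected'")[0]->Value;
$max  = (int) DB::select("SHOW VARIABLES LIKE 'max_connections'")[0]->Value;

printf("Using %d of %d connections (%.0f%%)\n", $used, $max, 100 * $used / $max);
// Rule of thumb: workers × servers + web pool + admin headroom should stay under max_connections
```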
// config/database.php – production connection settings for workers
'mysql' => [
'driver' => 'mysql',
'host' => env('DB_HOST'),
'database' => env('DB_DATABASE'),
'username' => env('DB_USERNAME'),
'password' => env('DB_PASSWORD'),
'options' => [
PDO::ATTR_TIMEOUT => 5,
PDO::ATTR_PERSISTENT => false, // workers should NOT use persistent connections
PDO::MYSQL_ATTR_INIT_COMMAND => "SET time_zone='+00:00'",
],
// NOTE: stock Laravel (PDO) does not pool connections. Each worker process
// holds exactly one connection per database; 'pool'-style options belong to
// Swoole-based runtimes (Octane, Hyperf), not to queue:work workers.
],
Workers lose their database connection when the MySQL server times out an idle connection
(wait_timeout defaults to 8 hours, but is often tuned much lower on managed services). Laravel
detects the common "server has gone away" errors, reconnects, and retries the query once, unless
the failure happens inside a transaction, where it must throw. You can still force a proactive
health check at the start of each job:
// App\Jobs\Concerns\RefreshesConnections.php
trait RefreshesConnections
{
public function handle(): void
{
// Force reconnect if connection has gone away
try {
DB::connection()->getPdo();
} catch (\Exception) {
DB::reconnect();
}
$this->process(); // implemented by the job
}
}
// Alternatively — configure the Horizon supervisor to restart workers periodically:
// This is cleaner than manual reconnection logic
// horizon.php
'supervisors' => [
'default' => [
'maxProcesses' => 10,
'balanceMaxShift' => 1,
'balanceCooldown' => 3,
// restart workers after N jobs / N seconds to clear leaked state
'maxJobs' => 1000,
'maxTime' => 3600,
],
],
# Supervisor config for workers with connection recycling
# /etc/supervisor/conf.d/laravel-worker.conf
[program:laravel-worker]
process_name=%(program_name)s_%(process_num)02d
command=php /var/www/artisan queue:work redis \
--sleep=3 \
--tries=3 \
--max-time=3600 \
--max-jobs=1000 \
--memory=256
autostart=true
autorestart=true
numprocs=8
redirect_stderr=true
stdout_logfile=/var/log/worker.log
# --max-time=3600 restart after 1 hour (clears memory leaks, refreshes connections)
# --max-jobs=1000 restart after 1000 jobs (prevents slow memory growth)
The --max-jobs and --max-time flags are the cleanest way to handle
connection recycling. By restarting workers periodically, you guarantee fresh connections
and prevent the cumulative memory growth that happens when jobs leave small allocations
behind across thousands of executions.
PHP OPcache for Long-Running Workers
Standard web-request PHP processes are short-lived: OPcache compiles bytecode once, serves
the request, and the process dies. Queue workers are the opposite — they're long-running
processes that handle thousands of jobs within a single PHP process lifetime.
This means OPcache revalidation settings that are acceptable for web servers can actively
harm worker performance. Every time OPcache checks whether a file has changed on disk,
it performs a stat() syscall. For workers, this is pure overhead since you
never deploy mid-job.
; /etc/php/8.3/cli/conf.d/20-opcache-worker.ini
; For queue workers specifically (PHP CLI)
[opcache]
opcache.enable=1
opcache.enable_cli=1 ; critical — OPcache disabled for CLI by default!
opcache.memory_consumption=256 ; MB — tune based on your application size
opcache.interned_strings_buffer=32 ; MB for interned strings (class names, keys, etc.)
opcache.max_accelerated_files=20000 ; increase if you have many files (vendor/)
opcache.revalidate_freq=0 ; never revalidate — workers don't hot-reload
opcache.validate_timestamps=0 ; skip stat() calls entirely
opcache.save_comments=1 ; needed for doctrine annotations / phpdoc parsing
opcache.file_cache=/var/tmp/opcache ; persist bytecode across restarts (CLI cache is per-process; directory must exist)
opcache.jit_buffer_size=128M ; PHP 8+ JIT — significant gain for CPU-bound jobs
opcache.jit=tracing ; tracing JIT — best for long-running CPU work
The opcache.enable_cli=1 line is the most commonly missed setting: PHP CLI
processes (which is what queue:work runs as) have OPcache disabled by default.
In a long-running worker each file is compiled at most once per process regardless, so the
practical wins are elsewhere. The JIT only operates with OPcache active, and pairing
enable_cli with opcache.file_cache lets recycled workers (--max-jobs / --max-time)
skip recompiling the application at startup. The size of the gain depends on your codebase
and on how CPU-bound your jobs are.
// Verifying OPcache is active in workers
// Add this to a diagnostic job or artisan command:
class DiagnoseWorkerCommand extends Command
{
protected $signature = 'worker:diagnose';
public function handle(): void
{
$status = opcache_get_status(false); // false = omit per-script details

if ($status === false) {
    $this->error('OPcache is not active in this worker process.');
    return;
}
$this->table(['Setting', 'Value'], [
['OPcache enabled', $status['opcache_enabled'] ? 'Yes' : 'No'],
['Cached scripts', $status['opcache_statistics']['num_cached_scripts']],
['Cache hits', $status['opcache_statistics']['hits']],
['Cache misses', $status['opcache_statistics']['misses']],
['Memory used (MB)', round($status['memory_usage']['used_memory'] / 1024 / 1024, 2)],
['Memory free (MB)', round($status['memory_usage']['free_memory'] / 1024 / 1024, 2)],
['JIT enabled', $status['jit']['enabled'] ?? false ? 'Yes' : 'No'],
]);
}
}
Since workers are long-running, OPcache will not pick up code changes after deployment.
This is correct behaviour: you should restart workers after deployment anyway
(php artisan queue:restart). This sends a signal via the cache driver telling
all workers to finish their current job and exit gracefully, after which Supervisor restarts them.
Measuring & Benchmarking Throughput
You cannot tune what you cannot measure. Before making any performance changes, establish a
baseline throughput number: jobs processed per second per worker. Then measure again after
each change. Without this discipline, you may be optimizing something that has no real impact.
// Benchmarking baseline — dispatch N no-op jobs and time them
class BenchmarkQueueCommand extends Command
{
protected $signature = 'queue:benchmark {count=1000} {--queue=benchmark}';
public function handle(): void
{
$count = (int) $this->argument('count');
$queue = $this->option('queue');
// Dispatch N no-op jobs
$this->info("Dispatching {$count} benchmark jobs...");
$dispatchStart = microtime(true);
for ($i = 0; $i < $count; $i++) {
NoOpBenchmarkJob::dispatch()->onQueue($queue);
}
$dispatchTime = microtime(true) - $dispatchStart;
$this->info(sprintf('Dispatched %d jobs in %.2fs (%.0f/s)',
$count,
$dispatchTime,
$count / $dispatchTime
));
// Now process them and measure throughput
// Run: time php artisan queue:work --queue=benchmark --stop-when-empty
// then divide the job count by the elapsed time (or watch Horizon's throughput chart)
}
}
// App\Jobs\NoOpBenchmarkJob.php
class NoOpBenchmarkJob implements ShouldQueue
{
use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;
public function handle(): void
{
// Intentionally empty — measures queue infrastructure overhead only
}
}
# Real-time throughput monitoring via Redis
# Count jobs processed in the last 60 seconds using Redis keyspace events
# In redis-cli:
MONITOR # shows every command in real time; useful for spot checks, too costly to leave running in production
# Better: use Horizon's built-in metrics. Schedule `php artisan horizon:snapshot`
# every five minutes and read per-queue/per-job throughput from the Metrics tab
# Measure queue depth changes over time
watch -n1 'redis-cli LLEN "queues:default"'
# Full queue stats via Horizon API
curl http://your-app.test/horizon/api/stats | jq .
Beyond raw throughput, track these metrics in production:
Queue depth (lag) — how many jobs are waiting. A growing queue means workers cannot keep up with dispatch rate.
Job processing time (p50, p95, p99) — median hides tail latency. A p99 of 30s when p50 is 100ms means 1% of jobs are stalling workers.
Failed job rate — percentage of jobs that fail after all retries. Sustained failure rates above 0.1% warrant investigation.
Worker memory growth rate — plot worker RSS over time. Linear growth indicates a memory leak in job logic.
// Instrumenting job processing time with Telescope or custom metrics
// App\Jobs\Concerns\TracksExecutionTime.php
trait TracksExecutionTime
{
public function middleware(): array
{
return [
new class { // Laravel job middleware is duck-typed: no interface, just handle($job, $next)
public function handle(mixed $job, callable $next): void
{
$start = microtime(true);
$next($job);
$duration = microtime(true) - $start;
// Push to your metrics system (Prometheus, Datadog, etc.)
app('metrics')->histogram(
name: 'queue_job_duration_seconds',
value: $duration,
labels: ['job' => class_basename($job)]
);
}
}
];
}
}
Payload Size Optimization
Every job you dispatch is serialized and stored in your queue backend. The size of that
payload affects network transfer to Redis/SQS, storage footprint in the backend, deserialization
time, and, critically with SQS, headroom under the hard 256 KB message limit.
The most common payload bloat comes from serializing full data structures instead of references.
When you dispatch a job that holds an Eloquent model (or an Eloquent collection), the
SerializesModels trait stores only the class name and primary key(s), not the full row
data, and re-fetches the model when the job runs. Plain PHP arrays, DTOs, and other nested
objects receive no such treatment: they are serialized wholesale into the payload.
// BAD: passes the entire array of data unnecessarily
class ProcessOrderJob implements ShouldQueue
{
public function __construct(
private array $orderData, // could be 50KB+ of serialized array
private array $customerData, // more unnecessary data
private array $productList, // potentially hundreds of products
) {}
}
// GOOD: pass only IDs — let the job fetch what it needs
class ProcessOrderJob implements ShouldQueue
{
use SerializesModels;
public function __construct(
private readonly int $orderId, // 8 bytes
) {}
public function handle(OrderRepository $orders): void
{
$order = $orders->findWithRelations($this->orderId);
// fetch exactly the relations you need, nothing more
}
}
// For batches — use IDs, not models
Bus::batch(
$orderIds->map(fn($id) => new ProcessOrderJob($id))->toArray()
)->dispatch();
// Auditing job payload sizes as jobs are processed
// (job middleware is duck-typed in Laravel; there is no interface to implement)
class PayloadSizeAuditMiddleware
{
public function handle(mixed $job, callable $next): void
{
$payload = serialize($job);
$size = strlen($payload);
if ($size > 10_240) { // 10 KB warning threshold
logger()->warning('Large job payload detected', [
'job' => get_class($job),
'size_kb' => round($size / 1024, 2),
'threshold' => '10 KB',
]);
}
$next($job);
}
}
// SQS hard limit math:
// SQS max message size: 256 KB = 262,144 bytes
// Laravel's JSON job envelope (class name, uuid, attempts, ...) uses part of that budget
// Binary payloads must be text-encoded (e.g. base64 → 4/3 expansion), shrinking it further
// If you need larger — use S3 Extended Client pattern:
// 1. Store payload in S3
// 2. Pass only the S3 key in the SQS message
// 3. Job downloads payload from S3 in handle()
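A sketch of that pattern in Laravel terms. The class, helper, and key prefix are hypothetical, and cleanup deliberately happens only after a successful run so retries can still read the payload:

```php
// Hypothetical S3 pointer job: the SQS message carries only $s3Key
class LargeImportJob implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable;

    public function __construct(private readonly string $s3Key) {}

    // Convenience helper: park the payload in S3, queue only the key
    public static function dispatchWithPayload(array $payload): void
    {
        $key = 'queue-payloads/'.Str::uuid();
        Storage::disk('s3')->put($key, json_encode($payload));
        static::dispatch($key);
    }

    public function handle(): void
    {
        $payload = json_decode(Storage::disk('s3')->get($this->s3Key), true);
        // ... process $payload ...
        Storage::disk('s3')->delete($this->s3Key); // only after success
    }
}
```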
For jobs that genuinely need large data (e.g., a list of 10,000 user IDs for a bulk operation),
use the chunking pattern instead of passing the full list:
// Instead of dispatching one job with 10,000 IDs:
// Dispatch 100 jobs with 100 IDs each
$userIds
->chunk(100)
->each(fn ($chunk) => ProcessUserBatchJob::dispatch($chunk->values()->toArray()));
// For very dynamic payloads — store in Redis/DB and pass a reference key
class LargePayloadJob implements ShouldQueue
{
public function __construct(
private readonly string $payloadKey // Redis key
) {}
public function handle(): void
{
$payload = Cache::get($this->payloadKey);
if (! $payload) {
// Payload expired — job cannot run, fail gracefully
$this->fail(new \RuntimeException("Payload key {$this->payloadKey} expired"));
return;
}
// Process...
// Clean up
Cache::forget($this->payloadKey);
}
}
Conclusion
Queue performance tuning is a stack of decisions, not a single setting. The hierarchy of impact,
from highest to lowest leverage, looks like this:
Driver selection — Redis vs database is a 10–50x throughput difference. Everything else is incremental.
OPcache for CLI — opcache.enable_cli=1 costs nothing to enable, unlocks the JIT, and makes recycled workers start fast; the exact gain is workload-dependent.
Payload size — pass IDs, not models. Eliminate any data the job doesn't need at dispatch time.
Retry strategy — exponential backoff with full jitter prevents thundering-herd at external service boundaries.
Connection management — recycle workers with --max-jobs and --max-time rather than manual reconnection logic.
Benchmarking — measure before and after every change. Intuition about bottlenecks is frequently wrong.
Most production queue problems are not caused by Laravel itself — they're caused by jobs doing
too much, payloads carrying too much data, or retry strategies that amplify failure rather than
gracefully absorbing it. Fix those three things first.