Performance
ArdiQ’s edge isn’t a single number — it’s the balance. Because the worker loop and every Redis round-trip run in Rust, off the GIL, ArdiQ delivers near-top throughput at the lowest memory of any fast queue, which gives it the best throughput-to-memory ratio in the field.
The numbers below come from an apples-to-apples suite that runs six Redis-backed Python queues through the same scenarios on the same machine. It’s open and reproducible: python-task-queue-benchmarks.
Efficiency: throughput per MB
The metric that captures the whole trade-off is how much work a queue does per megabyte of memory it holds. ArdiQ leads it on CPU work and shares the lead with arq on I/O.
| Queue | I/O (tasks/s per MB) | CPU (tasks/s per MB) |
|---|---|---|
| ArdiQ 🦀 | 2.9 | 11.2 |
| arq | 2.9 | 10.5 |
| Streaq | 1.9 | 7.2 |
| Taskiq | 1.1 | 4.1 |
| Celery | 1.4 | 0.3 |
| Dramatiq | 1.7 | 0.2 |
ArdiQ does the most work per megabyte of any queue tested — roughly 2.7× Taskiq’s I/O efficiency. (arq matches it on I/O efficiency, but only by running ~10% slower; ArdiQ stays this light while sitting near the throughput ceiling.)
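The ratio is plain division, so it can be rechecked from the throughput and memory figures in the I/O-bound and CPU-bound tables further down this page. The sketch below is not part of the benchmark suite; the names `RESULTS`, `efficiency`, and `efficiency_table` are made up for illustration, and the inputs are the rounded values printed on this page.

```python
# (throughput in tasks/s, memory in MB), copied from the I/O-bound and
# CPU-bound tables on this page. Memory is rounded to whole MB there.
RESULTS = {
    "ArdiQ":    {"io": (98.6, 34), "cpu": (389.3, 34)},
    "arq":      {"io": (87.7, 30), "cpu": (317.6, 30)},
    "Streaq":   {"io": (93.4, 48), "cpu": (353.8, 49)},
    "Taskiq":   {"io": (97.9, 92), "cpu": (388.1, 94)},
    "Celery":   {"io": (71.7, 51), "cpu": (13.8, 52)},
    "Dramatiq": {"io": (93.5, 56), "cpu": (13.8, 56)},
}


def efficiency(throughput: float, memory_mb: float) -> float:
    """Tasks per second delivered per megabyte of worker memory."""
    return throughput / memory_mb


def efficiency_table(results):
    """Apply efficiency() to every (queue, scenario) pair."""
    return {
        name: {scenario: efficiency(*pair) for scenario, pair in row.items()}
        for name, row in results.items()
    }


if __name__ == "__main__":
    for name, row in efficiency_table(RESULTS).items():
        print(f"{name:<9} io={row['io']:.1f}  cpu={row['cpu']:.1f}")
```

Because the inputs are rounded, the recomputed values can differ from the table above by a few tenths; the published figures presumably come from unrounded measurements in the benchmark repo.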
Test setup
- 1,000 tasks, 1 worker process, 10 concurrent tasks, 3 iterations; metrics reported as mean ± std.
- Two scenarios:
  - io_task: a 100 ms sleep (asyncio.sleep for async libs, time.sleep for sync).
  - cpu_task: 1,000 SHA-256 hashes over 1 KiB inputs per task.
- Machine: 8-core / 16-thread x86-64, 15 GB RAM, CPython 3.13, Redis 7.4.
- Versions: ArdiQ 0.1.1, arq 0.28, Taskiq 0.12.4, Streaq 6.5.0, Celery 5.5.3, Dramatiq 2.1.0.
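To make the two scenarios concrete, here is a minimal sketch of what the task bodies could look like. It is a reconstruction from the description above, not the code from the benchmark repo: the function names, the payload contents, and the default arguments are all assumptions.

```python
import asyncio
import hashlib
import time

# 1 KiB input. The actual bytes the suite hashes are an assumption here.
PAYLOAD = b"x" * 1024


async def io_task_async(seconds: float = 0.1) -> None:
    """I/O scenario for async-native queues: a non-blocking 100 ms sleep."""
    await asyncio.sleep(seconds)


def io_task_sync(seconds: float = 0.1) -> None:
    """I/O scenario for sync queues: a blocking 100 ms sleep."""
    time.sleep(seconds)


def cpu_task(rounds: int = 1000) -> str:
    """CPU scenario: hash the 1 KiB payload `rounds` times, return the last digest."""
    digest = ""
    for _ in range(rounds):
        digest = hashlib.sha256(PAYLOAD).hexdigest()
    return digest


if __name__ == "__main__":
    asyncio.run(io_task_async())
    io_task_sync()
    print(cpu_task())
```

Each library would wrap bodies like these in its own task decorator or actor definition; that wiring is library-specific and is left out here.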
I/O-bound throughput
The io_task scenario is the realistic one for these libraries: async-native queues multiplex the 10 sleeps on one event loop. With 1,000 tasks at concurrency 10 and a 100 ms sleep, the theoretical ceiling is 10 tasks / 0.1 s = 100 tasks/s, so anything near it is essentially network-bound.
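The ceiling arithmetic can be checked with a queue-free simulation. The sketch below is an illustration only: it uses a plain `asyncio.Semaphore` in one process, with no Redis and no queue library, so it shows the upper bound set by the sleeps themselves rather than any library's behaviour. The names `theoretical_ceiling` and `measure` are hypothetical.

```python
import asyncio
import time


def theoretical_ceiling(concurrency: int, task_seconds: float) -> float:
    """Upper bound on tasks/s when each slot is busy for task_seconds."""
    return concurrency / task_seconds


async def measure(total_tasks: int, concurrency: int, task_seconds: float) -> float:
    """Run sleeping tasks behind a semaphore and return observed tasks/s."""
    gate = asyncio.Semaphore(concurrency)

    async def one_task() -> None:
        # At most `concurrency` tasks are inside this block at any time.
        async with gate:
            await asyncio.sleep(task_seconds)

    start = time.perf_counter()
    await asyncio.gather(*(one_task() for _ in range(total_tasks)))
    return total_tasks / (time.perf_counter() - start)


if __name__ == "__main__":
    print("ceiling :", theoretical_ceiling(10, 0.1))
    print("observed:", asyncio.run(measure(1000, 10, 0.1)))
```

A real queue adds broker round-trips and scheduling on top of this loop, which is where the gap between a measured figure and 100 tasks/s comes from.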
| Queue | Throughput (tasks/s) | Memory |
|---|---|---|
| ArdiQ 🦀 | 98.6 | 34 MB 🪶 |
| Taskiq | 97.9 | 92 MB |
| Dramatiq | 93.5 | 56 MB |
| Streaq | 93.4 | 48 MB |
| arq | 87.7 | 30 MB |
| Celery | 71.7 | 51 MB |
ArdiQ posts the highest throughput in the table, within ~1.4% of the 100 tasks/s ceiling and practically network-bound, at roughly a third of the memory of the next-fastest queue, Taskiq. It’s the lightest of every queue that clears 90% of the ceiling.
CPU-bound throughput
The cpu_task scenario hashes under the GIL, so for every single-process queue the task body is serial on one core. What this measures is really per-task framing overhead (serialization, broker round-trips, bookkeeping) on top of the constant hashing cost.
| Queue | Throughput (tasks/s) | Memory |
|---|---|---|
| ArdiQ 🦀 | 389.3 | 34 MB 🪶 |
| Taskiq | 388.1 | 94 MB |
| Streaq | 353.8 | 49 MB |
| arq | 317.6 | 30 MB |
| Celery | 13.8 | 52 MB |
| Dramatiq | 13.8 | 56 MB |
Again ArdiQ and Taskiq are effectively tied at the top on throughput, with ArdiQ using roughly a third of Taskiq’s memory. (Celery and Dramatiq sit far lower here because their thread pools serialize on the GIL for this workload; see the caveats.)
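Since the task body is serial and the hashing cost is the same for every queue, throughput can be inverted into wall-clock time per task, and the difference between two queues then approximates the difference in their per-task overhead. The sketch below does that with the throughput figures from the table above; `ms_per_task` and `extra_ms_vs` are hypothetical helper names, and the reading only holds under the serial-execution assumption stated in this section.

```python
# CPU-bound throughput in tasks/s, copied from the table above.
CPU_THROUGHPUT = {
    "ArdiQ": 389.3,
    "Taskiq": 388.1,
    "Streaq": 353.8,
    "arq": 317.6,
    "Celery": 13.8,
    "Dramatiq": 13.8,
}


def ms_per_task(tasks_per_second: float) -> float:
    """Wall-clock milliseconds spent per task at a given throughput."""
    return 1000.0 / tasks_per_second


def extra_ms_vs(baseline: str, throughput: dict[str, float]) -> dict[str, float]:
    """Per-task time of each queue minus the baseline queue's per-task time."""
    base = ms_per_task(throughput[baseline])
    return {name: ms_per_task(tps) - base for name, tps in throughput.items()}


if __name__ == "__main__":
    for name, extra in extra_ms_vs("ArdiQ", CPU_THROUGHPUT).items():
        total = ms_per_task(CPU_THROUGHPUT[name])
        print(f"{name:<9} {total:7.2f} ms/task  (+{extra:.2f} ms vs ArdiQ)")
```

For Celery and Dramatiq the gap reflects the GIL contention noted above as well as framing cost, so it should not be read as pure queue overhead.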
The takeaways
- ⚡ Best throughput-to-memory ratio: ArdiQ does the most work per megabyte of any queue in the suite (arq ties it on I/O).
- 🪶 Lightest of the fast queues: ~34 MB, the lowest footprint of anything at its performance level. (arq is marginally lighter in absolute terms but meaningfully slower.)
- 🏆 Among the fastest: top of both throughput tables, with Taskiq within ~1%.
- 📈 Near the theoretical ceiling on I/O work: practically network-bound, with almost nothing lost to scheduling.
- 🎯 Rock-steady: negligible variance run to run (low std).
Honest caveats
- CPU parallelism isn’t measured here. All libraries run one worker; to scale CPU work you’d run multiple worker processes (Celery’s prefork, Dramatiq’s --processes N, or several async workers). This suite measures per-task overhead, not multi-core scaling.
- Each queue uses its idiomatic dispatch path and the same Redis instance, one at a time. Latency, raw per-iteration samples, and the full methodology, including how tail-latency is measured, live in the benchmark repo.