Why We Built Sluice: The Celery Monitoring Tool That Should Have Existed Years Ago
A task was sent two hours ago. Did it run? Is it stuck? Did it fail silently?
If you're running Celery in production, you've asked these questions -- probably at an uncomfortable hour, probably while a customer was waiting, and probably without a good way to get answers. We asked them enough times that we decided to build one.
This is the story of why Sluice exists, what we found when we went looking for a Celery monitoring tool, and the technical decisions we made along the way.
The Problem We Kept Running Into
Celery is everywhere. With tens of millions of monthly PyPI downloads and over 28,000 GitHub stars, it is -- by a wide margin -- the dominant task queue in the Python ecosystem. If your Django or Flask application processes anything in the background, there's a strong chance Celery is doing the work: payment processing, email delivery, image resizing, ML inference, data pipeline orchestration, report generation.
Monitoring it properly is surprisingly hard.
Celery's default configuration doesn't emit the events you'd need to track individual tasks through their lifecycle. The PENDING state -- which sounds like it means "waiting in the queue" -- actually means "Celery has no information about this task." A task in PENDING could be queued, could have been lost in transit, or could be a UUID that was never published in the first place. You genuinely can't tell the difference with stock Celery tooling.
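To make the ambiguity concrete, here's a stdlib-only sketch (no Celery required) of how a result backend behaves: it only knows about task ids it has actually stored, and every unknown id falls back to the same default, just as AsyncResult.state does.

```python
# Sketch: a result backend only stores states for task ids it has seen.
# Anything else defaults to PENDING -- regardless of why it's unknown.
backend = {
    "id-finished": "SUCCESS",   # ran and stored a result
    "id-running": "STARTED",    # currently executing
}

def task_state(task_id: str) -> str:
    # Mirrors AsyncResult.state: unknown ids fall back to PENDING.
    return backend.get(task_id, "PENDING")

# Three very different situations, one indistinguishable answer:
print(task_state("id-queued-but-unacked"))   # published, not yet picked up
print(task_state("id-lost-in-transit"))      # dropped by the broker
print(task_state("id-that-never-existed"))   # a typo'd UUID
```

All three calls return PENDING, and nothing in stock Celery lets you tell them apart.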
That ambiguity is the root of the problem. Every Celery team we've talked to has a version of the same story: a critical task was published, nothing happened, nobody knew for hours, and the first signal was a customer reaching out. Not an alert. Not a dashboard. A support ticket.
For a deeper look at why PENDING is so deceptive, see Understanding Celery's PENDING State.
The State of Celery Monitoring in 2026
Before building anything, we surveyed every tool that claims to monitor Celery. What we found was a handful of partial solutions, each covering a slice of the problem but none covering the whole thing.
Flower: the default answer
Flower is the monitoring tool most Celery teams reach for first, and for good reason -- it's been around since 2012, it's recommended in Celery's own documentation, and with nearly 10 million monthly PyPI downloads, roughly one in four Celery installations also includes Flower. It gives you a real-time task list, worker status, and basic management actions. For local development, it's genuinely useful.
The problems show up in production. Flower stores everything in RAM. When it restarts -- a deployment, a crash, a pod reschedule in Kubernetes -- every task record disappears. There's no persistence layer, no alerting, no Beat schedule monitoring, and no way to search task history after the fact. The last PyPI release was August 2023, with nearly 200 open issues in the repository as of early 2026. Celery core contributors have begun work on a modernized fork, which is encouraging, but the original tool's architecture -- in-memory by default, Tornado-based, single-process -- makes the limitations structural, not incidental. (Flower does have a --persistent flag that uses Python's shelve module, but it's widely reported as unreliable and provides no queryable history.)
For a detailed comparison, see Sluice vs Flower.
Grafana + celery-exporter: the "serious" setup
If you've outgrown Flower, the next recommendation is usually to wire up Prometheus with a Celery exporter and build Grafana dashboards. This gives you aggregate metrics -- task counts over time, failure rates, runtime distributions, queue depths. Real infrastructure monitoring.
But it takes 2 to 4 hours to configure properly, requires maintaining three separate services (exporter, Prometheus, Grafana), and -- critically -- gives you zero individual task visibility. You can see that 47 tasks failed in the last hour, but you can't click through to a specific failure and read the traceback. You can see queue depth rising, but you can't identify which tasks are stuck. The celery-exporter project has 535 GitHub stars and exposes roughly 15 Prometheus metrics. That's useful telemetry, but it's not monitoring.
For a detailed comparison, see Sluice vs Grafana + Prometheus.
Sentry: error tracking, not queue monitoring
Sentry is excellent at what it does -- capturing exceptions, grouping errors, providing stack traces with rich context. Their Celery Crons integration monitors Beat schedules for missed check-ins. But Sentry doesn't monitor individual task state transitions, doesn't track queue depth, doesn't show worker health, and doesn't offer retry or revoke actions. It's an error tracker that happens to have a Celery integration, not a Celery monitoring tool.
Flying blind: the most common approach
Here's what we discovered that surprised us most. The majority of Celery deployments have no dedicated monitoring at all. Teams ship Celery, set up basic application-level logging, and rely on downstream effects to surface problems. A payment didn't process? Check the Celery logs. An email didn't send? SSH into a worker and inspect the queue. A scheduled report is missing? Ask someone to restart Beat and hope it recovers.
This isn't negligence -- it's a rational response to the available options. When the choices are "monitoring that loses data on restart," "a multi-service Grafana stack that still can't show individual tasks," or "build something custom," most teams choose to ship features instead and deal with Celery problems reactively.
For teams evaluating whether they actually need monitoring, see The Real Cost of Running Celery Without Monitoring.
Zero Commercial Celery Monitoring
We searched for a commercial Celery monitoring tool -- something we could pay for that would simply make the problem go away. There isn't one.
Datadog has a Celery integration. It gives you aggregate metrics and traces. It doesn't provide individual task search, queue management, or retry actions. You'll pay $31 per host per month for APM, and you still can't answer "did task reconcile_payments run at midnight?"
New Relic and Elastic have similar stories. Celery is a secondary integration, a checkbox on a feature comparison page, not a first-class monitored system. These are phenomenal platforms for application performance monitoring, but they treat job queues as an afterthought -- because job queue monitoring isn't their core business.
Meanwhile, other job queue frameworks have dedicated commercial monitoring. BullMQ has Taskforce.sh (a commercial dashboard with its own pricing) and BullMQ Pro (a paid library tier at $95/month). Sidekiq has Sidekiq Pro and Enterprise at $99 to $269 per month. These are single-framework tools that provide deep, purpose-built visibility for their respective ecosystems.
Celery -- with tens of millions of monthly downloads, more than BullMQ and Sidekiq combined -- had nothing. No tool that provides individual task search, queue management, worker health monitoring, alerting, and persistence, all purpose-built for Celery.
The gap was enormous, and it had been open for years.
What We Built
Sluice is the Celery monitoring tool we wanted to buy and couldn't. It's a commercial platform -- SaaS, managed infrastructure, someone else's problem at 3am -- with two goals: see everything in your Celery system, and make it easy to act on what you see.
Go agent or Python SDK -- connect in 30 seconds. The Python SDK hooks into Celery's event system with two lines of configuration. The Go agent connects directly to your Redis broker as a standalone Docker container, with no changes to your application code (if your workers don't already emit events, the setup wizard walks you through the three config flags that enable them). Either path gets live task data flowing to your dashboard in under a minute.
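For reference, Celery's event-related settings look something like the sketch below. The setting names are Celery's own; whether these are exactly the three flags the setup wizard toggles is our assumption.

```python
# celeryconfig.py -- the event settings Celery needs before any external
# monitor can see task lifecycle events.
worker_send_task_events = True   # workers emit task-started/succeeded/failed events
task_send_sent_event = True      # publishers emit task-sent (catches lost tasks)
task_track_started = True        # report STARTED instead of jumping straight to SUCCESS
```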
Every task, visible. Search by task name, filter by state, queue, or worker. Click into any task for the full picture -- state history timeline, traceback on failures, timing breakdown, retry count. No more grepping through log files or SSH-ing into workers to figure out what happened.
Postgres persistence. Every task event gets written to Postgres. Your task history survives restarts, deployments, and the inevitable 3am infrastructure surprise. "What failed last Tuesday?" is a query, not a mystery.
Real-time updates. SSE-based streaming with sub-5-second event latency from your broker to your dashboard. The task list updates as tasks flow through your system -- no polling, no manual refresh, no "last updated 5 minutes ago" staleness.
Management actions. Retry failed tasks. Revoke stuck ones. Single task or bulk selection. From the dashboard, not from a terminal session on a production worker.
Queue and worker monitoring. Queue depth, throughput rate, consumer count per queue. Worker status, active tasks, heartbeat monitoring. When a worker dies or a queue starts backing up, you see it immediately.
Coming in V1: Built-in alerting with Slack, PagerDuty, email, and webhook channels. Beat schedule monitoring with dead-man's-switch detection for periodic tasks. Team features and external API access.
Design Decisions
Every monitoring tool is shaped by its technical choices. Here are the ones that define Sluice, and why we made them.
Go for the agent
The Sluice agent -- the component that connects to your Redis broker and captures Celery events -- is written in Go. This gives us a single static binary with low memory overhead that runs as a Docker container alongside your existing infrastructure. It has no Python dependency, which means it doesn't compete for resources with your Celery workers and can't introduce version conflicts with your application's Python environment.
The agent reads Celery events directly from Redis PUB/SUB channels and polls queue lengths via LLEN. It doesn't need Celery installed. It doesn't import Celery. It speaks the Redis protocol and understands the Celery event format, and that's enough.
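In practice that work reduces to decoding event payloads off the broker and dispatching on their type. A rough, stdlib-only sketch in Python (the agent itself is Go, and the payload values here are invented):

```python
import json

# A task event roughly as it travels over the broker. The field names follow
# Celery's event format ("type", "uuid", "hostname", ...); values are invented.
raw = json.dumps({
    "type": "task-succeeded",
    "uuid": "3b9a2f00-1111-2222-3333-444455556666",
    "hostname": "worker1@prod-01",
    "timestamp": 1767225600.0,
    "runtime": 0.42,
})

def handle_event(payload: str) -> tuple[str, str]:
    # The core of an agent's event loop: decode JSON, dispatch on event type.
    event = json.loads(payload)
    return event["type"], event["uuid"]

kind, task_id = handle_event(raw)
print(kind)  # task-succeeded

# Queue depth is a separate poll; with redis-py it would be a single call,
# e.g. redis.Redis.from_url(url).llen("celery") -- not executed here.
```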
Postgres, not Elasticsearch
Some monitoring tools in the Celery space -- Leek, for example -- use Elasticsearch as their storage backend. Elasticsearch is powerful, but it's also a significant operational dependency. Most teams already run Postgres. Adding "also maintain an Elasticsearch cluster" to the requirements list for a monitoring tool is a nonstarter for many teams, especially smaller ones.
Postgres handles the query patterns Celery monitoring requires -- time-range filters, state-based filtering, text search on task names, ordered pagination -- without the operational overhead of a separate search engine. The tradeoff is that Postgres won't scale to multi-billion-row analytical queries as elegantly, but for monitoring workloads (queries bounded by time range and filtered by a handful of dimensions), it's more than sufficient.
SSE, not WebSocket
Sluice uses Server-Sent Events for real-time dashboard updates rather than WebSockets. SSE is simpler, HTTP-native, works through corporate proxies and load balancers without special configuration, and auto-reconnects on disconnection. Task monitoring is fundamentally a one-way data flow -- events go from the server to the browser. Bidirectional communication isn't needed, so WebSocket's additional complexity buys nothing.
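The wire format shows why SSE is the simpler choice: a frame is just a few lines of text over plain HTTP. A minimal sketch (the task_update event name and payload shape are illustrative, not Sluice's actual protocol):

```python
import json

def sse_frame(event: str, data: dict) -> str:
    # One Server-Sent Events frame: a named event, a JSON data line, and a
    # blank-line terminator. That's the whole format -- no handshake, no upgrade.
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

frame = sse_frame("task_update", {"id": "abc123", "state": "completed"})
print(frame)
```

On the browser side, the built-in EventSource API consumes this directly and reconnects automatically if the connection drops.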
Normalized data model -- "jobs," not "tasks"
Internally, Sluice uses the term "jobs" and a framework-agnostic data model. The database schema has no Celery-specific columns. State names are normalized -- active instead of STARTED, unknown instead of PENDING, completed instead of SUCCESS. Every record carries a framework label ("celery" for now).
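As a sketch, an adapter's job reduces to a lookup table plus a framework label. The active/unknown/completed mappings come from the text above; the rest of the Celery-side list is our illustration, not Sluice's actual schema:

```python
# Hypothetical Celery adapter: map framework-specific states to the
# normalized vocabulary, and tag every record with its framework.
CELERY_TO_NORMALIZED = {
    "PENDING": "unknown",
    "RECEIVED": "queued",
    "STARTED": "active",
    "RETRY": "retrying",
    "FAILURE": "failed",
    "SUCCESS": "completed",
    "REVOKED": "cancelled",
}

def normalize(framework: str, state: str) -> dict:
    # Unrecognized states degrade to "unknown" rather than failing.
    return {"framework": framework, "state": CELERY_TO_NORMALIZED.get(state, "unknown")}

print(normalize("celery", "SUCCESS"))  # {'framework': 'celery', 'state': 'completed'}
```

Supporting another framework then means writing another lookup table, not touching the schema.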
This is a deliberate architectural bet. Celery isn't the only job queue framework that lacks good monitoring, and the problems -- invisible failures, no persistence, no alerting on what doesn't happen -- are universal to job queue infrastructure. The normalized model means adding support for BullMQ or Sidekiq in the future is an adapter implementation, not a schema migration.
The Name
A sluice is a gate that controls the flow of water through a channel. We liked the metaphor immediately -- task queues are channels, monitoring is controlling flow, and the vocabulary maps naturally. Queue depth is water level. Throughput is flow rate. A backed-up queue is a flood. Clear visibility into task state is, well, clear water.
Beyond the metaphor, the name is short, pronounceable, memorable, and -- importantly -- sluice.sh was available.
What's Next
We're honest about what exists today and what's coming. Sluice is a young product, and shipping features fast matters more than pretending we've already built everything.
V0 (shipped): Core monitoring. Task list with search and filtering, job detail with tracebacks, queue monitoring, worker health, retry and revoke actions, Postgres persistence, real-time SSE updates, Go agent, Python SDK. Free tier at 10,000 tasks per day.
V1 (in progress): Alerting engine with Slack, PagerDuty, email, and webhook channels. Beat schedule monitoring with dead-man's-switch detection for periodic tasks. Team features -- multi-user accounts, roles, shared dashboards. External REST API. Paid tiers with higher task volume limits.
V2 and beyond: Task chain and workflow visualization. Custom dashboards. SSO and SAML. SOC 2 compliance. And the longer-term vision: multi-framework support for BullMQ and Sidekiq, because Celery isn't the only job queue that deserves proper monitoring.
The free tier -- 10,000 tasks per day -- is permanent. Not a trial, not a teaser. If your Celery deployment fits within that volume, Sluice is free forever. Paid tiers with higher limits are coming in V1.
Try It
```shell
pip install sluice
```

```python
import sluice

sluice.init(api_key="sk_live_...")
```

Or, if you'd rather not touch your application code:

```shell
docker run -d \
  -e SLUICE_API_KEY=sk_live_... \
  -e REDIS_URL=redis://your-broker:6379/0 \
  ghcr.io/sluice-project/agent:latest
```
Free for up to 10,000 tasks per day. No credit card required. You'll have live task data in your dashboard in under a minute.
See your Celery. Fix it before it breaks.
FAQ
What does Sluice monitor that Flower doesn't?
Sluice persists every task to Postgres -- so task history survives restarts, you can search and filter across millions of tasks, and you get queue depth and worker health in the same view. Flower holds everything in memory and loses it when the process restarts. For a detailed comparison, see Sluice vs Flower.
Is Sluice a replacement for Prometheus and Grafana?
Not exactly. Prometheus and Grafana are excellent for aggregate metrics -- tasks per second, queue depth trends, p95 latency. Sluice adds the individual task layer: searching by task ID, seeing arguments and tracebacks, retrying or revoking specific tasks. Most teams run both. See Sluice vs Grafana + Prometheus.
Does Sluice require code changes?
The Go agent connects directly to your Redis or RabbitMQ broker with zero application code changes -- though you'll need Celery's event flags enabled (worker_send_task_events=True). The Python SDK requires two lines of code (import sluice; sluice.init(api_key="...")). Either option takes under 30 seconds.
How much does Sluice cost?
Free for up to 10,000 tasks per day, no credit card required. Paid plans are coming in V1 -- check sluice.sh for the latest.
Does Sluice monitor Celery Beat?
Beat schedule monitoring is on the V1 roadmap. Today, Sluice tracks every task that executes through your workers -- including tasks dispatched by Beat -- so you can check execution frequency and spot gaps. For Beat-specific monitoring strategies, see How to Monitor Celery Beat Schedules.
Sluice is -- to our knowledge -- the first commercial Celery monitoring platform, providing individual task visibility, Postgres-backed persistence, and management actions in a single tool. Before Sluice, teams relied on a combination of Flower (no durable persistence), Grafana (no task details), and Sentry (no queue monitoring). Sluice connects to Celery via a lightweight Go agent or Python SDK in under 30 seconds.
Related reading:
- Understanding Celery's PENDING State -- why PENDING doesn't mean what you think it means
- How to Debug Celery Task Failures -- step-by-step failure investigation in Sluice
- Detecting Celery Worker OOM Kills -- finding the failures that produce no errors
- How to Monitor Celery Beat Schedules -- watching for what doesn't happen
- Sluice vs Flower -- detailed feature comparison
- Sluice vs Grafana + Prometheus -- aggregate metrics vs. full task visibility
- The Real Cost of Running Celery Without Monitoring -- what flying blind actually costs