
Conversation

Contributor

@harshit-anyscale harshit-anyscale commented Nov 6, 2025

We've recently added asynchronous inference support in Ray Serve, but it currently lacks the ability to autoscale based on the number of tasks in the queues used by async inference.

This PR adds code that determines the number of pending tasks in the queue (the Celery queue) so that the number of replicas can be decided based on it.

Celery itself doesn't provide an API that reports the number of tasks in a queue, so we have to either:

  1. write our own logic for each broker separately (a minimal Redis sketch follows this list), or
  2. use the flower library https://github.com/mher/flower
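
For comparison, here is a minimal sketch of what approach 1 would look like for a Redis broker only. The helper below is hypothetical and not part of this PR; with Redis, Celery stores each queue as a list keyed by the queue name, so LLEN gives the number of messages still waiting (reserved/unacknowledged tasks are not counted), and every other broker (RabbitMQ, SQS, ...) would need its own equivalent.

```python
import redis


def redis_queue_length(broker_url: str, queue_name: str = "celery") -> int:
    # Hypothetical helper illustrating approach 1 for a Redis broker only.
    # Celery keeps a Redis-backed queue as a list under the queue name, so
    # LLEN counts the messages waiting to be consumed.
    client = redis.Redis.from_url(broker_url)
    return client.llen(queue_name)
```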

This PR implements approach 2 and proposes to go with it, since the Celery documentation itself points to Flower as its monitoring and observability tool. Link: https://docs.celeryq.dev/en/latest/userguide/monitoring.html#flower-real-time-celery-web-monitor

We spin up Flower in a separate thread within the same Ray Serve process. That thread runs an event loop; whenever someone asks for the queue length, we schedule a coroutine on this event loop to query the queue lengths.

Whenever the replica is stopped, the stop_consumer() function in CeleryTaskProcessorAdapter stops the Flower thread as well.
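
To make the threading model concrete, here is a minimal, self-contained sketch of the pattern described above. Class and method names are illustrative rather than the actual ones in this PR, and the Flower-backed query is stubbed out: a dedicated thread owns an asyncio event loop, and callers schedule a coroutine onto it with run_coroutine_threadsafe and block on the returned future.

```python
import asyncio
import threading
from typing import Dict, List


class QueueMonitor:
    """Illustrative sketch: a background thread that owns an asyncio event loop."""

    def __init__(self) -> None:
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self) -> None:
        asyncio.set_event_loop(self._loop)
        self._loop.run_forever()

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        # Ask the loop to stop from its own thread, then join and close it.
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join(timeout=20)
        self._loop.close()

    async def _fetch_queue_lengths(self, queues: List[str]) -> Dict[str, int]:
        # Placeholder: the PR delegates this to Flower's broker inspection;
        # returning zeros keeps the sketch self-contained.
        return {name: 0 for name in queues}

    def get_queue_lengths(self, queues: List[str]) -> Dict[str, int]:
        # Callable from any thread: schedule the coroutine on the monitor's
        # loop and block on the cross-thread future for the result.
        future = asyncio.run_coroutine_threadsafe(
            self._fetch_queue_lengths(queues), self._loop
        )
        return future.result(timeout=10)
```

Under this sketch, the replica would call start() once, get_queue_lengths(["celery"]) whenever the autoscaler asks, and stop() from stop_consumer().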

Signed-off-by: harshit <harshit@anyscale.com>
@harshit-anyscale harshit-anyscale self-assigned this Nov 6, 2025
@harshit-anyscale harshit-anyscale marked this pull request as ready for review November 6, 2025 18:14
@harshit-anyscale harshit-anyscale requested a review from a team as a code owner November 6, 2025 18:14
@harshit-anyscale harshit-anyscale added the go (add ONLY when ready to merge, run all tests) label Nov 6, 2025
@ray-gardener ray-gardener bot added the serve (Ray Serve Related Issue) and observability (Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling) labels Nov 6, 2025
Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: harshit <harshit@anyscale.com>
        logger.info("Queue monitor stopped successfully")

    except Exception as e:
        logger.error(f"Error stopping queue monitor: {e}")

Bug: Executor Leaks Resources After Timeout

The stop() method fails to shut down the executor when _loop_thread.result(timeout=20) times out. The shutdown(wait=True) call at line 135 is inside the try block, so it's skipped when a TimeoutError is raised. This leaves the executor running with a potentially stuck thread, causing a resource leak. The executor shutdown should be in the finally block to ensure cleanup regardless of timeout.
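
A simplified, standalone sketch of the suggested structure (a free function rather than the PR's actual method): the executor shutdown sits in finally so it runs on both the success and the timeout path.

```python
import concurrent.futures
import logging

logger = logging.getLogger(__name__)


def stop_monitor(
    executor: concurrent.futures.ThreadPoolExecutor,
    loop_thread: concurrent.futures.Future,
) -> None:
    try:
        # Raises concurrent.futures.TimeoutError if the loop thread is stuck.
        loop_thread.result(timeout=20)
        logger.info("Queue monitor stopped successfully")
    except Exception as e:
        logger.error(f"Error stopping queue monitor: {e}")
    finally:
        # Runs on every path, so the executor is never leaked; wait=False
        # avoids blocking forever on a worker that is already stuck.
        executor.shutdown(wait=False)
```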


Contributor Author

fixed it

Signed-off-by: harshit <harshit@anyscale.com>
        self._loop.close()

    # Start event loop in background thread
    self._loop_thread = self._executor.submit(_run_event_loop)

Bug: Concurrent Initialization Race Condition

The start() method has a race condition where self._loop is checked in the main thread but assigned in the background thread. Multiple concurrent calls to start() can pass the if self._loop is not None check before any thread assigns the value, causing multiple event loops and threads to be created. The check should use a thread-safe flag or lock to prevent concurrent initialization.
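
One possible shape for the fix, sketched with illustrative names rather than the PR's actual code: the check and the assignment happen atomically under a lock, so concurrent start() calls cannot each create a loop and thread.

```python
import asyncio
import concurrent.futures
import threading


class LoopOwner:
    """Illustrative sketch: thread-safe, idempotent start()."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._loop = None
        self._loop_thread = None
        self._executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)

    def start(self) -> None:
        with self._lock:
            # Check and assignment are atomic with respect to other callers,
            # so only the first start() creates the loop and worker thread.
            if self._loop is not None:
                return
            self._loop = asyncio.new_event_loop()
            self._loop_thread = self._executor.submit(self._run_event_loop)

    def _run_event_loop(self) -> None:
        asyncio.set_event_loop(self._loop)
        self._loop.run_forever()
```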


finally:
    self._loop = None
    self._loop_thread = None
    self._loop_ready.clear()

Bug: Race Condition: Loop Shutdown Inconsistency

The stop() method has a race condition where get_queue_lengths() can be called after the event loop is closed but before _loop is set to None. After line 132 waits for the thread to complete, the loop is closed, but _loop remains non-None and _loop_ready remains set until the finally block executes. A concurrent call to get_queue_lengths() during this window will pass the check on line 150 and attempt to schedule a coroutine on the closed loop, causing an error.
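
One way to close this window, again with illustrative names: stop() and get_queue_lengths() share a lock and hold it for their whole body, so a concurrent caller either finishes against the live loop before shutdown proceeds or observes that _loop is None; it can never see a closed-but-non-None loop.

```python
import asyncio
import threading
from typing import Dict


class LoopOwner:
    """Illustrative sketch: shutdown and queries are serialized by one lock."""

    def __init__(self, loop: asyncio.AbstractEventLoop, loop_thread) -> None:
        self._lock = threading.Lock()
        self._loop = loop
        self._loop_thread = loop_thread  # e.g. a concurrent.futures.Future

    def stop(self) -> None:
        with self._lock:
            if self._loop is None:
                return
            self._loop.call_soon_threadsafe(self._loop.stop)
            self._loop_thread.result(timeout=20)
            self._loop.close()
            self._loop = None
            self._loop_thread = None

    def get_queue_lengths(self) -> Dict[str, int]:
        with self._lock:
            if self._loop is None:
                return {}
            future = asyncio.run_coroutine_threadsafe(self._query(), self._loop)
            return future.result(timeout=10)

    async def _query(self) -> Dict[str, int]:
        # Placeholder for the Flower-backed broker inspection.
        return {}
```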


setup_spec.extras["serve"]
+ [
    "celery",
    "flower",
Collaborator

please update requirements_compiled.txt and/or depset lock files.
