
Conversation

@MasterPtato
Contributor

No description provided.

Contributor Author

MasterPtato commented Nov 10, 2025


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@vercel

vercel bot commented Nov 10, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Preview | Updated (UTC) |
| --- | --- | --- | --- |
| rivetkit-serverless | Error | Error | Nov 13, 2025 10:47pm |

3 Skipped Deployments

| Project | Deployment | Preview | Updated (UTC) |
| --- | --- | --- | --- |
| rivet-cloud | Ignored | Ignored | Nov 13, 2025 10:47pm |
| rivet-inspector | Ignored | Ignored (Preview) | Nov 13, 2025 10:47pm |
| rivet-site | Ignored | Ignored (Preview) | Nov 13, 2025 10:47pm |

@pkg-pr-new

pkg-pr-new bot commented Nov 10, 2025


@rivetkit/actor

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/actor@3443

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3443

@rivetkit/core

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/core@3443

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3443

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3443

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3443

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3443

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3443

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3443

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3443

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3443

commit: 09cec26

@claude

claude bot commented Nov 10, 2025

Pull Request Review

This PR improves SIGTERM handling across the runtime with a centralized signal handler and configurable graceful shutdown timeouts. Overall, the architecture is solid and the code quality is good. Here are my findings:

✅ Strengths

  1. Excellent architectural improvement: Centralizing signal handling in rivet-runtime eliminates duplication and provides consistent behavior across all components.

  2. Configurable shutdown timeouts: Making shutdown durations configurable (30s for workers, 1h for guard) provides operational flexibility.

  3. Cross-platform support: Clean handling of both Unix (SIGTERM/SIGINT) and Windows (Ctrl+C) signals.

  4. Progressive shutdown: The 3-SIGTERM threshold for force shutdown is a good UX pattern.

  5. Resource cleanup: Proper distinction between services requiring graceful vs immediate shutdown.

🐛 Bugs & Issues

Critical

1. Bug in worker shutdown remaining_workflows count (engine/packages/gasoline/src/worker.rs:322-326)

let remaining_workflows = wf_futs.into_iter().count();
if remaining_workflows == 0 {
    tracing::info!("all workflows evicted");
} else {
    tracing::warn!(remaining_workflows=?self.running_workflows.len(), "not all workflows evicted");
}
  • Line 322 counts wf_futs iterator
  • Line 326 logs self.running_workflows.len() instead of remaining_workflows
  • These could be different values, making the log misleading
  • Fix: Change line 326 to use remaining_workflows variable
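As a minimal sketch of that fix, the count can be captured once and reused in both branches so the log can never disagree with the check; `eviction_message` is a hypothetical helper standing in for the two `tracing` calls, returning the message instead of logging it:

```rust
// Sketch of the corrected logging: `remaining` stands in for the
// `remaining_workflows` count captured before logging. The same value
// drives both the branch and the message, so they cannot diverge.
fn eviction_message(remaining: usize) -> String {
    if remaining == 0 {
        "all workflows evicted".to_string()
    } else {
        // uses the counted value, not a possibly-stale map length
        format!("not all workflows evicted (remaining_workflows={remaining})")
    }
}
```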

2. Potential infinite sleep in guard/worker shutdown loops (engine/packages/guard-core/src/server.rs:269, engine/packages/gasoline/src/worker.rs:315)

_ = tokio::time::sleep(shutdown_duration.saturating_sub(shutdown_start.elapsed())) => {
  • If shutdown_start.elapsed() >= shutdown_duration, saturating_sub returns Duration::ZERO
  • tokio::time::sleep(Duration::ZERO) completes immediately
  • This causes a tight busy loop until the next SIGTERM or task completion
  • Fix: Break the loop when timeout is reached instead of relying on sleep
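The suggested fix can be sketched as a deadline check that tells the caller to break instead of issuing a zero-duration sleep; `remaining_budget` is a hypothetical helper, not code from the PR:

```rust
use std::time::{Duration, Instant};

// Sketch of a deadline check for the shutdown loop: returns the time
// left in the budget, or None once the budget is exhausted so the
// caller breaks out of its select loop rather than sleeping for zero.
fn remaining_budget(start: Instant, budget: Duration) -> Option<Duration> {
    let elapsed = start.elapsed();
    if elapsed >= budget {
        None // deadline reached: break, do not sleep
    } else {
        Some(budget - elapsed)
    }
}
```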

Medium

3. Missing error propagation in TermSignalHandler::new (engine/packages/runtime/src/term_signal.rs:88)

let term_signal = TermSignalHandler::new()
    .expect("failed initializing termination signal handler");
  • Uses expect() which panics on signal registration failure
  • Should return Result from TermSignal::new() or provide better context
  • Signal registration can fail due to process limits or permissions

4. Inconsistent comment style (engine/packages/config/src/config/runtime.rs:11-13,15)

  • Lines 11-12 use // comments
  • Should use /// for documentation comments on public struct fields
  • Affects generated documentation quality

⚠️ Potential Issues

5. Service restart behavior during shutdown (engine/packages/service-manager/src/lib.rs:155-166)

  • Services that crash during shutdown still attempt to restart after 1-second delay
  • The shutting_down flag is only set when SIGTERM is received, not when services naturally exit
  • Could cause unwanted restarts if a service crashes before shutdown signal arrives

6. Race condition in TermSignalHandler (engine/packages/runtime/src/term_signal.rs:74-76)

if self.tx.send(self.count >= FORCE_CLOSE_THRESHOLD).is_err() {
    tracing::debug!("no sigterm subscribers");
}
  • If all subscribers drop between signal receipt and send, the signal is silently lost
  • Low impact since signals typically trigger during shutdown when subscribers exist
  • Consider logging at warn level for force-close threshold

7. AtomicBool SeqCst ordering may be excessive (engine/packages/service-manager/src/lib.rs:333,340)

  • Using SeqCst for simple boolean flag is overly conservative
  • Acquire/Release ordering would suffice and be more performant
  • Low impact in practice
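The suggested relaxation can be sketched as a Release store paired with an Acquire load; the flag name here is illustrative, not the PR's actual field:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Sketch of the weaker ordering: Release on the store publishes the
// flag, Acquire on the load observes it. SeqCst would add a global
// total order across all SeqCst operations, which a lone boolean
// shutdown flag does not need.
static SHUTTING_DOWN: AtomicBool = AtomicBool::new(false);

fn begin_shutdown() {
    SHUTTING_DOWN.store(true, Ordering::Release);
}

fn is_shutting_down() -> bool {
    SHUTTING_DOWN.load(Ordering::Acquire)
}
```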

🎯 Performance Considerations

8. Unnecessary shutdown loop iteration (worker.rs:301-320, server.rs:257-274)

  • Both shutdown loops continue iterating after join_fut/graceful_fut complete
  • Once these futures complete, no need to check timeout or additional SIGTERMs
  • The break on line 307/260 exits correctly, but the loop structure is confusing

9. FuturesUnordered overhead (gasoline/src/worker.rs:294-298)

  • Creates FuturesUnordered for potentially many workflows
  • Only used to check completion, not to process results
  • Consider using futures::future::join_all() or checking running_workflows directly

📝 Style & Convention Issues

10. Formatting inconsistency (engine/packages/service-manager/Cargo.toml:19)

  • Line 19 is missing a trailing newline at EOF
  • Per CLAUDE.md, Rust uses hard tabs, but this is a TOML file
  • Verify the repository's TOML formatting convention

11. Glob imports removed

  • Good catch removing use anyhow::*; in favor of explicit imports
  • Follows CLAUDE.md guidelines

12. Structured logging compliance

  • Properly uses tracing's structured logging throughout (e.g., ?err, ?duration)
  • Follows CLAUDE.md patterns

🧪 Test Coverage

13. No tests for new TermSignal functionality

  • No tests found for engine/packages/runtime/src/term_signal.rs
  • Signal handling is notoriously difficult to test but critical for reliability
  • Consider adding integration tests or manual test procedures

14. No tests for shutdown behavior

  • Worker and guard shutdown logic is complex with multiple edge cases
  • No tests for timeout behavior, force-shutdown, or graceful completion
  • Per CLAUDE.md: "Run tests with output displayed" - but there are no tests to run

🔐 Security Concerns

No security issues identified. Signal handling is appropriate for the threat model.

📋 Recommendations

Must fix before merge:

  1. Fix bug 1 (remaining_workflows count mismatch)
  2. Fix bug 2 (potential busy loop in shutdown timeouts)

Should fix:
3. Fix error propagation in TermSignalHandler initialization
4. Update comment style to use /// for doc comments
5. Add basic tests for signal handling behavior

Nice to have:
6. Simplify shutdown loop logic for clarity
7. Consider relaxed atomic ordering
8. Add integration test documentation for shutdown behavior


Great work on improving the signal handling architecture! The centralized approach is much cleaner than the previous implementation.

@claude

claude bot commented Nov 11, 2025

Pull Request Review: Improve SIGTERM Handling for the Entire Runtime

Summary

This PR improves graceful shutdown handling across the Rivet runtime by introducing a centralized TermSignal handler, configurable shutdown timeouts, and better coordination between services during shutdown. Overall, this is a well-thought-out improvement to shutdown reliability.

Code Quality & Best Practices

Strengths ✅

  1. Centralized signal handling: The new TermSignal handler in runtime/src/term_signal.rs provides a clean, reusable abstraction across all services
  2. Configurable timeouts: Adding Runtime config with separate timeouts for worker and guard is excellent for operational flexibility
  3. Cross-platform support: Proper handling of both Unix (SIGTERM/SIGINT) and Windows (Ctrl+C) signals
  4. Graceful degradation: The 3-signal threshold for forced shutdown is a good safety valve
  5. Improved logging: Lowercase log messages follow CLAUDE.md conventions properly

Issues Found 🔴

1. Critical: Incorrect remaining workflows count (gasoline/src/worker.rs:326)

let remaining_workflows = wf_futs.into_iter().count();
if remaining_workflows == 0 {
    tracing::info!("all workflows evicted");
} else {
    tracing::warn!(remaining_workflows=?self.running_workflows.len(), "not all workflows evicted");
}

Problem: You're logging self.running_workflows.len() but you should log remaining_workflows. Also, into_iter().count() on FuturesUnordered doesn't give you remaining futures - it consumes the iterator and counts items.

Fix: Track remaining count before the loop or use self.running_workflows.len() which hasn't been modified.

2. Comment style inconsistency (config/src/config/runtime.rs:11-16)

/// Time (in seconds) to allow for guard to wait for pending requests after receiving SIGTERM. Defaults
// to 1 hour.
guard_shutdown_duration: Option<u32>,
/// Whether or not to allow running the engine when the previous version that was run is higher than
// the current version.
allow_version_rollback: Option<bool>,

Problem: Doc comments should use /// consistently, not mix with //

Fix: Change // to /// on lines 12 and 15.

3. Potential race condition (service-manager/src/lib.rs:333-343)

abort = term_signal.recv() => {
    shutting_down.store(true, Ordering::SeqCst);
    
    // Abort services that don't require graceful shutdown
    running_services.retain(|(requires_graceful_shutdown, handle)| {
    if !requires_graceful_shutdown {
            handle.abort();
        }
        *requires_graceful_shutdown
    });

Problem: Between setting shutting_down and calling abort(), a service might exit gracefully and log "service exited" instead of getting aborted. This is minor but could cause confusion in logs.

Suggestion: Consider setting the flag and aborting in a single atomic operation, or at least document this behavior.

4. Missing error context (runtime/src/term_signal.rs:88)

let term_signal = TermSignalHandler::new()
    .expect("failed initializing termination signal handler");

Problem: Per CLAUDE.md, we should avoid panics when possible and provide better error context.

Suggestion: Since this is in an async init context, consider propagating the error or providing more detailed panic message about what might cause this (e.g., "failed initializing termination signal handler - this may indicate the process doesn't have signal handling permissions").

5. Clippy warning potential (gasoline/src/worker.rs:303)

let join_fut = async { while let Some(_) = wf_futs.next().await {} };

Issue: Clippy will warn about while let Some(_) - the underscore binding is unnecessary.

Fix: Use while wf_futs.next().await.is_some() {} or keep the binding as Some(res) and use it.

6. Shutdown duration calculation issue (gasoline/src/worker.rs:315)

_ = tokio::time::sleep(shutdown_duration.saturating_sub(shutdown_start.elapsed())) => {

Problem: If shutdown_start.elapsed() > shutdown_duration, this will sleep for 0 duration and immediately fire, potentially in a tight loop if the other futures don't complete.

Better approach: Check if time is already elapsed before the select, or use a deadline-based approach with tokio::time::sleep_until.

Security Concerns 🔒

Low Risk Issues

  1. Signal handler abuse: The 3-SIGTERM threshold is good, but a malicious actor could potentially keep a process alive by sending exactly 2 SIGTERMs repeatedly. This is a very minor concern in practice since they could just not send any signals.

  2. No maximum timeout enforcement: The config allows setting arbitrarily long shutdown durations. Consider documenting reasonable limits or adding validation (e.g., max 24 hours).
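The suggested validation could be sketched as follows, with an assumed 24-hour cap and rejection of zero; the limit and function name are illustrative, not part of the PR:

```rust
// Sketch of shutdown-duration validation. The 24h cap mirrors the
// review's example limit; it is an assumption, not a Rivet constant.
const MAX_SHUTDOWN_SECS: u32 = 24 * 60 * 60;

fn validate_shutdown_duration(secs: u32) -> Result<u32, String> {
    if secs == 0 {
        Err("shutdown duration must be nonzero".to_string())
    } else if secs > MAX_SHUTDOWN_SECS {
        Err(format!("shutdown duration {secs}s exceeds the 24h cap"))
    } else {
        Ok(secs)
    }
}
```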

Performance Considerations ⚡

Good Improvements

  1. Removed redundant ctrl_c handling: The old code had both ctrl_c() and signal handlers - good cleanup
  2. Better async coordination: Using FuturesUnordered for concurrent workflow shutdown is more efficient than the old polling approach

Minor Concerns

  1. Tight select loop (service-manager/src/lib.rs:316-352): The main service loop doesn't have any explicit yield points between iterations. While tokio's select handles this well, consider if there's a need for explicit yielding.

  2. Unbounded watch channel (runtime/src/term_signal.rs:40): The watch channel created for shutdown signaling is fine for this use case, but document that this is intentional for broadcast semantics.

Test Coverage 📊

Concerning Gaps

  1. No tests for TermSignal: The new centralized signal handler has no unit tests
  2. No tests for shutdown timeout behavior: Should test that:
    • Worker shutdown respects the timeout
    • Guard shutdown respects the timeout
    • Multiple SIGTERMs trigger abort
  3. Modified test config (engine/tests/common/ctx.rs): Tests now use requires_graceful_shutdown flags but there are no tests validating this behavior

Recommendations

  • Add unit tests for TermSignal::new() and recv() behavior
  • Add integration tests that send actual signals (or mock them) and verify shutdown timing
  • Test the 3-SIGTERM abort threshold
  • Test configuration parsing for the new Runtime config
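The force-close threshold is the one piece of this logic that can be unit-tested without delivering real signals. A sketch that extracts it into a plain counter, assuming the PR's FORCE_CLOSE_THRESHOLD of 3 (the struct here is hypothetical test scaffolding, not the actual TermSignal type):

```rust
// Sketch: the signal-counting logic isolated from OS signal delivery,
// so the 3-SIGTERM abort threshold can be asserted in a unit test.
const FORCE_CLOSE_THRESHOLD: u64 = 3;

struct SignalCounter {
    count: u64,
}

impl SignalCounter {
    fn new() -> Self {
        Self { count: 0 }
    }

    // Returns true once enough termination signals have arrived
    // that the process should force-close.
    fn on_term_signal(&mut self) -> bool {
        self.count += 1;
        self.count >= FORCE_CLOSE_THRESHOLD
    }
}
```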

Additional Observations

Documentation

  1. Good: Rustdoc-style (///) comments on the Runtime struct fields
  2. Missing: No migration guide for users who might have relied on old shutdown behavior
  3. Missing: No explanation of why guard gets 1 hour vs worker gets 30 seconds (presumably because HTTP requests can be long-lived)

Code Removal

The removal of engine/packages/util/src/signal.rs is good - consolidating this logic makes sense. Confirm that no other code depended on this module.

Pegboard Actor Change (pegboard/src/workflows/actor/mod.rs:533-567)

Moving the "allocate pending actors" logic is a good refactoring - makes the shutdown path cleaner. The comment at line 597 explains the rationale well.

Recommendations

Must Fix Before Merge 🔴

  1. Fix the remaining workflows count bug in gasoline worker shutdown
  2. Fix doc comment consistency in runtime.rs
  3. Fix the clippy warning for underscore binding

Should Fix 🟡

  1. Improve shutdown duration timeout handling to avoid potential tight loops
  2. Add basic unit tests for TermSignal
  3. Propagate or better document the expect() in TermSignal initialization

Nice to Have 🟢

  1. Add config validation for maximum shutdown durations
  2. Document the rationale for different default timeouts
  3. Add integration tests for shutdown behavior
  4. Consider the service abort race condition

Overall Assessment

This is a solid improvement to the shutdown handling infrastructure. The centralization of signal handling and configurable timeouts are excellent architectural decisions. The main concerns are:

  • Testing coverage is inadequate for such critical infrastructure
  • A few minor bugs that should be fixed before merge
  • Some edge cases around timing could be handled more robustly

Recommendation: Request changes for the critical bugs, then approve once those are addressed. The testing gaps are concerning but not blocking if this is being monitored in production.


Review generated with Claude Code

@MasterPtato force-pushed the 11-10-fix_improve_sigterm_handling_for_the_entire_runtime branch from 7aff577 to 9f71238 on November 11, 2025 01:44
@MasterPtato force-pushed the 11-07-feat_gas_implement_smart_pulling_for_horizontally_scaled_workers branch from ff49fad to 728f761 on November 11, 2025 01:44
@claude

claude bot commented Nov 11, 2025

Pull Request Review: Improve SIGTERM Handling for the Entire Runtime

This PR introduces a centralized termination signal handler and improves graceful shutdown across the runtime. Overall, this is a well-structured improvement to signal handling. Below are my observations and recommendations:


Strengths

  1. Centralized Signal Handling: The new TermSignal handler in rivet-runtime provides a singleton pattern for signal management across the entire runtime, avoiding duplicate signal handlers.

  2. Configurable Shutdown Timeouts: The new Runtime config allows operators to customize shutdown durations for different components (worker, guard), which is excellent for production flexibility.

  3. Graceful Degradation: The multi-signal handling (counting up to 3 SIGTERMs before forcing shutdown) provides a good balance between graceful and forced termination.

  4. Cross-platform Support: Proper handling of Unix (SIGTERM/SIGINT) and Windows (Ctrl+C) signals.

  5. Better Logging: Improved structured logging throughout the shutdown process makes debugging much easier.


🐛 Potential Bugs & Issues

Critical Issues

  1. Bug in gasoline/worker.rs:326 - Incorrect variable reference:

    tracing::warn!(remaining_workflows=?self.running_workflows.len(), "not all workflows evicted");

    This should use remaining_workflows (calculated on line 322) instead of self.running_workflows.len() which may be stale. The workflows map wasn't cleaned up before this log.

  2. Comment Inconsistency in config/runtime.rs: Lines 12-16 use // instead of /// for doc comments on public struct fields:

    // to 1 hour.  // ← Should be ///
    guard_shutdown_duration: Option<u32>,
    /// Whether or not to allow running the engine when the previous version that was run is higher than
    // the current version.  // ← Should be ///
    allow_version_rollback: Option<bool>,

Moderate Issues

  1. Potential Race Condition in service-manager/lib.rs: The shutting_down AtomicBool is set (line 333), then services are aborted (lines 336-342), but services check the flag in their loops. There's a window where a service might restart before being aborted. Consider aborting services first, then setting the flag, or use a more explicit coordination mechanism.

  2. FuturesUnordered Not Consumed Properly in gasoline/worker.rs:303: The pattern creates a future that waits for all workflows:

    let join_fut = async { while let Some(_) = wf_futs.next().await {} };

    However, this future is used in a tokio::select! that may break early on timeout or abort. After the select, line 322 tries to count remaining workflows by consuming wf_futs, but if join_fut was selected and completed, wf_futs would be empty. The count would be correct, but the logic is confusing.

  3. Unbounded Retry in Service Manager: Services that crash will retry indefinitely (lines 161-183). While shutdown detection exists, during normal operation a failing service could spam logs and consume resources. Consider a circuit breaker or exponential backoff.


🔒 Security Concerns

  1. Panic in Critical Path (term_signal.rs:88):

    .expect("failed initializing termination signal handler")

    While unlikely, if signal handler initialization fails, the process panics. Consider returning a Result to the caller and handling initialization failures more gracefully at the application level.

  2. No Validation on Shutdown Durations: The worker_shutdown_duration and guard_shutdown_duration accept any u32 value. Consider adding validation or warnings for unreasonable values (e.g., 0 seconds or years).


Performance Considerations

  1. Cloning in service-manager: Each service clones config and pools on every restart (lines 162, 199, 235, 279). While Rust's Arc-based cloning is cheap, ensure these types are indeed cheap to clone.

  2. Busy Loop in Guard Shutdown (guard-core/server.rs:257-274): The shutdown loop continuously polls the term signal and timeout. While not critical, this could be restructured to avoid unnecessary CPU cycles.


🧪 Test Coverage

  1. No Tests for New Code: The new TermSignal handler and shutdown logic lack unit tests. Consider adding:

    • Tests for signal counting and threshold behavior
    • Tests for timeout behavior in worker shutdown
    • Integration tests for graceful shutdown scenarios
    • Tests for the requires_graceful_shutdown flag behavior
  2. No Tests for Configuration: The new Runtime config struct and its accessor methods lack tests for default values and edge cases.


📝 Code Quality & Best Practices

  1. Inconsistent Error Handling: Some places use anyhow::Result with glob imports (anyhow::*), others import explicitly. Per CLAUDE.md: "Do not glob import (::*) from anyhow". The guard-core changes correctly import anyhow::Result explicitly, but ensure consistency across all modified files.

  2. Magic Numbers: FORCE_CLOSE_THRESHOLD = 3 in term_signal.rs:14 could be configurable or at least documented why 3 is the right number.

  3. Unused Variable Pattern (service-manager/lib.rs:324):

    while let Some(_) = handle_futs.next().await {}

    Consider while handle_futs.next().await.is_some() {} for clarity, though both are idiomatic.

  4. Documentation Missing: The new TermSignal::recv() method documents its return value well, but TermSignal::new() and TermSignal::stop() lack doc comments.


💡 Recommendations

  1. Fix the critical bug on gasoline/worker.rs:326
  2. Fix doc comment formatting in config/runtime.rs
  3. Add validation for shutdown duration config values
  4. Add unit tests for TermSignal behavior, especially the force-shutdown threshold
  5. Add integration tests for graceful shutdown scenarios
  6. Consider exponential backoff for service restarts
  7. Document the magic number (3) for force shutdown threshold
  8. Add doc comments to all public APIs

📊 Summary

Overall Assessment: Good improvement to signal handling infrastructure, but needs bug fixes and test coverage before merging.

Blockers:

  • Bug in worker.rs:326 (incorrect variable)
  • Doc comment formatting in runtime.rs

Recommended Before Merge:

  • Add tests for TermSignal
  • Add tests for shutdown behavior
  • Validate shutdown duration configs

Nice to Have:

  • Exponential backoff for service restarts
  • More comprehensive documentation
  • Circuit breaker for failing services

Let me know if you'd like me to elaborate on any of these points or if you have questions about the recommendations!

@claude

claude bot commented Nov 11, 2025

PR Review: Improve SIGTERM Handling for the Entire Runtime

Summary

This PR significantly improves graceful shutdown handling across the Rivet runtime. The changes are well-structured and follow good practices for distributed system lifecycle management.

Positive Aspects ✅

1. Excellent Architecture

  • Centralized signal handling: The new TermSignal implementation in rivet-runtime provides a clean, reusable abstraction for signal handling across all services
  • Configurable timeouts: Moving shutdown durations to configuration is a great improvement for operational flexibility
  • Graceful degradation: The force-close threshold (3 SIGTERMs) provides a good escape hatch

2. Well-Implemented Shutdown Logic

  • Worker shutdown (gasoline/worker.rs:270-330): Clean implementation using FuturesUnordered to track workflow completion
  • Guard shutdown (guard-core/server.rs:252-278): Proper handling of pending HTTP requests with configurable timeout
  • Service manager (service-manager/lib.rs:316-353): Smart differentiation between services requiring graceful shutdown vs. immediate abort

3. Code Quality

  • Follows repository conventions (structured logging with ? syntax, lowercase messages)
  • Good use of saturating_sub to prevent underflow in timeout calculations
  • Proper atomic boolean usage for coordinating shutdown state

Issues & Recommendations 🔍

Critical Issues

1. Bug in worker shutdown (gasoline/worker.rs:322-326)

let remaining_workflows = wf_futs.into_iter().count();
if remaining_workflows == 0 {
    tracing::info!("all workflows evicted");
} else {
    tracing::warn!(remaining_workflows=?self.running_workflows.len(), "not all workflows evicted");
}

Problem: After the loop, wf_futs has been consumed by wf_futs.next().await, so into_iter().count() always returns 0. Additionally, the warning message uses self.running_workflows.len() instead of remaining_workflows.

Fix: Track remaining workflows before the iterator is consumed:

let remaining_workflows = wf_futs.len();
// ... loop ...
if remaining_workflows == 0 {
    tracing::info!("all workflows evicted");
} else {
    tracing::warn!(?remaining_workflows, "not all workflows evicted");
}

2. Comment inconsistency (config/runtime.rs:11-12, 14-15)
Comments use // instead of /// for doc comments. Should be:

/// Time (in seconds) to allow for guard to wait for pending requests after receiving SIGTERM. Defaults
/// to 1 hour.
guard_shutdown_duration: Option<u32>,
/// Whether or not to allow running the engine when the previous version that was run is higher than
/// the current version.
allow_version_rollback: Option<bool>,

Medium Priority Issues

3. Timeout calculation edge case (gasoline/worker.rs:315)

_ = tokio::time::sleep(shutdown_duration.saturating_sub(shutdown_start.elapsed())) => {

If shutdown_start.elapsed() exceeds shutdown_duration, this creates a zero-duration sleep that will fire immediately on every loop iteration. While functionally correct (will break on next iteration), it's inefficient.

Consider: Break immediately if timeout already exceeded:

let elapsed = shutdown_start.elapsed();
if elapsed >= shutdown_duration {
    tracing::warn!("worker shutdown timed out");
    break;
}
_ = tokio::time::sleep(shutdown_duration - elapsed) => {

4. Missing newline at end of file (service-manager/Cargo.toml)
The file ends without a newline after tracing.workspace = true - this violates POSIX standards.

Minor Issues

5. Inconsistent import style (guard-core/server.rs:10)

use anyhow::Result;

This imports a single type, while the CLAUDE.md emphasizes not to glob import from anyhow. This is good, but consider consistency - other files use anyhow::* patterns that should be avoided per the style guide.

6. Windows signal handling (runtime/term_signal.rs:61-64)
The Windows implementation only handles Ctrl+C, not other termination signals. This is likely fine for the use case, but worth documenting that SIGTERM-equivalent behavior on Windows is limited.

Performance Considerations ⚡

Positive:

  • Using FuturesUnordered for concurrent task management is efficient
  • Atomic bool for shutdown signaling has minimal overhead
  • Graceful shutdown prevents dropped connections and lost work

No concerns about performance impact from these changes.

Security Considerations 🔒

Low Risk: This PR primarily affects shutdown behavior, which has minimal security implications.

Positive: Force-close threshold prevents indefinite hangs that could be exploited in DoS scenarios.

Test Coverage 🧪

Observation: No new tests added for the shutdown behavior. While testing shutdown logic can be complex, consider:

  • Unit tests for TermSignal to verify signal counting and force-close threshold
  • Integration tests that verify graceful shutdown actually waits for running workflows/requests
  • Tests for the timeout behavior

Architecture Considerations 🏗️

Question about pegboard changes (pegboard/workflows/actor/mod.rs:533-547):
The removal of the AllocatePendingActors logic from the actor cleanup path seems unrelated to SIGTERM handling. The comment at line 536-537 mentions "the actor is not allocated" - but this appears to be a behavioral change beyond shutdown improvements. Can you clarify if this is:

  1. A bug fix that was discovered during shutdown testing?
  2. An optimization to move this logic elsewhere?
  3. Related to shutdown behavior in a non-obvious way?

Additional Notes

Great work on:

  • Clean separation of concerns (signal handling, configuration, shutdown logic)
  • Making timeouts configurable for different operational needs
  • Comprehensive handling across all service types (worker, guard, cron, oneshot)

Documentation: Consider adding a section to docs about:

  • The new runtime configuration options
  • Expected shutdown behavior and timeouts
  • What happens when force-close threshold is reached

Recommendation

Approve with minor fixes. Please address the critical bug in worker.rs:322-326 and the comment style in runtime.rs. The other issues are lower priority but worth considering for a follow-up PR.

Great improvement to the runtime's operational reliability! 🎉


🤖 Generated with Claude Code

@MasterPtato force-pushed the 11-10-fix_improve_sigterm_handling_for_the_entire_runtime branch from 9f71238 to 09cec26 on November 11, 2025 22:35
@MasterPtato force-pushed the 11-07-feat_gas_implement_smart_pulling_for_horizontally_scaled_workers branch from 728f761 to 09a68bc on November 11, 2025 22:35
@claude

claude bot commented Nov 11, 2025

Code Review for PR #3443: Improve SIGTERM Handling

I've reviewed this PR which improves graceful shutdown handling across the Rivet runtime. Overall, this is a solid improvement to the shutdown infrastructure with good architecture. Here are my findings:


✅ Strengths

Architecture & Design

  • Centralized signal handling: The new TermSignal in rivet-runtime provides a clean, reusable abstraction for SIGTERM/SIGINT handling across all services
  • Configurable timeouts: Moving shutdown durations to config (runtime.rs) allows operators to tune behavior per deployment
  • Graceful degradation: The 3-SIGTERM threshold provides a good escape hatch for forcing shutdown
  • Cross-platform support: Proper handling of both Unix signals and Windows Ctrl+C

Code Quality

  • Structured logging: Good use of tracing with structured fields (?err, ?attempt, etc.) following project conventions
  • Error handling: Proper use of anyhow's Result type and context methods
  • Task naming: Named tasks (tokio::task::Builder::new().name()) greatly improve observability

🐛 Issues Found

Critical

1. Bug in worker.rs:326 - Incorrect variable used

tracing::warn!(remaining_workflows=?self.running_workflows.len(), "not all workflows evicted");

Should use remaining_workflows variable instead of self.running_workflows.len():

tracing::warn!(remaining_workflows=%remaining_workflows, "not all workflows evicted");

Location: engine/packages/gasoline/src/worker.rs:326

Medium Priority

2. Documentation inconsistency in runtime.rs
Lines 11-12 and 14-15 use single-line comments (//) instead of doc comments (///):

/// Time (in seconds) to allow for guard to wait for pending requests after receiving SIGTERM. 
/// Defaults to 1 hour.
guard_shutdown_duration: Option<u32>,
/// Whether or not to allow running the engine when the previous version that was run is higher 
/// than the current version.
allow_version_rollback: Option<bool>,

Location: engine/packages/config/src/config/runtime.rs:11-16

3. Unused variable pattern in loops
Multiple locations use while let Some(_) = ... with unused bindings. Consider:

// Current
while let Some(_) = wf_futs.next().await {}

// Suggested
while wf_futs.next().await.is_some() {}

Locations:

  • engine/packages/gasoline/src/worker.rs:303
  • engine/packages/service-manager/src/lib.rs:324

4. Potential race condition in shutdown flag
In service-manager, the shutting_down atomic flag is checked after service returns, but there's a brief window where a service could restart before the flag is set. Consider setting the flag before services are notified of shutdown.

Location: engine/packages/service-manager/src/lib.rs:332-333


💡 Suggestions

Performance

5. FuturesUnordered efficiency
In worker.rs:294-298, converting to FuturesUnordered and then immediately collecting back to count remaining is inefficient:

let remaining_workflows = wf_futs.into_iter().count();

Consider tracking this separately or using len() if available.

Code Style

6. Import organization
Good use of workspace dependencies and following project patterns. One minor note: service-manager/Cargo.toml is missing a newline at EOF (line 339).

7. Consistent error messages
Most error messages properly use lowercase (following CLAUDE.md conventions), but double-check:

  • worker.rs:286: "stop channel closed, workflow likely already stopped" ✅
  • guard-core/src/server.rs:234: "accept error on HTTPS port" ✅

🔒 Security Considerations

8. Signal handler initialization
The TermSignalHandler::new() uses .expect() at initialization (term_signal.rs:88), which will panic if signal setup fails. While this is likely intentional (can't proceed without signal handling), consider:

  • Adding a clear comment explaining why panic is acceptable here
  • Or propagating the error and handling it at the top level

9. Watch channel unbounded growth
The watch::channel in TermSignal doesn't bound the number of subscribers. While unlikely to be an issue, consider documenting expected subscriber count.


⚡ Performance Notes

10. Shutdown timing precision
Good use of saturating_sub for duration calculations. However, be aware that on very slow systems, the tokio::time::sleep might not fire exactly when expected if shutdown_duration has already elapsed.

11. Task spawning overhead
The PR properly uses tokio::spawn instead of tokio::task::spawn (cleaner). The named tasks add negligible overhead but huge debugging value.


📋 Test Coverage

The PR doesn't include new tests. Consider adding:

  1. Unit tests for TermSignal behavior (multiple signals, threshold crossing)
  2. Integration tests for graceful shutdown timing
  3. Tests for the abort path (3+ SIGTERMs)

🎯 Summary

Overall Assessment: This is a well-architected improvement that meaningfully enhances the runtime's shutdown behavior. The centralized signal handling is a solid pattern.

Required Changes:

  • Fix the critical bug in worker.rs:326 (wrong variable)
  • Update documentation comments to use /// instead of //

Recommended Changes:

  • Add test coverage for the new shutdown paths
  • Consider the race condition in service manager shutdown ordering
  • Review the expect() in TermSignal initialization

Estimated Risk: Low - The changes are well-isolated and improve existing functionality without breaking APIs.

Great work on improving the reliability of the shutdown sequence! 🚀

@claude

claude bot commented Nov 11, 2025

Code Review: Improved SIGTERM Handling

This PR significantly improves the shutdown behavior across the Rivet runtime. The changes are well-structured and demonstrate a thoughtful approach to graceful shutdown handling. Below is my comprehensive review:


✅ Strengths

  1. Centralized Signal Handling: The new TermSignal implementation in rivet-runtime provides a clean, singleton-based approach that prevents duplicate signal handlers and ensures consistent behavior across all services.

  2. Configurable Timeouts: Moving shutdown durations to configuration (runtime.rs) is excellent - it allows operators to tune these values based on their workload characteristics without code changes.

  3. Graceful Degradation: The multi-SIGTERM approach (3 signals before force abort) gives operators clear control over shutdown urgency while protecting against accidental terminations.

  4. Cross-Platform Support: Proper handling of both Unix (SIGTERM/SIGINT) and Windows (Ctrl+C) signals with conditional compilation.

  5. Improved Service Manager: The distinction between services requiring graceful shutdown vs. those that can be immediately aborted is a smart optimization.


🐛 Potential Issues

Critical: Bug in Worker Shutdown (line 326)

tracing::warn!(remaining_workflows=?self.running_workflows.len(), "not all workflows evicted");

This references self.running_workflows.len() but should use remaining_workflows (the variable calculated on line 322). The running_workflows HashMap was already consumed by iter_mut() at line 296, making this misleading.

Fix:

tracing::warn!(?remaining_workflows, "not all workflows evicted");

Logic Issue: Timeout Calculation (lines 315, 269)

The saturating_sub pattern for timeout could cause the sleep to complete immediately if the duration has already elapsed:

tokio::time::sleep(shutdown_duration.saturating_sub(shutdown_start.elapsed()))

If processing takes longer than expected before entering the loop, this could timeout immediately. Consider checking if time remains before entering the sleep:

let remaining = shutdown_duration.saturating_sub(shutdown_start.elapsed());
if remaining.is_zero() {
    tracing::warn!("worker shutdown timed out");
    break;
}
tokio::select! {
    _ = tokio::time::sleep(remaining) => { /* timeout handling */ }
    // ...other select branches unchanged
}

Minor: Inconsistent Comment Style (lines 11-12, 14-15 in runtime.rs)

Uses // instead of /// for doc comments:

/// Time (in seconds) to allow for guard to wait for pending requests after receiving SIGTERM. Defaults
// to 1 hour.  // ← Should be ///
guard_shutdown_duration: Option<u32>,
/// Whether or not to allow running the engine when the previous version that was run is higher than
// the current version.  // ← Should be ///
allow_version_rollback: Option<bool>,

🔒 Security Considerations

  1. Signal Handler Safety: The implementation correctly uses async-safe operations. The OnceCell pattern prevents race conditions during initialization.

  2. Graceful Shutdown Window: The 1-hour default for guard_shutdown_duration is generous but reasonable for long-running requests. Operators should tune this based on their SLAs.

  3. No Forced Cleanup: When workflows/requests don't complete in time, they're simply logged but not forcefully killed. This is safe but may leave resources hanging. Consider documenting cleanup expectations.


⚡ Performance Considerations

  1. FuturesUnordered Usage: Good choice for waiting on multiple workflows/services concurrently (lines 298, 322).

  2. Watch Channel Overhead: The watch::channel for shutdown signaling is efficient - subscribers only wake when the value changes.

  3. Avoid Polling: The removal of the 2-second polling loop (old code) in favor of proper async waiting is a nice performance improvement.


🧪 Test Coverage

Missing:

  • No tests for the new TermSignal implementation
  • No tests for timeout behavior in worker/guard shutdown
  • No tests for the multi-SIGTERM abort logic

Recommendation: Add integration tests that:

  • Verify graceful shutdown completes when workflows finish quickly
  • Verify timeout triggers when workflows take too long
  • Verify abort logic on multiple SIGTERMs
  • Test the configuration defaults

📝 Code Quality & Style

  1. Logging: Excellent use of structured logging with proper context (e.g., ?workflow_id, ?err, ?remaining_workflows). Follows CLAUDE.md guidelines.

  2. Error Handling: Appropriate use of Result and anyhow. The .expect() on line 88 is acceptable for initialization-time failures.

  3. Documentation: The code is generally well-commented. The comment on lines 283-285 explaining why allocation happens here is helpful context.


🎯 Recommendations

  1. Fix the critical bug on line 326 of gasoline/src/worker.rs
  2. Add a time-remaining check before the timeout sleep to avoid edge cases
  3. Fix doc comment style in config/runtime.rs
  4. Add unit/integration tests for shutdown behavior
  5. Document the expected behavior when shutdown times out (what happens to in-flight work?)

Summary

This is a solid improvement to the shutdown handling with good architectural decisions. The bugs identified are fixable and relatively minor. The centralized TermSignal approach is clean and reusable. Once the issues are addressed, this should provide much more reliable shutdown behavior for the Rivet runtime.

Recommendation: Approve with requested changes to fix the identified bugs.

@NathanFlurry NathanFlurry force-pushed the 11-10-fix_improve_sigterm_handling_for_the_entire_runtime branch from 09cec26 to f6632bf Compare November 13, 2025 17:20
@NathanFlurry NathanFlurry force-pushed the 11-07-feat_gas_implement_smart_pulling_for_horizontally_scaled_workers branch from 09a68bc to 74f8abc Compare November 13, 2025 17:20
@claude

claude bot commented Nov 13, 2025

Pull Request Review: Improved SIGTERM Handling

This PR introduces a centralized termination signal handling system with configurable graceful shutdown timeouts. Overall, the implementation is solid and follows good patterns, but there are several issues that should be addressed.


🐛 Critical Issues

1. Bug in worker.rs:326 - Incorrect variable used

// Line 326
tracing::warn!(remaining_workflows=?self.running_workflows.len(), "not all workflows evicted");

Issue: Uses self.running_workflows.len() instead of the remaining_workflows variable calculated on line 322.

Fix: Change to remaining_workflows=?remaining_workflows

Location: engine/packages/gasoline/src/worker.rs:326


2. Comment formatting inconsistency in runtime.rs

// Lines 11-12 and 14-15
/// Time (in seconds) to allow for guard to wait for pending requests after receiving SIGTERM. Defaults
// to 1 hour.

/// Whether or not to allow running the engine when the previous version that was run is higher than
// the current version.

Issue: Inconsistent use of /// vs // in multi-line doc comments. The continuation lines use // instead of ///.

Fix: Use /// consistently for all doc comment lines:

/// Time (in seconds) to allow for guard to wait for pending requests after receiving SIGTERM. 
/// Defaults to 1 hour.

Location: engine/packages/config/src/config/runtime.rs:11-15


3. Potential timing issue in shutdown loop

In both worker.rs:315 and guard-core/src/server.rs:269, the timeout calculation uses saturating_sub:

_ = tokio::time::sleep(shutdown_duration.saturating_sub(shutdown_start.elapsed())) => {

Issue: If shutdown_start.elapsed() exceeds shutdown_duration, saturating_sub returns Duration::ZERO, causing tokio::time::sleep(Duration::ZERO) to resolve immediately. This creates a busy loop that will continuously fire the timeout branch until another event occurs.

Recommendation: Add a check to break immediately if the timeout has already elapsed:

let remaining = shutdown_duration.saturating_sub(shutdown_start.elapsed());
if remaining.is_zero() {
    tracing::warn!("shutdown timed out");
    break;
}
tokio::select! {
    _ = tokio::time::sleep(remaining) => {
        tracing::warn!("shutdown timed out");
        break;
    }
    // ...other select branches unchanged
}

Locations:

  • engine/packages/gasoline/src/worker.rs:315
  • engine/packages/guard-core/src/server.rs:269

⚠️ Security & Performance Concerns

4. Missing error handling in TermSignal initialization

// term_signal.rs:88
.expect("failed initializing termination signal handler");

Issue: Uses .expect() which will panic if signal handler initialization fails. This could happen in containerized environments with restricted signal handling capabilities.

Recommendation: Return a Result from TermSignal::new() and handle errors gracefully at call sites, possibly with a fallback mechanism.

Location: engine/packages/runtime/src/term_signal.rs:88


5. Unbounded signal handler task

The TermSignalHandler::run() spawned task runs in an infinite loop and is only stopped by aborting the task.

Concern: The loop in term_signal.rs:52 has no exit condition, relying on task abortion for cleanup.

Recommendation: While this is acceptable for a global signal handler, consider adding a shutdown channel for cleaner lifecycle management.

Location: engine/packages/runtime/src/term_signal.rs:51-78


📋 Code Quality Issues

6. Unused import warning

use anyhow::{Context, Result, ensure};

Issue: The ensure macro is imported but may not be used in service-manager/src/lib.rs (only in replace_service which uses it).

Note: Verify with cargo clippy - if unused, remove it.

Location: engine/packages/service-manager/src/lib.rs:11


7. Inconsistent spacing in service initialization

The run_config.rs file shows inconsistent spacing in the Service::new() calls. Some have line breaks before parameters, some don't.

Recommendation: Apply consistent formatting across all Service::new() calls. Consider running cargo fmt on this file specifically (though I understand the note about not running it automatically).

Location: engine/packages/engine/src/run_config.rs:6-51


8. Missing newline at end of file

# service-manager/Cargo.toml
-tracing.workspace = true
\ No newline at end of file
+tracing.workspace = true

Issue: The Cargo.toml file is missing a trailing newline.

Location: engine/packages/service-manager/Cargo.toml:19


📝 Documentation & Testing

9. No tests for TermSignal

The new TermSignal implementation lacks unit tests. Testing signal handling is tricky but important.

Recommendation: Add tests that:

  • Verify the watch channel broadcasts correctly
  • Test the force-close threshold behavior (3 signals)
  • Verify cross-platform compilation (Unix vs Windows)

10. Missing documentation for public API

The TermSignal::recv() method has a doc comment, but TermSignal::new() and TermSignal::stop() don't.

Recommendation: Add doc comments explaining:

  • new(): Singleton behavior via OnceCell
  • stop(): When and why you'd want to manually stop the handler

Location: engine/packages/runtime/src/term_signal.rs:84-112


What's Done Well

  1. Excellent separation of concerns - Moving signal handling to a centralized TermSignal type in the runtime package
  2. Good configurability - Making shutdown timeouts configurable via Runtime config struct
  3. Graceful degradation - The multi-stage shutdown (graceful → timeout → force abort) is well designed
  4. Cross-platform support - Properly handling Unix and Windows signal differences
  5. Proper use of structured logging - Good use of tracing with structured fields
  6. Clear intent with requires_graceful_shutdown flag - Allows fine-grained control over which services need graceful shutdown

🔍 Additional Observations

Removed functionality

The PR removes packages/util/src/signal.rs (49 lines). Verify that no other code was depending on this module.

Workflow allocation logic moved

The comment in pegboard/src/workflows/actor/mod.rs:594-596 explains why actor allocation is done from the actor workflow rather than the runner workflow. Good documentation!


📊 Summary

Must Fix:

  • Wrong variable logged in gasoline/src/worker.rs:326
  • Doc comment style in config/runtime.rs
  • Busy-loop risk in the timeout calculation (worker.rs:315, server.rs:269)

Should Fix:

  • .expect() in TermSignal initialization (term_signal.rs:88)
  • Missing trailing newline in service-manager/Cargo.toml
  • Possible unused ensure import (verify with cargo clippy)

Nice to Have:

  • Unit tests for TermSignal
  • Doc comments for TermSignal::new() and TermSignal::stop()
  • Consistent formatting in run_config.rs

The core architecture is sound and this is a valuable improvement to the codebase's shutdown behavior. Once the critical bugs are addressed, this should be good to merge.

@claude

claude bot commented Nov 13, 2025

PR Review: Improve SIGTERM Handling for the Entire Runtime

This is a well-structured PR that significantly improves graceful shutdown handling across the Rivet runtime. The changes are comprehensive and demonstrate good systems programming practices.

✅ Strengths

  1. Centralized Signal Handling: The new TermSignal implementation in rivet-runtime provides a clean, cross-platform abstraction that's properly shared across the codebase via a global handler.

  2. Configurable Timeouts: Adding Runtime config with separate timeouts for workers (30s) and guard (1h) is sensible and reflects the different shutdown needs of these components.

  3. Progressive Shutdown: The 3-SIGTERM threshold for forced shutdown gives operators control without being overly aggressive.

  4. Improved Clarity: The service manager's requires_graceful_shutdown flag makes shutdown behavior explicit and easier to reason about.

🐛 Potential Issues

Critical: Race Condition in Worker Shutdown (gasoline/worker.rs:322-326)

let remaining_workflows = wf_futs.into_iter().count();
if remaining_workflows == 0 {
    tracing::info!("all workflows evicted");
} else {
    tracing::warn!(remaining_workflows=?self.running_workflows.len(), "not all workflows evicted");
}

Issue: You're counting wf_futs (the futures collection) but logging self.running_workflows.len(). After the shutdown loop, these may be out of sync. The warning message will show the wrong count.

Fix: Use remaining_workflows consistently:

tracing::warn!(remaining_workflows, "not all workflows evicted");

Medium: Comment Inconsistency (config/runtime.rs:11-12, 14-15)

The comments use // instead of /// for doc comments and have inconsistent comment style:

/// Time (in seconds) to allow for guard to wait for pending requests after receiving SIGTERM. Defaults
// to 1 hour.  ← Should be ///
guard_shutdown_duration: Option<u32>,
/// Whether or not to allow running the engine when the previous version that was run is higher than
// the current version.  ← Should be ///
allow_version_rollback: Option<bool>,

Low: Unused Variable (gasoline/worker.rs:303)

let join_fut = async { while let Some(_) = wf_futs.next().await {} };

The pattern Some(_) triggers clippy warnings. Use explicit ignore:

while wf_futs.next().await.is_some() {}

Same issue in service-manager/src/lib.rs:324.

Low: Windows Signal Handling

On Windows, CtrlC only handles Ctrl+C, not service stop signals. Consider documenting this limitation or using tokio::signal::windows::ctrl_break() as well for better service integration.

🎯 Performance Considerations

  1. FuturesUnordered Efficiency: Good choice for concurrent shutdown tracking - it's more efficient than polling each future individually.

  2. Watch Channel Overhead: The watch channel broadcasts to all subscribers on every SIGTERM. With many workers, this could create a small spike, but it's acceptable given the shutdown scenario.

🔒 Security Concerns

No security issues identified. The shutdown handling properly:

  • Prevents resource leaks by tracking all tasks
  • Allows graceful completion of in-flight work
  • Provides forced shutdown as a safety valve

📊 Test Coverage

The PR doesn't show test updates. Consider:

  1. Unit tests for TermSignal behavior (single/multiple signals, threshold)
  2. Integration tests verifying graceful shutdown completes workflows
  3. Timeout tests ensuring forced shutdown works

🧹 Code Quality

Positive:

  • Follows Rust idioms and the repo's error handling patterns
  • Good use of structured logging
  • Proper separation of concerns

Minor suggestions:

  • runtime.rs:21,25: The as u64 cast is safe but could use u64::from() for clarity
  • term_signal.rs:88: expect() with panic is acceptable here since initialization failure is fatal, but consider logging the error before panicking
  • service-manager: The shutting_down AtomicBool could be a Mutex for SeqCst consistency, but the performance difference is negligible and SeqCst ordering is correct

📝 Documentation

Good inline comments explaining the shutdown flow. Consider adding:

  1. Module-level docs for term_signal.rs explaining the singleton pattern
  2. Example in Runtime config showing how to configure timeouts

🎨 Style Compliance

Follows the repo's style guide:

  • ✅ Uses structured logging with field syntax
  • ✅ Lowercase log messages
  • ✅ Proper error context with anyhow
  • ✅ Workspace dependencies correctly used

🔄 Architecture Impact

Breaking Changes:

  • Service::new() signature changed (added requires_graceful_shutdown parameter)
  • allow_version_rollback moved to config.runtime.allow_version_rollback()

Both are internal APIs, so acceptable.

Final Recommendation: Approve with Minor Changes

The core implementation is solid. Please address:

  1. ⚠️ Critical: Fix the race condition in worker shutdown logging
  2. ⚠️ Medium: Fix doc comment syntax in runtime.rs
  3. ℹ️ Low: Address clippy warnings for unused variables

The PR significantly improves operational robustness and is ready to merge after these fixes.

@claude

claude bot commented Nov 13, 2025

Code Review: Improved SIGTERM Handling for Runtime

This PR centralizes and improves SIGTERM signal handling across the entire runtime. Overall, this is a well-designed improvement to the codebase with good architectural decisions. Below are my findings:


Strengths

  1. Excellent Architecture: Centralizing signal handling in rivet-runtime is the right move. The singleton pattern with OnceCell ensures all services share the same signal handler.

  2. Cross-platform Support: Properly handles both Unix (SIGTERM/SIGINT) and Windows (Ctrl+C) signals.

  3. Graceful Degradation: The 3-SIGTERM threshold for forced shutdown is a good UX pattern that gives operators control.

  4. Configurable Timeouts: Adding Runtime config for shutdown durations is flexible and follows best practices.

  5. Clean Service Abstraction: The requires_graceful_shutdown flag on services is a clean way to differentiate shutdown behavior.

  6. Good Logging: Structured logging throughout shutdown sequences aids debugging.


Issues Found

1. Bug in gasoline/worker.rs:326 ⚠️ HIGH PRIORITY

tracing::warn!(remaining_workflows=?self.running_workflows.len(), "not all workflows evicted");

Problem: This uses self.running_workflows.len() but should use remaining_workflows (computed on line 322).

Fix:

tracing::warn!(remaining_workflows, "not all workflows evicted");

2. Inconsistent Comment Style in config/runtime.rs

Lines 11-12 and 14-15 use // instead of /// for doc comments:

/// Time (in seconds) to allow for guard to wait for pending requests after receiving SIGTERM. Defaults
// to 1 hour.  // ← Should be ///
guard_shutdown_duration: Option<u32>,
/// Whether or not to allow running the engine when the previous version that was run is higher than
// the current version.  // ← Should be ///

Per CLAUDE.md conventions, use lowercase for log messages, so these should also be lowercase.


3. Potential Race Condition in TermSignalHandler::run() ⚠️

Lines 74-76:

if self.tx.send(self.count >= FORCE_CLOSE_THRESHOLD).is_err() {
    tracing::debug!("no sigterm subscribers");
}

Problem: If all subscribers drop, the handler continues running in a loop, spamming logs on each subsequent SIGTERM.

Suggestion: Consider breaking the loop if there are no subscribers:

if self.tx.send(self.count >= FORCE_CLOSE_THRESHOLD).is_err() {
    tracing::debug!("no sigterm subscribers, stopping handler");
    break;
}

4. Memory Leak in TermSignalHandler

The TermSignalHandler::run() spawned task (line 91) is never properly cleaned up. TermSignal::stop() calls abort(), but:

  • The JoinHandle is never awaited
  • On abort, resources may not be properly cleaned up
  • The OnceCell remains initialized forever

Recommendation: Document this as intentional if the handler should live for the process lifetime, or implement proper cleanup.


5. Misleading Variable in gasoline/worker.rs:303

let join_fut = async { while let Some(_) = wf_futs.next().await {} };

Issue: The while let Some(_) ignores task results. If a workflow panics, the JoinHandle returns an error, but it's silently dropped.

Suggestion: Log errors from joined tasks:

let join_fut = async {
    while let Some(result) = wf_futs.next().await {
        if let Err(err) = result {
            tracing::error!(?err, "workflow task panicked during shutdown");
        }
    }
};

Same issue exists in service-manager/lib.rs:324.


6. Zero-Duration Timeout Edge Case

Lines like gasoline/worker.rs:315:

_ = tokio::time::sleep(shutdown_duration.saturating_sub(shutdown_start.elapsed())) => {

If shutdown_duration is 0 (unlikely but possible with config), saturating_sub returns Duration::ZERO, causing immediate timeout.

Recommendation: Add validation in Runtime accessors to enforce minimum durations (e.g., 1 second).


7. Inconsistent Logging Capitalization

Per CLAUDE.md:

Log messages should be lowercase unless mentioning specific code symbols

Several violations:

  • guard/server.rs:144: "Accept error on HTTP port""accept error on HTTP port" ✅ (fixed)
  • runtime/term_signal.rs:36: "initialized termination signal handler" ✅ (correct)

Good adherence overall!


Performance Considerations

  1. Busy Loop in Shutdown: The loop { tokio::select! { ... } } pattern in shutdown sequences is efficient. No concerns.

  2. FuturesUnordered Overhead: Using FuturesUnordered for joining tasks is optimal for this use case.


Security Concerns

  1. Signal Handler Panics: If TermSignalHandler::new() fails (line 87-88), it panics via expect(). This could be triggered by resource exhaustion.

    • Mitigation: Document that this is a fatal error since signal handling is critical.
  2. No Privilege Escalation Risk: Signal handling is process-scoped, no issues.


Test Coverage

⚠️ No new tests were added for the new term_signal.rs module.

Recommendations:

  1. Add unit tests for TermSignal to verify:

    • Correct signal counting
    • Threshold behavior (3rd signal triggers abort)
    • Multiple subscribers receiving notifications
  2. Add integration tests for graceful shutdown with configurable timeouts.


Suggested Changes Summary

  1. Fix bug in gasoline/worker.rs:326 (use remaining_workflows variable)
  2. Fix comment style in config/runtime.rs (use /// instead of //)
  3. Consider breaking loop in term_signal.rs:74 when no subscribers remain
  4. Log panicked tasks in shutdown futures
  5. Add minimum duration validation in Runtime config
  6. Add test coverage for TermSignal

Verdict

Approve with minor changes recommended. The architecture is solid and the implementation is clean. The bug in line 326 should be fixed before merge. Other issues are minor or stylistic.

Great work on this refactor! The centralized signal handling will make the codebase much more maintainable.

@NathanFlurry NathanFlurry force-pushed the 11-07-feat_gas_implement_smart_pulling_for_horizontally_scaled_workers branch from 0014809 to 4dad8e2 Compare November 13, 2025 22:27
@NathanFlurry NathanFlurry force-pushed the 11-10-fix_improve_sigterm_handling_for_the_entire_runtime branch from f53a47c to 2990334 Compare November 13, 2025 22:27
@claude

claude bot commented Nov 13, 2025

Code Review: Improved SIGTERM Handling for Runtime

This PR improves graceful shutdown handling across the runtime by centralizing signal handling and adding configurable shutdown durations. Overall, this is a solid improvement to the shutdown logic. Below are my findings:

✅ Strengths

  1. Centralized Signal Handling: The new TermSignal implementation in rivet-runtime provides a clean, reusable abstraction with a singleton pattern using OnceCell.

  2. Cross-platform Support: Properly handles both Unix (SIGTERM/SIGINT) and Windows (Ctrl+C) signals with conditional compilation.

  3. Graceful vs Force Shutdown: The 3-signal threshold (FORCE_CLOSE_THRESHOLD) allows operators to force shutdown if graceful shutdown is stuck.

  4. Configurable Timeouts: New Runtime config struct allows customizing shutdown durations for workers (30s default) and guard (1hr default).

  5. Better Async Handling: Using FuturesUnordered for concurrent workflow shutdown in gasoline worker is more efficient than the previous polling approach.

🐛 Potential Issues

Critical

  1. Bug in Worker Shutdown (gasoline/src/worker.rs:326):

    tracing::warn!(remaining_workflows=?self.running_workflows.len(), "not all workflows evicted");

    This logs self.running_workflows.len() but should log remaining_workflows which was just computed on line 322. The running_workflows HashMap was never cleaned up, so this will show the wrong count.

  2. Signal Handler Resource Leak (runtime/src/term_signal.rs:108-112):
    The stop() function only aborts the handler task but doesn't clean up the HANDLER_CELL. If someone calls TermSignal::new() again after stop(), they'll get a receiver connected to an aborted task. Consider:

    pub fn stop() {
        if let Some((_, join_handle)) = HANDLER_CELL.take() {
            join_handle.abort();
        }
    }

    Note: This requires making HANDLER_CELL a OnceLock or using interior mutability.

Major

  1. Incomplete Shutdown in Guard (guard-core/src/server.rs:255-274):
    The guard shutdown creates a boxed future but doesn't handle the case where the graceful shutdown completes in subsequent loop iterations. After the first term_signal.recv() returns false (non-abort), the loop continues but the graceful_fut has already been consumed. Consider restructuring to avoid the loop or properly handle completion.

  2. Service Manager Join Logic (service-manager/src/lib.rs:318-325):
    The join_fut is recreated on every loop iteration, which could be inefficient. More importantly, after services are aborted on shutdown, the join logic doesn't account for the fact that some handles have been removed from running_services.

  3. Timeout Calculation Edge Case (gasoline/src/worker.rs:315 & guard-core/src/server.rs:269):
    Using saturating_sub(shutdown_start.elapsed()) in a sleep means if the timeout has already expired, this will sleep for Duration::ZERO, causing a busy loop. Consider breaking out of the loop if the timeout has expired:

    let remaining = shutdown_duration.saturating_sub(shutdown_start.elapsed());
    if remaining.is_zero() {
        tracing::warn!("shutdown timed out");
        break;
    }
    tokio::select! {
        _ = tokio::time::sleep(remaining) => { /* timeout handling */ }
        // ...other select branches unchanged
    }

Minor

  1. Comment Style Inconsistency (config/runtime.rs:11-16):
    Use /// for doc comments instead of //:

    /// Time (in seconds) to allow for guard to wait for pending requests
    /// after receiving SIGTERM. Defaults to 1 hour.
    guard_shutdown_duration: Option<u32>,
    /// Whether or not to allow running the engine when the previous version
    /// that was run is higher than the current version.
    allow_version_rollback: Option<bool>,
  2. Unused Variable Pattern (gasoline/src/worker.rs:303 & service-manager/src/lib.rs:324):

    while let Some(_) = wf_futs.next().await {}

    Use Some(_res) or Some(()) to be more explicit, or just:

    while wf_futs.next().await.is_some() {}
  3. Removed Code in Pegboard (pegboard/src/workflows/actor/mod.rs:533-565):
    The removal of the allocation logic after destroy seems intentional based on the comment, but ensure this is tested to confirm that pending actors are still being allocated correctly from the handle_stopped path.

🔒 Security Concerns

  • No Critical Issues: The signal handling changes don't introduce security vulnerabilities.
  • The abortion of non-graceful services is appropriate and prevents hanging.

⚡ Performance Considerations

  1. Watch Channel Broadcasting: Using watch::channel for signal distribution is efficient for the broadcast pattern.
  2. FuturesUnordered: Good choice for concurrent workflow completion checking.
  3. AtomicBool: Appropriate use of atomics for shutdown state in service-manager.

🧪 Test Coverage

  • Missing: No tests found for the new TermSignal implementation.
  • Recommendation: Add unit tests for:
    • Signal counting and force shutdown threshold
    • Multiple subscribers receiving signals
    • stop() cleanup behavior

📝 Documentation

  • Add module-level documentation to term_signal.rs explaining the singleton pattern and multi-subscriber behavior.
  • Consider adding examples of proper usage in the module docs.

🎯 Recommendations

  1. Fix the critical bug in worker shutdown logging (line 326).
  2. Fix the signal handler resource leak in TermSignal::stop().
  3. Simplify the guard shutdown loop to handle completion correctly.
  4. Add timeout checks before sleep operations to prevent busy loops.
  5. Add unit tests for TermSignal.
  6. Fix doc comment style in runtime.rs.

Overall, this is a valuable improvement to shutdown handling. The main issues are fixable and mostly related to edge cases in the shutdown logic. Once the critical bugs are addressed, this will significantly improve the runtime's reliability during restarts and deployments.

@claude
Copy link

claude bot commented Nov 13, 2025

PR Review: Improved SIGTERM Handling

I've reviewed this PR and the improvements to SIGTERM handling across the runtime look solid overall. Here's my detailed feedback:


Strengths

  1. Centralized Signal Handling: The new TermSignal implementation in rivet-runtime provides a clean, singleton-based approach that prevents duplicate signal handlers across the codebase.

  2. Configurable Shutdown Durations: Moving shutdown timeouts to configuration (runtime.rs) is excellent - it gives operators control over graceful shutdown behavior without code changes.

  3. Progressive Signal Handling: The 3-signal threshold for forced shutdown provides a good escape hatch if graceful shutdown hangs.

  4. Cross-Platform Support: Windows Ctrl+C and Unix signal handling are both properly implemented.

  5. Improved Worker Shutdown: The gasoline worker shutdown logic using FuturesUnordered is much cleaner than the previous polling approach.


🐛 Critical Issues

1. Bug in worker.rs:322-326 - Inconsistent workflow count

let remaining_workflows = wf_futs.into_iter().count();
if remaining_workflows == 0 {
    tracing::info!("all workflows evicted");
} else {
    tracing::warn!(remaining_workflows=?self.running_workflows.len(), "not all workflows evicted");
}

Problem: You're logging self.running_workflows.len() but computing remaining_workflows from wf_futs. These may differ. Use remaining_workflows consistently:

tracing::warn!(remaining_workflows, "not all workflows evicted");

2. Comment typo in runtime.rs:11-12, 14-15

/// Time (in seconds) to allow for guard to wait for pending requests after receiving SIGTERM. Defaults
// to 1 hour.          <-- Single slash instead of triple slash
guard_shutdown_duration: Option<u32>,
/// Whether or not to allow running the engine when the previous version that was run is higher than
// the current version. <-- Single slash instead of triple slash
allow_version_rollback: Option<bool>,

These should use /// for doc comments, not //.

3. Potential race condition in TermSignal::stop()

pub fn stop() {
    if let Some((_, join_handle)) = HANDLER_CELL.get() {
        join_handle.abort();
    }
}

If stop() is called while get_or_init() is running in another task, this could miss the handler. Consider using synchronization or documenting that this should only be called during final cleanup.


⚠️ Potential Issues

4. Unused variable in worker.rs:303

let join_fut = async { while let Some(_) = wf_futs.next().await {} };

Clippy will warn about Some(_). Use:

while wf_futs.next().await.is_some() {}

5. Missing error context in term_signal.rs:88

.expect("failed initializing termination signal handler")

This panic will lose the underlying error. Consider:

.unwrap_or_else(|e| panic!("failed initializing termination signal handler: {e:?}"))

6. service-manager/Cargo.toml formatting (line 360)

The file ends without a newline. Add one for consistency.


💡 Suggestions

7. Guard shutdown timeout might be too aggressive

The 1-hour default for guard_shutdown_duration seems very long. For most deployments, 30-60 seconds is more typical. Consider:

  • Documenting why 1 hour was chosen
  • Whether this is intended for long-running uploads/downloads
  • If most users should override this

8. Missing timeout logging in guard server

The guard shutdown loop (guard-core/src/server.rs:231-274) doesn't log periodic progress. For a 1-hour timeout, consider logging every 10-30 seconds to show the system is responsive.

9. Service shutdown behavior documentation

The requires_graceful_shutdown flag is important but not well-documented. Add doc comments to Service::new() explaining:

  • What "graceful shutdown" means in this context
  • When to use true vs false
  • Examples: guard/workflow_worker need graceful (true), bootstrap doesn't (false)

10. Workflow join timeout calculation

In worker.rs:315:

_ = tokio::time::sleep(shutdown_duration.saturating_sub(shutdown_start.elapsed())) => {

If elapsed() >= shutdown_duration, this becomes sleep(Duration::ZERO) which completes immediately - correct behavior, but might fire on the same iteration as the actual timeout. Consider checking elapsed time before the select.


📝 Style & Convention Issues

11. Import grouping in worker.rs

use futures_util::{StreamExt, stream::FuturesUnordered};

On closer inspection this already follows Rust/rustfmt convention for nested imports, so no change is needed here.

12. Log message capitalization

Following the CLAUDE.md pattern ("Log messages should be lowercase unless mentioning specific code symbols"):

  • ✅ Correct: tracing::info!("starting worker shutdown")
  • Some messages already follow this, good consistency

🔒 Security Considerations

No security issues identified. The SIGTERM handling appropriately:

  • Prevents signal handler races with singleton pattern
  • Provides forced shutdown escape hatch
  • Marks workers inactive before shutdown

🧪 Test Coverage

Concern: This PR doesn't add tests for the new SIGTERM handling logic. Consider adding:

  1. Unit tests for TermSignal (mock signal delivery)
  2. Integration tests for graceful shutdown (timeout scenarios)
  3. Tests verifying requires_graceful_shutdown behavior

Without tests, regressions in shutdown logic are hard to catch.


📊 Summary

Category           Count
Critical bugs      2
Potential issues   4
Suggestions        4
Style issues       1

Recommendation: Fix the critical bugs (inconsistent workflow count logging, doc comment syntax) before merging. The other issues are nice-to-haves but not blockers.

Overall, this is a substantial improvement to runtime reliability. Great work centralizing the signal handling! 🎉

@NathanFlurry NathanFlurry force-pushed the 11-10-fix_improve_sigterm_handling_for_the_entire_runtime branch from 2990334 to 0688959 Compare November 13, 2025 22:35
@NathanFlurry NathanFlurry force-pushed the 11-07-feat_gas_implement_smart_pulling_for_horizontally_scaled_workers branch from 4dad8e2 to 3df125e Compare November 13, 2025 22:35
@claude
Copy link

claude bot commented Nov 13, 2025

Code Review: Improved SIGTERM Handling for Runtime

This PR significantly improves the graceful shutdown behavior across the Rivet engine. The changes are well-structured and address an important operational concern. Here's my detailed feedback:

✅ Strengths

  1. Centralized Signal Handling: The new TermSignal implementation in rivet-runtime provides a clean, reusable abstraction for handling termination signals across all services. The use of a watch channel to broadcast signals to multiple subscribers is elegant.

  2. Configurable Shutdown Timeouts: Adding Runtime config with separate timeouts for worker and guard is excellent for production flexibility. The defaults (30s for workers, 1h for guard) are reasonable.

  3. Cross-Platform Support: Proper handling of both Unix (SIGTERM/SIGINT) and Windows (Ctrl+C) signals with conditional compilation.

  4. Graceful Degradation: The 3-signal threshold before forcing shutdown gives operators a way to override graceful shutdown when needed.

  5. Better Service Management: The requires_graceful_shutdown flag allows fine-grained control over which services need time to shut down cleanly.

🐛 Potential Issues

  1. Bug in worker.rs:326 - Inconsistent variable usage:

    let remaining_workflows = wf_futs.into_iter().count();
    if remaining_workflows == 0 {
        tracing::info!("all workflows evicted");
    } else {
        tracing::warn!(remaining_workflows=?self.running_workflows.len(), "not all workflows evicted");
        // ^^^ Should use 'remaining_workflows' instead of 'self.running_workflows.len()'
    }
  2. Potential Race Condition in TermSignal: The TermSignalHandler::run() loop (lines 51-78) could theoretically miss signals if multiple arrive rapidly, though this is unlikely in practice. The count tracking is not atomic, but since it runs in a single task, this is probably fine.

  3. Missing Error Context: In term_signal.rs:88, using .expect() will panic if signal handlers can't be registered. While rare, this could happen in constrained environments. Consider returning a Result from TermSignal::new() instead.

  4. Inconsistent Comment Style:

    • runtime.rs:11 uses // instead of /// for doc comments
    • runtime.rs:14 same issue

    Should be:

    /// Time (in seconds) to allow for guard to wait for pending requests...
    /// Whether or not to allow running the engine when...
    

⚠️ Potential Issues

  1. Shutdown Logic in gasoline/worker.rs (lines 294-320):

    • The join_fut async block at line 303 iterates with a pattern match that ignores the result: while let Some(_) = wf_futs.next().await {}
    • This means workflow task failures are silently ignored during shutdown
    • Consider logging if workflows panic/fail during shutdown
  2. Service Manager Loop at line 324: Similar issue - while let Some(_) = handle_futs.next().await {} ignores service task results during shutdown.

  3. Cron Dummy Task (service-manager:307): std::future::pending() creates a task that never completes. This is intentional but could be confusing. Consider adding a comment explaining why this exists.

  4. Removed Code from pegboard/workflows/actor/mod.rs: The PR removes the pending actor allocation logic from the error path (lines 546-566 deleted). The comment at line 594 explains this should happen in handle_stopped instead, but ensure this doesn't break actor scheduling when actors fail before being allocated.

🔒 Security Considerations

  1. No Credential Handling: The changes don't introduce security issues. Signal handling is properly isolated.

  2. Resource Cleanup: The shutdown logic properly marks workers as inactive and attempts to clean up running workflows, which is good for preventing resource leaks.

⚡ Performance Considerations

  1. FuturesUnordered: Good choice for waiting on multiple tasks concurrently (worker.rs:294, service-manager.rs:319).

  2. Watch Channel Overhead: The broadcast nature of watch channels is efficient for signal notification, but each TermSignal::new() call clones the receiver, which is cheap (just an Arc clone).

  3. Sleep Duration Calculation: The use of saturating_sub() at lines 315 and 349 is correct and prevents panic from underflow.

📝 Test Coverage

⚠️ No tests found for the new TermSignal functionality. Consider adding:

  • Unit tests for signal counting logic
  • Tests for graceful shutdown timeout behavior
  • Integration tests for service manager shutdown sequencing

📖 Documentation

  • Consider adding module-level docs to term_signal.rs explaining the design (singleton handler, watch channel broadcast, force shutdown threshold)
  • The Runtime config struct could use an example in its module docs

🎯 Code Style & Conventions

✅ Follows Rust conventions and repository patterns
✅ Proper use of tracing for structured logging
✅ Lowercase log messages as per CLAUDE.md
✅ Hard tabs used correctly
✅ Imports added at file top (not inline)

Recommendations

  1. Fix the bug in worker.rs:326 - Use remaining_workflows consistently
  2. Change // to /// for doc comments in runtime.rs
  3. Add test coverage for TermSignal, especially the 3-signal force shutdown logic
  4. Consider logging workflow failures during shutdown instead of silently ignoring them
  5. Add error handling instead of .expect() in TermSignal::new()
  6. Document the cron dummy task with an inline comment explaining its purpose

Overall Assessment

This is a solid improvement to the runtime's reliability and operational characteristics. The centralized signal handling and configurable timeouts are valuable additions. With the minor bug fixes and test coverage additions, this will be production-ready.

Recommendation: Approve with requested changes (fix bug + add tests).

@claude
Copy link

claude bot commented Nov 13, 2025

PR Review: Improve SIGTERM Handling for the Entire Runtime

This is a well-structured PR that significantly improves graceful shutdown handling across the Rivet runtime. The changes introduce configurable shutdown durations and a centralized signal handling mechanism.


Strengths

1. Centralized Signal Handling

The new TermSignal implementation (engine/packages/runtime/src/term_signal.rs) is excellent:

  • Uses a singleton pattern with OnceCell to ensure a single global signal handler
  • Cross-platform support (Unix SIGTERM/SIGINT and Windows Ctrl+C)
  • Implements a force-close mechanism after 3 signals
  • Uses watch::channel for efficient broadcast to multiple subscribers

2. Configuration Management

New Runtime config module provides good flexibility:

  • Configurable shutdown durations for different components
  • Sensible defaults (30s for workers, 1 hour for guard)
  • Maintains backward compatibility by moving allow_version_rollback into the runtime config

3. Improved Worker Shutdown (engine/packages/gasoline/src/worker.rs)

  • Properly signals all running workflows to stop
  • Waits for workflows to complete with timeout
  • Handles abort signals during graceful shutdown
  • Good use of FuturesUnordered to track multiple workflow tasks

4. Guard Server Improvements (engine/packages/guard-core/src/server.rs)

  • Integrates with hyper's GracefulShutdown properly
  • Allows pending requests to complete before shutdown
  • Handles multi-SIGTERM scenarios correctly

5. Service Manager Enhancements (engine/packages/service-manager/src/lib.rs)

  • Introduces requires_graceful_shutdown flag per service
  • Uses AtomicBool to coordinate shutdown across services
  • Aborts non-critical services immediately while allowing critical ones to finish

Issues & Concerns

🐛 Critical: Inconsistent Remaining Workflows Count

Location: engine/packages/gasoline/src/worker.rs:326

let remaining_workflows = wf_futs.into_iter().count();
if remaining_workflows == 0 {
    tracing::info!("all workflows evicted");
} else {
    tracing::warn!(remaining_workflows=?self.running_workflows.len(), "not all workflows evicted");
}

Issue: The log statement uses self.running_workflows.len() instead of the remaining_workflows variable. This will show incorrect counts.

Fix:

tracing::warn!(?remaining_workflows, "not all workflows evicted");

⚠️ Bug: Redundant FuturesUnordered Iteration

Location: engine/packages/gasoline/src/worker.rs:303

let join_fut = async { while let Some(_) = wf_futs.next().await {} };

Issue: After the loop completes or breaks, you still call wf_futs.into_iter().count() at line 322. However, if join_fut completes naturally, wf_futs will be exhausted and the count will be 0 even if workflows were aborted earlier.

Suggested Fix: Track the count before the select loop or check is_finished() on the handles directly.


💡 Comment Inconsistency

Location: engine/packages/config/src/config/runtime.rs:11-15

Issue: Inconsistent comment style - lines 11-12 use /// (doc comments), but lines 14-15 use // (regular comments).

Fix: Use /// consistently for all field documentation:

/// Time (in seconds) to allow for guard to wait for pending requests after receiving SIGTERM. Defaults
/// to 1 hour.
guard_shutdown_duration: Option<u32>,
/// Whether or not to allow running the engine when the previous version that was run is higher than
/// the current version.
allow_version_rollback: Option<bool>,

🔍 Potential Issue: Guard HTTPS Connections Don't Use Graceful Shutdown

Location: engine/packages/guard-core/src/server.rs:202-203

// Serve the connection (no graceful shutdown in spawned task)
if let Err(err) = conn_server.serve_connection_with_upgrades(io, service).await {

Issue: The comment explicitly states HTTPS connections don't participate in graceful shutdown. This means HTTPS requests might be abruptly terminated during shutdown.

Recommendation: Either:

  1. Document this limitation clearly in the PR description
  2. Or refactor to include HTTPS connections in graceful shutdown tracking

📝 Minor: Removed Actor Allocation Logic

Location: engine/packages/pegboard/src/workflows/actor/mod.rs:533-567

The PR removes actor allocation logic from the workflow. The new comment states this happens elsewhere, but:

  • It's unclear why this was moved in a PR focused on signal handling
  • The comment at line 534 says "At this point, the actor is not allocated so no cleanup related to alloc idx/desired slots needs to be done" - this suggests the logic may have been premature rather than moved

Recommendation: Clarify in the PR description why this change was included.


🎯 Code Style: Log Message Casing

Per CLAUDE.md conventions, log messages should be lowercase unless referencing code symbols.

Good examples in this PR:

  • "starting worker shutdown"
  • "accept error on HTTP port"

Inconsistent example:

  • Line 69: "received SIGTERM" - should this be "received sigterm" since SIGTERM is a signal name, not a code symbol?

Performance Considerations

Positive:

  • ✅ Reuses signal handler across multiple TermSignal instances
  • ✅ Uses watch::channel for efficient broadcast
  • ✅ Properly uses FuturesUnordered for concurrent workflow tracking

Potential Concern:

The guard shutdown timeout defaults to 1 hour. While this protects long-running requests, it could delay deployments significantly. Consider:

  • Adding metrics to track actual shutdown durations
  • Documenting the trade-offs in configuration docs

Security Considerations

No security concerns identified. The changes improve reliability which indirectly enhances security by ensuring proper cleanup during shutdowns.


Test Coverage

⚠️ No test changes included in this PR. Recommended additions:

  1. Unit tests for TermSignal singleton behavior
  2. Integration tests for worker shutdown with varying timeouts
  3. Tests for the force-close behavior (3+ signals)
  4. Tests for service-manager's graceful vs. non-graceful service handling

Code Quality

Overall: High quality

Strengths:

  • Clear separation of concerns
  • Good use of Rust async patterns
  • Comprehensive logging
  • Cross-platform compatibility

Minor improvements needed:

  • Fix the logging inconsistency in worker.rs:326
  • Consistent documentation comment style
  • Clarify HTTPS graceful shutdown limitation

Recommendations

Before Merge:

  1. ✅ Fix the remaining_workflows logging bug
  2. ✅ Fix comment style inconsistency in runtime.rs
  3. 📝 Document why actor allocation logic was removed
  4. 📝 Clarify HTTPS connection handling during shutdown

Post-Merge:

  1. Add integration tests for shutdown scenarios
  2. Add metrics for shutdown duration tracking
  3. Consider making guard shutdown timeout environment-aware (shorter in dev)

Summary

This is a solid improvement to runtime reliability. The centralized TermSignal implementation and configurable timeouts are well-designed. With the bugs fixed and clarifications added, this will significantly improve the shutdown experience.

Recommendation: Request changes for the critical logging bug, then approve.

Great work! 🚀


Generated with Claude Code

@NathanFlurry NathanFlurry force-pushed the 11-07-feat_gas_implement_smart_pulling_for_horizontally_scaled_workers branch from 3df125e to ff8b85c Compare November 13, 2025 22:42
@NathanFlurry NathanFlurry force-pushed the 11-10-fix_improve_sigterm_handling_for_the_entire_runtime branch from 0688959 to 4fa0007 Compare November 13, 2025 22:42
@graphite-app
Copy link
Contributor

graphite-app bot commented Nov 13, 2025

Merge activity

@claude
Copy link

claude bot commented Nov 13, 2025

PR Review: Improved SIGTERM Handling

This PR makes significant improvements to how the runtime handles termination signals. Overall, the changes are well-structured and address important operational concerns. Here's my detailed review:


✅ Strengths

1. Centralized Signal Handling
The new TermSignal implementation (runtime/src/term_signal.rs) is excellent:

  • Singleton pattern via OnceCell prevents multiple signal handlers
  • Cross-platform support (Unix SIGTERM/SIGINT + Windows Ctrl+C)
  • Force-close threshold (3 signals) is a great UX improvement
  • Watch channel pattern allows multiple subscribers efficiently

2. Configurable Shutdown Durations
New runtime config (config/src/config/runtime.rs) is well-designed:

  • Sensible defaults (30s for workers, 1h for guard)
  • Proper encapsulation with getter methods returning Duration
  • Migrated allow_version_rollback logically into runtime config

3. Improved Worker Shutdown (gasoline/src/worker.rs:268-330)

  • Properly signals all workflows before waiting
  • Uses FuturesUnordered to wait for completion efficiently
  • Timeout calculation with saturating_sub prevents panics
  • Clear logging throughout the shutdown process

4. Guard Graceful Shutdown (guard-core/src/server.rs:252-278)

  • Waits for pending HTTP requests via hyper's graceful shutdown
  • Respects configurable timeout
  • Can abort early on repeated signals

🐛 Issues Found

1. Bug in Worker Shutdown (gasoline/src/worker.rs:322-326)

let remaining_workflows = wf_futs.into_iter().count();
if remaining_workflows == 0 {
    tracing::info!("all workflows evicted");
} else {
    tracing::warn!(remaining_workflows=?self.running_workflows.len(), "not all workflows evicted");
}

Issues:

  • wf_futs.into_iter().count() returns 0 whenever join_fut ran to completion, because the loop already consumed the futures
  • The warning log uses self.running_workflows.len() which hasn't been updated
  • In that case the code logs "all workflows evicted" even when workflows were aborted earlier

Fix needed:

// After the select loop, any futures left in the set did not complete in time
let remaining_workflows = wf_futs.len();
if remaining_workflows == 0 {
    tracing::info!("all workflows evicted");
} else {
    tracing::warn!(remaining_workflows, "not all workflows evicted");
}

2. Missing Comment Style Fix (config/src/config/runtime.rs:11-16)

/// Time (in seconds) to allow for guard to wait for pending requests after receiving SIGTERM. Defaults
// to 1 hour.                                      ← Should be /// not //
guard_shutdown_duration: Option<u32>,
/// Whether or not to allow running the engine when the previous version that was run is higher than
// the current version.                            ← Should be /// not //
allow_version_rollback: Option<bool>,

The continuation lines should use /// for proper doc comments.


⚠️ Potential Issues

1. Missing Imports (gasoline/src/worker.rs:7)

use futures_util::{StreamExt, stream::FuturesUnordered};

Per CLAUDE.md guidelines, imports should not use inline module paths within functions. This import looks correct (it's at the top), but verify futures-util is in the workspace dependencies.

2. Service Manager Shutdown Race (service-manager/src/lib.rs:344-349)

if abort {
    // Give time for services to handle final abort
    tokio::time::sleep(Duration::from_millis(50)).await;
    rivet_runtime::shutdown().await;
    break;
}

The 50ms sleep seems arbitrary. Consider if this is sufficient or if services should have a configurable abort grace period.

3. Removed Functionality (pegboard/src/workflows/actor/mod.rs:533-548)
The PR removes the "allocate pending actors" logic from the non-allocated failure path but keeps it in handle_stopped. The comment at line 536 says "no cleanup related to alloc idx/desired slots needs to be done" which seems correct, but verify this doesn't break actor scheduling in edge cases.


📝 Style & Convention Issues

1. Error Import Pattern (guard-core/src/server.rs:10)

use anyhow::Result;

Good! This follows the CLAUDE.md guideline to not glob import from anyhow.

2. Logging Style (guard-core/src/server.rs:147,234)

tracing::debug!(?err, "accept error on HTTP port");
tracing::debug!(?err, "accept error on HTTPS port");

✅ Correctly uses lowercase messages per CLAUDE.md

3. Inconsistent Error Handling (service-manager/src/lib.rs:254)

tracing::error!(cron=%service.name, "cron failed all restart attempts");

Should this exit or continue? It logs an error but then falls through without any recovery.


🔒 Security Concerns

No security issues identified. The signal handling is sound and doesn't introduce vulnerabilities.


⚡ Performance Considerations

1. FuturesUnordered Usage
Both worker and service manager use FuturesUnordered correctly for concurrent waiting. Good choice.

2. Shutdown Duration Calculation

tokio::time::sleep(shutdown_duration.saturating_sub(shutdown_start.elapsed()))

Using saturating_sub is correct to prevent underflow panics, though if the duration has already elapsed, this creates a completed future that triggers immediately - which is the desired behavior. ✅


🧪 Test Coverage

The PR doesn't include new tests for the TermSignal implementation. Consider adding:

  • Unit tests for force-close threshold behavior
  • Integration tests for graceful vs forced shutdown scenarios
  • Tests for the timeout behavior in worker/guard shutdown

📊 Summary

Must Fix:

  1. Worker shutdown remaining workflows counting bug (line 322)
  2. Doc comment style in runtime config (lines 11-16)

Should Consider:

  1. Add tests for TermSignal behavior
  2. Review 50ms abort grace period in service manager
  3. Verify pegboard actor allocation removal is correct

Nice to Have:

  1. Handle cron failure case more explicitly

The core architecture is solid and this is a valuable improvement to operational reliability. Once the counting bug is fixed, this should be good to merge.


Review generated by Claude Code

@NathanFlurry NathanFlurry changed the base branch from 11-07-feat_gas_implement_smart_pulling_for_horizontally_scaled_workers to graphite-base/3443 November 13, 2025 22:45
@NathanFlurry NathanFlurry force-pushed the 11-10-fix_improve_sigterm_handling_for_the_entire_runtime branch from 4fa0007 to 2ac1b8e Compare November 13, 2025 22:46
@NathanFlurry NathanFlurry changed the base branch from graphite-base/3443 to main November 13, 2025 22:46
@NathanFlurry NathanFlurry merged commit 7f53a33 into main Nov 13, 2025
6 of 10 checks passed
@NathanFlurry NathanFlurry deleted the 11-10-fix_improve_sigterm_handling_for_the_entire_runtime branch November 13, 2025 22:47
@claude
Copy link

claude bot commented Nov 13, 2025

Pull Request Review: Improve SIGTERM Handling

I've reviewed this PR that improves SIGTERM handling across the Rivet runtime. Overall, this is a well-structured improvement that centralizes signal handling and adds configurable graceful shutdown behavior. Here's my detailed feedback:

✅ Strengths

  1. Centralized Signal Handling: The new TermSignal implementation in rivet-runtime is a great architectural improvement. Using a global OnceCell with a watch channel ensures all components receive shutdown signals consistently.

  2. Cross-Platform Support: Good handling of both Unix (SIGTERM/SIGINT) and Windows (Ctrl+C) signals.

  3. Configurable Timeouts: Adding worker_shutdown_duration and guard_shutdown_duration to the config provides operators with control over graceful shutdown behavior.

  4. Graceful Shutdown Logic: The shutdown sequences in both worker and guard properly wait for in-flight work to complete before terminating.

  5. Service Manager Improvements: The distinction between services requiring graceful shutdown vs. those that can be aborted immediately is a good design pattern.

🐛 Potential Issues

Critical: Bug in worker.rs:322-326

let remaining_workflows = wf_futs.into_iter().count();
if remaining_workflows == 0 {
    tracing::info!("all workflows evicted");
} else {
    tracing::warn!(remaining_workflows=?self.running_workflows.len(), "not all workflows evicted");
}

Issue: After into_iter().count(), the FuturesUnordered is consumed, but then the warning logs self.running_workflows.len() instead of remaining_workflows. This could log incorrect information.

Fix: Use remaining_workflows consistently:

tracing::warn!(remaining_workflows, "not all workflows evicted");

Medium: Inconsistent Comment Style in runtime.rs:11-15

The continuation lines use // while the opening lines use ///, leaving the doc comments inconsistent. Use /// throughout:

/// Time (in seconds) to allow for guard to wait for pending requests after receiving SIGTERM. 
/// Defaults to 1 hour.
guard_shutdown_duration: Option<u32>,
/// Whether or not to allow running the engine when the previous version that was run is higher than
/// the current version.
allow_version_rollback: Option<bool>,

⚠️ Design Concerns

1. Force Shutdown Threshold

FORCE_CLOSE_THRESHOLD = 3 in term_signal.rs:14 means users need to send SIGTERM 3 times to force abort. This might be unintuitive. Consider:

  • Documenting this behavior
  • Making it configurable
  • Or reducing to 2 (first signal = graceful, second = force)

2. Error Handling in TermSignal::new()

let term_signal = TermSignalHandler::new()
    .expect("failed initializing termination signal handler");

(term_signal.rs:88)

This panics if signal handler setup fails. While rare, this could be more gracefully handled, especially for testing scenarios.

3. Shutdown Duration Edge Cases

In worker.rs:315 and guard.rs:269:

_ = tokio::time::sleep(shutdown_duration.saturating_sub(shutdown_start.elapsed())) => {

If shutdown_start.elapsed() exceeds shutdown_duration, saturating_sub returns 0, causing immediate timeout. While this works, it could race with the actual completion. Consider checking if we're already past the deadline before entering the loop.

🔒 Security Considerations

  • Good: Signal handling doesn't introduce privilege escalation risks
  • Good: Shutdown timeouts prevent indefinite hangs
  • Consideration: Ensure that force-abort (3x SIGTERM) properly cleans up sensitive resources

🎯 Performance Considerations

  1. Watch Channel Overhead: Using a watch channel for signal broadcast is efficient and appropriate for this use case.

  2. FuturesUnordered in Shutdown: Good choice for waiting on multiple tasks. However, in worker.rs:303, reconstructing the future in every loop iteration might be inefficient:

    let join_fut = async { while let Some(_) = wf_futs.next().await {} };

    This is fine since it breaks on first completion, but the comment should clarify this is intentional.

🧪 Test Coverage

Missing: This PR doesn't add tests for the new shutdown behavior. Consider adding:

  • Unit tests for TermSignal behavior
  • Integration tests verifying graceful shutdown timeouts
  • Tests for force-abort behavior (3x SIGTERM)
  • Tests for the requires_graceful_shutdown flag in service-manager

📝 Code Quality

  1. Logging: Good use of structured logging throughout. Log messages follow the repository's lowercase convention (e.g., "starting worker shutdown").

  2. Import Organization: ✅ Imports are properly organized at the top of files.

  3. Unused Import: In gasoline/worker.rs:8, there's:

    use tokio::{sync::watch, task::JoinHandle};

    But watch is no longer used after removing the shutdown_rx parameter from start(). Can be cleaned up.

📋 Additional Suggestions

  1. Documentation: Add doc comments to TermSignal methods explaining:

    • The force-abort threshold behavior
    • That recv() returns true for forced shutdown
    • Thread-safety guarantees
  2. Pegboard Workflow Comment: The comment move in pegboard/src/workflows/actor/mod.rs:594-597 is good - it makes more sense in the context where allocation actually happens.

  3. Configuration Validation: Consider adding validation in Runtime to ensure shutdown durations are reasonable (e.g., >= 1 second, <= 24 hours).
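To make the first suggestion concrete, the doc comments might read as follows. The struct below is a std-only stub standing in for the real tokio-based TermSignal; the names, the counting logic, and the threshold constant are assumptions, not the actual API:

```rust
/// Receives termination signals broadcast by the runtime's singleton
/// signal handler.
pub struct TermSignal {
    sigterm_count: u32,
}

impl TermSignal {
    /// Number of SIGTERMs after which the runtime force-aborts instead of
    /// continuing to wait for graceful shutdown.
    pub const FORCE_ABORT_THRESHOLD: u32 = 3;

    /// Waits for the next termination signal.
    ///
    /// Returns `true` once the force-abort threshold is reached, telling
    /// the caller to abort immediately rather than shut down gracefully.
    /// Each receiver observes the same broadcast state, so this is safe to
    /// call concurrently from multiple tasks.
    pub fn recv(&mut self) -> bool {
        // Stub: the real method awaits a watch channel; this just counts.
        self.sigterm_count += 1;
        self.sigterm_count >= Self::FORCE_ABORT_THRESHOLD
    }
}

fn main() {
    let mut sig = TermSignal { sigterm_count: 0 };
    assert!(!sig.recv()); // 1st SIGTERM: graceful shutdown begins
    assert!(!sig.recv()); // 2nd: still graceful
    assert!(sig.recv()); // 3rd: force abort
    println!("doc sketch ok");
}
```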

🎬 Summary

This is a solid improvement to shutdown handling. The main issues are:

  • Fix the bug in worker.rs:326 (wrong variable logged)
  • Add test coverage for the new shutdown behavior
  • Consider documentation for the force-abort threshold

The architecture is sound and the implementation follows Rivet's conventions well. Nice work on improving reliability!


Generated with Claude Code

@claude

claude bot commented Nov 13, 2025

Code Review: Improve SIGTERM Handling for the Entire Runtime

Overview

This PR significantly improves graceful shutdown handling across the runtime by centralizing signal handling and adding configurable shutdown durations. The implementation is solid overall, but I've identified several issues that should be addressed.


Critical Issues

1. Incorrect Remaining Workflows Count (engine/packages/gasoline/src/worker.rs:326)

Severity: High

let remaining_workflows = wf_futs.into_iter().count();
if remaining_workflows == 0 {
    tracing::info!("all workflows evicted");
} else {
    tracing::warn!(remaining_workflows=?self.running_workflows.len(), "not all workflows evicted");
}

Problem: After consuming wf_futs with into_iter().count(), you're logging self.running_workflows.len() instead of the computed remaining_workflows variable. This will always show the original count, not the actual remaining workflows.

Fix:

tracing::warn!(remaining_workflows, "not all workflows evicted");

2. Inconsistent Error Import Pattern (engine/packages/service-manager/src/lib.rs:11)

Severity: Medium

According to CLAUDE.md:

Do not glob import (::*) from anyhow. Instead, import individual types and traits

The code currently uses:

use anyhow::{Context, Result, ensure};

This is correct! However, ensure this pattern is consistent throughout the codebase.


3. Typo in Documentation (engine/packages/config/src/config/runtime.rs:12)

Severity: Low

/// Time (in seconds) to allow for guard to wait for pending requests after receiving SIGTERM. Defaults
// to 1 hour.

Line 12 uses // instead of /// for the doc comment continuation. Should be:

/// Time (in seconds) to allow for guard to wait for pending requests after receiving SIGTERM. Defaults
/// to 1 hour.

Potential Bugs

4. Potential Race Condition in TermSignal Initialization (engine/packages/runtime/src/term_signal.rs:88)

Severity: Medium

let term_signal = TermSignalHandler::new()
    .expect("failed initializing termination signal handler");

Calling .expect() inside get_or_init will panic if signal handler initialization fails. While this is probably acceptable for signal handling (the process cannot continue without it), consider whether the error should propagate more gracefully, or at least document why a panic here is acceptable.

Recommendation: Add a comment explaining why panic on signal handler initialization is acceptable, or make TermSignal::new() return a Result that must be handled by the caller.
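A sketch of the Result-propagating alternative, with std-only stubs in place of the real handler (TermSignalHandler and init_term_signal are illustrative names):

```rust
use std::io;

struct TermSignalHandler;

impl TermSignalHandler {
    fn new() -> io::Result<Self> {
        // The real code registers a SIGTERM handler here; stubbed out.
        Ok(TermSignalHandler)
    }
}

// Returning Result lets the caller (e.g. main) log the error and exit
// cleanly instead of unwinding through a panic inside get_or_init.
fn init_term_signal() -> io::Result<TermSignalHandler> {
    TermSignalHandler::new()
}

fn main() {
    match init_term_signal() {
        Ok(_) => println!("signal handler installed"),
        Err(err) => {
            eprintln!("failed initializing termination signal handler: {err}");
            std::process::exit(1);
        }
    }
}
```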


5. Unused Error Handling (engine/packages/runtime/src/term_signal.rs:104)

Severity: Low

pub async fn recv(&mut self) -> bool {
    let _ = self.0.changed().await;
    *self.0.borrow()
}

You're silently ignoring errors from changed().await. While this might be intentional (receiver continues working even if sender is dropped), add a comment explaining this behavior.


6. Missing Newline at EOF (engine/packages/service-manager/Cargo.toml)

Severity: Low

The file is missing a newline at the end:

tracing.workspace = true
\ No newline at end of file

Should end with a newline per Unix conventions.


Design & Architecture

7. Graceful Shutdown Logic is Sound

The new centralized TermSignal handler with configurable shutdown durations is a significant improvement:

  • Properly handles multiple SIGTERM signals with force-abort threshold
  • Uses watch::channel for broadcast semantics
  • Singleton pattern via OnceCell prevents multiple handlers
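The singleton idea can be sketched with std's OnceLock standing in for tokio's OnceCell (illustrative names, not the PR's code):

```rust
use std::sync::OnceLock;

struct TermSignalHandler {
    installed: bool,
}

static HANDLER: OnceLock<TermSignalHandler> = OnceLock::new();

fn term_signal() -> &'static TermSignalHandler {
    // get_or_init runs the initializer at most once per process, so the
    // signal handler can never be registered twice.
    HANDLER.get_or_init(|| TermSignalHandler { installed: true })
}

fn main() {
    let a = term_signal();
    let b = term_signal();
    assert!(std::ptr::eq(a, b), "both calls return the same singleton");
    assert!(a.installed);
    println!("singleton ok");
}
```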

8. Good Configuration Design

The new Runtime config structure (engine/packages/config/src/config/runtime.rs) provides sensible defaults:

  • Worker shutdown: 30 seconds
  • Guard shutdown: 1 hour (appropriate for long-running HTTP requests)
  • Consolidates allow_version_rollback into the runtime config
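A hedged sketch of what those defaults might look like as a plain struct (field names, the rollback default, and the absence of serde attributes are assumptions about the real config):

```rust
use std::time::Duration;

struct Runtime {
    worker_shutdown_duration: Duration,
    guard_shutdown_duration: Duration,
    allow_version_rollback: bool,
}

impl Default for Runtime {
    fn default() -> Self {
        Runtime {
            worker_shutdown_duration: Duration::from_secs(30), // workers: 30s
            guard_shutdown_duration: Duration::from_secs(60 * 60), // guard: 1h
            allow_version_rollback: false, // assumed default
        }
    }
}

fn main() {
    let cfg = Runtime::default();
    assert_eq!(cfg.worker_shutdown_duration, Duration::from_secs(30));
    assert_eq!(cfg.guard_shutdown_duration, Duration::from_secs(3600));
    assert!(!cfg.allow_version_rollback);
    println!("defaults ok");
}
```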

9. Service Manager Improvements

The requires_graceful_shutdown flag on services allows fine-grained control over which services need to complete gracefully vs. which can be immediately aborted.


Performance Considerations

10. FuturesUnordered Usage

Good use of FuturesUnordered in gasoline/src/worker.rs:294-298 and service-manager/src/lib.rs:319-324 for concurrent shutdown of multiple workflows/services.

11. Shutdown Timeout Implementation

The shutdown timeout logic using saturating_sub is correct:

_ = tokio::time::sleep(shutdown_duration.saturating_sub(shutdown_start.elapsed())) => {

However, note that if shutdown_start.elapsed() exceeds shutdown_duration, this creates a zero-duration sleep that fires immediately. That is probably the intended behavior, but it is worth documenting.


Testing & Error Handling

12. Missing Tests

Severity: Medium

No tests are included for:

  • TermSignal behavior (especially the force-abort threshold)
  • Configuration parsing for Runtime struct
  • Graceful shutdown timeout behavior

Recommendation: Add unit tests for the new TermSignal module and integration tests for shutdown behavior.
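As a starting point, the timeout behavior can be exercised with a std-threads stand-in for the async runtime (finished_within and the simulated workflow are illustrative, not the real API):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Returns true if simulated work of `work` duration completes within
// `budget` (a hypothetical stand-in for the graceful-shutdown wait).
fn finished_within(budget: Duration, work: Duration) -> bool {
    let (done_tx, done_rx) = mpsc::channel();
    // Simulated workflow running on its own thread.
    thread::spawn(move || {
        thread::sleep(work);
        let _ = done_tx.send(());
    });
    // Wait up to the shutdown budget for the workflow to finish.
    done_rx.recv_timeout(budget).is_ok()
}

fn main() {
    // A workflow slower than the budget should time out...
    assert!(!finished_within(
        Duration::from_millis(50),
        Duration::from_millis(200)
    ));
    // ...while a fast one completes gracefully.
    assert!(finished_within(
        Duration::from_millis(200),
        Duration::from_millis(10)
    ));
    println!("timeout behavior verified");
}
```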

13. Error Logging Follows Conventions

Good adherence to CLAUDE.md logging patterns:

  • Using structured logging with tracing
  • Lowercase messages (e.g., "starting worker shutdown" not "Starting worker shutdown")
  • Proper use of field interpolation

Code Quality

14. Timestamp Naming Inconsistency

Per CLAUDE.md:

When storing timestamps, name them *_at with past tense verb. For example, created_at, destroyed_at.

The code uses shutdown_start and connection_start, which do not use the *_at suffix; however, that convention applies to persisted timestamps, and these are in-memory values, so this is fine.

15. Import Organization

Imports are properly organized at the top of files, not inline within functions (per CLAUDE.md).


Security Concerns

16. Signal Handler Security

The centralized signal handler properly prevents signal handler re-registration, which could be a security issue in some contexts. The singleton pattern is appropriate here.


Summary

Strengths:

  • Significant improvement to graceful shutdown handling
  • Well-structured configuration with sensible defaults
  • Good use of async primitives (watch::channel, FuturesUnordered)
  • Follows most codebase conventions

Required Changes:

  1. Fix incorrect remaining_workflows logging (Critical)
  2. Fix doc comment typo
  3. Add missing newline to Cargo.toml

Recommended Changes:

  1. Add tests for TermSignal and shutdown behavior
  2. Document panic behavior in signal handler initialization
  3. Document intentional error ignoring in recv()

Overall, this is a solid improvement to the codebase's shutdown handling. The architecture is sound, but please address the logging bug before merging.
