-
Notifications
You must be signed in to change notification settings - Fork 133
fix: improve sigterm handling for the entire runtime #3443
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: improve sigterm handling for the entire runtime #3443
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
3 Skipped Deployments
|
More templates
@rivetkit/actor
@rivetkit/cloudflare-workers
@rivetkit/core
@rivetkit/db
@rivetkit/framework-base
@rivetkit/next-js
@rivetkit/react
rivetkit
@rivetkit/sql-loader
@rivetkit/engine-runner
@rivetkit/engine-runner-protocol
commit: |
Pull Request ReviewThis PR improves SIGTERM handling across the runtime with a centralized signal handler and configurable graceful shutdown timeouts. Overall, the architecture is solid and the code quality is good. Here are my findings: ✅ Strengths
🐛 Bugs & IssuesCritical1. Bug in worker shutdown remaining_workflows count ( let remaining_workflows = wf_futs.into_iter().count();
if remaining_workflows == 0 {
tracing::info!("all workflows evicted");
} else {
tracing::warn!(remaining_workflows=?self.running_workflows.len(), "not all workflows evicted");
}
2. Potential infinite sleep in guard/worker shutdown loops ( _ = tokio::time::sleep(shutdown_duration.saturating_sub(shutdown_start.elapsed())) => {
Medium3. Missing error propagation in TermSignalHandler::new ( let term_signal = TermSignalHandler::new()
.expect("failed initializing termination signal handler");
4. Inconsistent comment style (
|
660df24 to
7aff577
Compare
Pull Request Review: Improve SIGTERM Handling for the Entire RuntimeSummaryThis PR improves graceful shutdown handling across the Rivet runtime by introducing a centralized Code Quality & Best PracticesStrengths ✅
Issues Found 🔴1. Critical: Incorrect remaining workflows count (gasoline/src/worker.rs:326)let remaining_workflows = wf_futs.into_iter().count();
if remaining_workflows == 0 {
tracing::info\!("all workflows evicted");
} else {
tracing::warn\!(remaining_workflows=?self.running_workflows.len(), "not all workflows evicted");
}Problem: You're logging Fix: Track remaining count before the loop or use 2. Comment style inconsistency (config/src/config/runtime.rs:11-16)/// Time (in seconds) to allow for guard to wait for pending requests after receiving SIGTERM. Defaults
// to 1 hour.
guard_shutdown_duration: Option<u32>,
/// Whether or not to allow running the engine when the previous version that was run is higher than
// the current version.
allow_version_rollback: Option<bool>,Problem: Doc comments should use Fix: Change 3. Potential race condition (service-manager/src/lib.rs:333-343)abort = term_signal.recv() => {
shutting_down.store(true, Ordering::SeqCst);
// Abort services that don't require graceful shutdown
running_services.retain(|(requires_graceful_shutdown, handle)| {
if \!requires_graceful_shutdown {
handle.abort();
}
*requires_graceful_shutdown
});Problem: Between setting Suggestion: Consider setting the flag and aborting in a single atomic operation, or at least document this behavior. 4. Missing error context (runtime/src/term_signal.rs:88)let term_signal = TermSignalHandler::new()
.expect("failed initializing termination signal handler");Problem: Per CLAUDE.md, we should avoid panics when possible and provide better error context. Suggestion: Since this is in an async init context, consider propagating the error or providing more detailed panic message about what might cause this (e.g., "failed initializing termination signal handler - this may indicate the process doesn't have signal handling permissions"). 5. Clippy warning potential (gasoline/src/worker.rs:303)let join_fut = async { while let Some(_) = wf_futs.next().await {} };Issue: Clippy will warn about Fix: Use 6. Shutdown duration calculation issue (gasoline/src/worker.rs:315)_ = tokio::time::sleep(shutdown_duration.saturating_sub(shutdown_start.elapsed())) => {Problem: If Better approach: Check if time is already elapsed before the select, or use a deadline-based approach with Security Concerns 🔒Low Risk Issues
Performance Considerations ⚡Good Improvements
Minor Concerns
Test Coverage 📊Concerning Gaps
Recommendations
Additional ObservationsDocumentation
Code RemovalThe removal of Pegboard Actor Change (pegboard/src/workflows/actor/mod.rs:533-567)Moving the "allocate pending actors" logic is a good refactoring - makes the shutdown path cleaner. The comment at line 597 explains the rationale well. RecommendationsMust Fix Before Merge 🔴
Should Fix 🟡
Nice to Have 🟢
Overall AssessmentThis is a solid improvement to the shutdown handling infrastructure. The centralization of signal handling and configurable timeouts are excellent architectural decisions. The main concerns are:
Recommendation: Request changes for the critical bugs, then approve once those are addressed. The testing gaps are concerning but not blocking if this is being monitored in production. Review generated with Claude Code |
7aff577 to
9f71238
Compare
ff49fad to
728f761
Compare
Pull Request Review: Improve SIGTERM Handling for the Entire RuntimeThis PR introduces a centralized termination signal handler and improves graceful shutdown across the runtime. Overall, this is a well-structured improvement to signal handling. Below are my observations and recommendations: ✅ Strengths
🐛 Potential Bugs & IssuesCritical Issues
Moderate Issues
🔒 Security Concerns
⚡ Performance Considerations
🧪 Test Coverage
📝 Code Quality & Best Practices
💡 Recommendations
📊 SummaryOverall Assessment: Good improvement to signal handling infrastructure, but needs bug fixes and test coverage before merging. Blockers:
Recommended Before Merge:
Nice to Have:
Let me know if you'd like me to elaborate on any of these points or if you have questions about the recommendations! |
PR Review: Improve SIGTERM Handling for the Entire RuntimeSummaryThis PR significantly improves graceful shutdown handling across the Rivet runtime. The changes are well-structured and follow good practices for distributed system lifecycle management. Positive Aspects ✅1. Excellent Architecture
2. Well-Implemented Shutdown Logic
3. Code Quality
Issues & Recommendations 🔍Critical Issues1. Bug in worker shutdown (gasoline/worker.rs:322-326) let remaining_workflows = wf_futs.into_iter().count();
if remaining_workflows == 0 {
tracing::info!("all workflows evicted");
} else {
tracing::warn!(remaining_workflows=?self.running_workflows.len(), "not all workflows evicted");
}Problem: After the loop, Fix: Track remaining workflows before the iterator is consumed: let remaining_workflows = wf_futs.len();
// ... loop ...
if remaining_workflows == 0 {
tracing::info!("all workflows evicted");
} else {
tracing::warn!(?remaining_workflows, "not all workflows evicted");
}2. Comment inconsistency (config/runtime.rs:11-12, 14-15) /// Time (in seconds) to allow for guard to wait for pending requests after receiving SIGTERM. Defaults
/// to 1 hour.
guard_shutdown_duration: Option<u32>,
/// Whether or not to allow running the engine when the previous version that was run is higher than
/// the current version.
allow_version_rollback: Option<bool>,Medium Priority Issues3. Timeout calculation edge case (gasoline/worker.rs:315) _ = tokio::time::sleep(shutdown_duration.saturating_sub(shutdown_start.elapsed())) => {If Consider: Break immediately if timeout already exceeded: let elapsed = shutdown_start.elapsed();
if elapsed >= shutdown_duration {
tracing::warn!("worker shutdown timed out");
break;
}
_ = tokio::time::sleep(shutdown_duration - elapsed) => {4. Missing newline at end of file ( Minor Issues5. Inconsistent import style ( use anyhow::Result;This imports a single type, while the CLAUDE.md emphasizes not to glob import from anyhow. This is good, but consider consistency - other files use 6. Windows signal handling ( Performance Considerations ⚡Positive:
No concerns about performance impact from these changes. Security Considerations 🔒Low Risk: This PR primarily affects shutdown behavior, which has minimal security implications. Positive: Force-close threshold prevents indefinite hangs that could be exploited in DoS scenarios. Test Coverage 🧪Observation: No new tests added for the shutdown behavior. While testing shutdown logic can be complex, consider:
Architecture Considerations 🏗️Question about pegboard changes (
Additional NotesGreat work on:
Documentation: Consider adding a section to docs about:
RecommendationApprove with minor fixes. Please address the critical bug in worker.rs:322-326 and the comment style in runtime.rs. The other issues are lower priority but worth considering for a follow-up PR. Great improvement to the runtime's operational reliability! 🎉 🤖 Generated with Claude Code |
9f71238 to
09cec26
Compare
728f761 to
09a68bc
Compare
Code Review for PR #3443: Improve SIGTERM HandlingI've reviewed this PR which improves graceful shutdown handling across the Rivet runtime. Overall, this is a solid improvement to the shutdown infrastructure with good architecture. Here are my findings: ✅ StrengthsArchitecture & Design
Code Quality
🐛 Issues FoundCritical1. Bug in tracing::warn!(remaining_workflows=?self.running_workflows.len(), "not all workflows evicted");Should use tracing::warn!(remaining_workflows=%remaining_workflows, "not all workflows evicted");Location: Medium Priority2. Documentation inconsistency in /// Time (in seconds) to allow for guard to wait for pending requests after receiving SIGTERM.
/// Defaults to 1 hour.
guard_shutdown_duration: Option<u32>,
/// Whether or not to allow running the engine when the previous version that was run is higher
/// than the current version.
allow_version_rollback: Option<bool>,Location: 3. Unused variable pattern in loops // Current
while let Some(_) = wf_futs.next().await {}
// Suggested
while wf_futs.next().await.is_some() {}Locations:
4. Potential race condition in shutdown flag Location: 💡 SuggestionsPerformance5. FuturesUnordered efficiency let remaining_workflows = wf_futs.into_iter().count();Consider tracking this separately or using Code Style6. Import organization 7. Consistent error messages
🔒 Security Considerations8. Signal handler initialization
9. Watch channel unbounded growth ⚡ Performance Notes10. Shutdown timing precision 11. Task spawning overhead 📋 Test CoverageThe PR doesn't include new tests. Consider adding:
🎯 SummaryOverall Assessment: This is a well-architected improvement that meaningfully enhances the runtime's shutdown behavior. The centralized signal handling is a solid pattern. Required Changes:
Recommended Changes:
Estimated Risk: Low - The changes are well-isolated and improve existing functionality without breaking APIs. Great work on improving the reliability of the shutdown sequence! 🚀 |
Code Review: Improved SIGTERM HandlingThis PR significantly improves the shutdown behavior across the Rivet runtime. The changes are well-structured and demonstrate a thoughtful approach to graceful shutdown handling. Below is my comprehensive review: ✅ Strengths
🐛 Potential IssuesCritical: Bug in Worker Shutdown (line 326)tracing::warn!(remaining_workflows=?self.running_workflows.len(), "not all workflows evicted");This references Fix: tracing::warn!(?remaining_workflows, "not all workflows evicted");Logic Issue: Timeout Calculation (lines 315, 269)The tokio::time::sleep(shutdown_duration.saturating_sub(shutdown_start.elapsed()))If processing takes longer than expected before entering the loop, this could timeout immediately. Consider checking if time remains before entering the sleep: let remaining = shutdown_duration.saturating_sub(shutdown_start.elapsed());
if remaining.is_zero() {
tracing::warn!("worker shutdown timed out");
break;
}
_ = tokio::time::sleep(remaining) => { ... }Minor: Inconsistent Comment Style (lines 11-12, 14-15 in runtime.rs)Uses /// Time (in seconds) to allow for guard to wait for pending requests after receiving SIGTERM. Defaults
// to 1 hour. // ← Should be ///
guard_shutdown_duration: Option<u32>,
/// Whether or not to allow running the engine when the previous version that was run is higher than
// the current version. // ← Should be ///
allow_version_rollback: Option<bool>,🔒 Security Considerations
⚡ Performance Considerations
🧪 Test CoverageMissing:
Recommendation: Add integration tests that:
📝 Code Quality & Style
🎯 Recommendations
SummaryThis is a solid improvement to the shutdown handling with good architectural decisions. The bugs identified are fixable and relatively minor. The centralized Recommendation: Approve with requested changes to fix the identified bugs. |
09cec26 to
f6632bf
Compare
09a68bc to
74f8abc
Compare
Pull Request Review: Improved SIGTERM HandlingThis PR introduces a centralized termination signal handling system with configurable graceful shutdown timeouts. Overall, the implementation is solid and follows good patterns, but there are several issues that should be addressed. 🐛 Critical Issues1. Bug in
|
PR Review: Improve SIGTERM Handling for the Entire RuntimeThis is a well-structured PR that significantly improves graceful shutdown handling across the Rivet runtime. The changes are comprehensive and demonstrate good systems programming practices. ✅ Strengths
🐛 Potential IssuesCritical: Race Condition in Worker Shutdown (gasoline/worker.rs:322-326)let remaining_workflows = wf_futs.into_iter().count();
if remaining_workflows == 0 {
tracing::info!("all workflows evicted");
} else {
tracing::warn!(remaining_workflows=?self.running_workflows.len(), "not all workflows evicted");
}Issue: You're counting Fix: Use tracing::warn!(remaining_workflows, "not all workflows evicted");Medium: Comment Inconsistency (config/runtime.rs:11-12, 14-15)The comments use /// Time (in seconds) to allow for guard to wait for pending requests after receiving SIGTERM. Defaults
// to 1 hour. ← Should be ///
guard_shutdown_duration: Option<u32>,
/// Whether or not to allow running the engine when the previous version that was run is higher than
// the current version. ← Should be ///
allow_version_rollback: Option<bool>,Low: Unused Variable (guard-core/server.rs:303)let join_fut = async { while let Some(_) = wf_futs.next().await {} };The pattern while wf_futs.next().await.is_some() {}Same issue in Low: Windows Signal HandlingOn Windows, 🎯 Performance Considerations
🔒 Security ConcernsNo security issues identified. The shutdown handling properly:
📊 Test CoverageThe PR doesn't show test updates. Consider:
🧹 Code QualityPositive:
Minor suggestions:
📝 DocumentationGood inline comments explaining the shutdown flow. Consider adding:
🎨 Style ComplianceFollows the repo's style guide:
🔄 Architecture ImpactBreaking Changes:
Both are internal APIs, so acceptable. Final Recommendation: Approve with Minor ChangesThe core implementation is solid. Please address:
The PR significantly improves operational robustness and is ready to merge after these fixes. |
f6fbd63 to
f53a47c
Compare
Code Review: Improved SIGTERM Handling for RuntimeThis PR centralizes and improves SIGTERM signal handling across the entire runtime. Overall, this is a well-designed improvement to the codebase with good architectural decisions. Below are my findings: Strengths
Issues Found1. Bug in
|
0014809 to
4dad8e2
Compare
f53a47c to
2990334
Compare
Code Review: Improved SIGTERM Handling for RuntimeThis PR improves graceful shutdown handling across the runtime by centralizing signal handling and adding configurable shutdown durations. Overall, this is a solid improvement to the shutdown logic. Below are my findings: ✅ Strengths
🐛 Potential IssuesCritical
Major
Minor
🔒 Security Concerns
⚡ Performance Considerations
🧪 Test Coverage
📝 Documentation
🎯 Recommendations
Overall, this is a valuable improvement to shutdown handling. The main issues are fixable and mostly related to edge cases in the shutdown logic. Once the critical bugs are addressed, this will significantly improve the runtime's reliability during restarts and deployments. |
PR Review: Improved SIGTERM HandlingI've reviewed this PR and the improvements to SIGTERM handling across the runtime look solid overall. Here's my detailed feedback: ✅ Strengths
🐛 Critical Issues1. Bug in worker.rs:322-326 - Inconsistent workflow countlet remaining_workflows = wf_futs.into_iter().count();
if remaining_workflows == 0 {
tracing::info!("all workflows evicted");
} else {
tracing::warn!(remaining_workflows=?self.running_workflows.len(), "not all workflows evicted");
}Problem: You're logging tracing::warn!(remaining_workflows, "not all workflows evicted");2. Comment typo in runtime.rs:11-12, 14-15/// Time (in seconds) to allow for guard to wait for pending requests after receiving SIGTERM. Defaults
// to 1 hour. <-- Single slash instead of triple slash
guard_shutdown_duration: Option<u32>,
/// Whether or not to allow running the engine when the previous version that was run is higher than
// the current version. <-- Single slash instead of triple slash
allow_version_rollback: Option<bool>,These should use 3. Potential race condition in TermSignal::stop()pub fn stop() {
if let Some((_, join_handle)) = HANDLER_CELL.get() {
join_handle.abort();
}
}If
|
| Category | Count |
|---|---|
| Critical bugs | 2 |
| Potential issues | 4 |
| Suggestions | 4 |
| Style issues | 1 |
Recommendation: Fix the critical bugs (inconsistent workflow count logging, doc comment syntax) before merging. The other issues are nice-to-haves but not blockers.
Overall, this is a substantial improvement to runtime reliability. Great work centralizing the signal handling! 🎉
2990334 to
0688959
Compare
4dad8e2 to
3df125e
Compare
Code Review: Improved SIGTERM Handling for RuntimeThis PR significantly improves the graceful shutdown behavior across the Rivet engine. The changes are well-structured and address an important operational concern. Here's my detailed feedback: ✅ Strengths
🐛 Potential Issues
|
PR Review: Improve SIGTERM Handling for the Entire RuntimeThis is a well-structured PR that significantly improves graceful shutdown handling across the Rivet runtime. The changes introduce configurable shutdown durations and a centralized signal handling mechanism. Strengths1. Centralized Signal HandlingThe new
2. Configuration ManagementNew
3. Improved Worker Shutdown (
|
3df125e to
ff8b85c
Compare
0688959 to
4fa0007
Compare
Merge activity
|
PR Review: Improved SIGTERM HandlingThis PR makes significant improvements to how the runtime handles termination signals. Overall, the changes are well-structured and address important operational concerns. Here's my detailed review: ✅ Strengths1. Centralized Signal Handling
2. Configurable Shutdown Durations
3. Improved Worker Shutdown (
4. Guard Graceful Shutdown (
🐛 Issues Found1. Bug in Worker Shutdown ( let remaining_workflows = wf_futs.into_iter().count();
if remaining_workflows == 0 {
tracing::info!("all workflows evicted");
} else {
tracing::warn!(remaining_workflows=?self.running_workflows.len(), "not all workflows evicted");
}Issues:
Fix needed: // Before the loop, save the count
let total_workflows = wf_futs.len();
// After the loop
let remaining_workflows = wf_futs.into_iter().count();
if remaining_workflows == 0 {
tracing::info!("all workflows evicted");
} else {
tracing::warn!(?remaining_workflows, "not all workflows evicted");
}2. Missing Comment Style Fix ( /// Time (in seconds) to allow for guard to wait for pending requests after receiving SIGTERM. Defaults
// to 1 hour. ← Should be /// not //
guard_shutdown_duration: Option<u32>,
/// Whether or not to allow running the engine when the previous version that was run is higher than
// the current version. ← Should be /// not //
allow_version_rollback: Option<bool>,The continuation lines should use
|
ff8b85c to
47d0825
Compare
4fa0007 to
2ac1b8e
Compare
Pull Request Review: Improve SIGTERM HandlingI've reviewed this PR that improves SIGTERM handling across the Rivet runtime. Overall, this is a well-structured improvement that centralizes signal handling and adds configurable graceful shutdown behavior. Here's my detailed feedback: ✅ Strengths
🐛 Potential IssuesCritical: Bug in worker.rs:322-326let remaining_workflows = wf_futs.into_iter().count();
if remaining_workflows == 0 {
tracing::info!("all workflows evicted");
} else {
tracing::warn!(remaining_workflows=?self.running_workflows.len(), "not all workflows evicted");
}Issue: After Fix: Use tracing::warn!(remaining_workflows, "not all workflows evicted");Medium: Inconsistent Comment Style in runtime.rs:11-15Lines 11-13 use /// Time (in seconds) to allow for guard to wait for pending requests after receiving SIGTERM.
/// Defaults to 1 hour.
guard_shutdown_duration: Option<u32>,
/// Whether or not to allow running the engine when the previous version that was run is higher than
/// the current version.
allow_version_rollback: Option<bool>,
|
Code Review: Improve SIGTERM Handling for the Entire RuntimeOverviewThis PR significantly improves graceful shutdown handling across the runtime by centralizing signal handling and adding configurable shutdown durations. The implementation is solid overall, but I've identified several issues that should be addressed. Critical Issues1. Incorrect Remaining Workflows Count (engine/packages/gasoline/src/worker.rs:326)Severity: High let remaining_workflows = wf_futs.into_iter().count();
if remaining_workflows == 0 {
tracing::info!("all workflows evicted");
} else {
tracing::warn!(remaining_workflows=?self.running_workflows.len(), "not all workflows evicted");
}Problem: After consuming Fix: tracing::warn!(remaining_workflows, "not all workflows evicted");2. Inconsistent Error Import Pattern (engine/packages/service-manager/src/lib.rs:11)Severity: Medium According to CLAUDE.md:
The code currently uses: use anyhow::{Context, Result, ensure};This is correct! However, ensure this pattern is consistent throughout the codebase. 3. Typo in Documentation (engine/packages/config/src/config/runtime.rs:12)Severity: Low /// Time (in seconds) to allow for guard to wait for pending requests after receiving SIGTERM. Defaults
// to 1 hour.Line 12 uses /// Time (in seconds) to allow for guard to wait for pending requests after receiving SIGTERM. Defaults
/// to 1 hour.
Potential Bugs4. Potential Race Condition in TermSignal Initialization (engine/packages/runtime/src/term_signal.rs:88)Severity: Medium let term_signal = TermSignalHandler::new()
.expect("failed initializing termination signal handler");Using Recommendation: Add a comment explaining why panic on signal handler initialization is acceptable, or make 5. Unused Error Handling (engine/packages/runtime/src/term_signal.rs:104)Severity: Low pub async fn recv(&mut self) -> bool {
let _ = self.0.changed().await;
*self.0.borrow()
}You're silently ignoring errors from 6. Missing Newline at EOF (engine/packages/service-manager/Cargo.toml)Severity: Low The file is missing a newline at the end: tracing.workspace = true
\ No newline at end of fileShould end with a newline per Unix conventions. Design & Architecture7. Graceful Shutdown Logic is Sound ✅The new centralized
8. Good Configuration Design ✅The new
9. Service Manager Improvements ✅The Performance Considerations10. FuturesUnordered Usage ✅Good use of 11. Shutdown Timeout ImplementationThe shutdown timeout logic using _ = tokio::time::sleep(shutdown_duration.saturating_sub(shutdown_start.elapsed())) => {However, consider that if Testing & Error Handling12. Missing TestsSeverity: Medium No tests are included for:
Recommendation: Add unit tests for the new 13. Error Logging Follows Conventions ✅Good adherence to CLAUDE.md logging patterns:
Code Quality14. Timestamp Naming InconsistencyPer CLAUDE.md:
The code uses 15. Import Organization ✅Imports are properly organized at the top of files, not inline within functions (per CLAUDE.md). Security Concerns16. Signal Handler Security ✅The centralized signal handler properly prevents signal handler re-registration, which could be a security issue in some contexts. The singleton pattern is appropriate here. SummaryStrengths:
Required Changes:
Recommended Changes:
Overall, this is a solid improvement to the codebase's shutdown handling. The architecture is sound, but please address the logging bug before merging. |

No description provided.