fix(pb): dont immediately reallocate actors on runner stopping #3413
Conversation
PR Review: Fix actor reallocation on runner stopping

Summary

This PR introduces a new

Positive Changes

Architecture Improvements

✅ New
✅ Better state tracking: Adding
✅ Cleaner separation of concerns: The refactoring of

Code Quality

✅ Removed unnecessary
✅ Better logging: The debug log in
✅ Idempotency improvements: Added guards to prevent duplicate signal sends (e.g., checking

Issues and Concerns

1. Potential Race Condition in
Force-pushed from 79bbca9 to 87f1c1a
Force-pushed from c0c9897 to 44cfde4
PR Review: fix(pb): dont immediately reallocate actors on runner stopping

Summary

This PR refactors the actor lifecycle management in Pegboard to introduce a new

Code Quality & Best Practices

Strengths:
Concerns:
Potential Bugs & Issues

Critical:
Medium Priority:
Minor:
Logic & Correctness
However, at
State Machine Clarity:
Performance Considerations
Security Concerns
Test Coverage

Major Concern:
Specific Code Issues
Code Review: Fix Pegboard Actor Immediate Reallocation on Runner Stopping

Overview

This PR addresses an issue where actors were being immediately reallocated when their runner was stopping, which is undesirable behavior. The solution introduces a new

Positive Changes ✅

1. Improved State Management
2. Separation of Concerns
This is conceptually sound and addresses the core issue.
3. Code Cleanup
4. Better Error Handling

Concerns & Issues 🔍

1. State Management Complexity
2. Idempotency Check Missing
3. Race Condition Potential

```rust
if !state.going_away {
	let Some(runner_workflow_id) = state.runner_workflow_id else {
		return Ok(Loop::Continue);
	};
	// ... state modifications ...
}
```

If

4.
5. Documentation Gaps 📝
6. Force Stop on Destroy
7. Formatting Inconsistencies 🎨
While these follow the hard tab convention, they differ from the original style. Consider consistency with the existing codebase style.
8. Test Coverage ❗

Potential Bugs 🐛

1. Duplicate Stop Commands
2. Unconditional

Performance Considerations ⚡

Security Concerns 🔒

No security issues identified. The changes are internal to the actor lifecycle management system.

Recommendations 📋

Architecture Suggestion 💡

Consider modeling the actor state more explicitly:

```rust
enum ActorRunState {
	Allocated { runner_id: Id, runner_workflow_id: Id },
	Sleeping { will_wake: bool },
	Stopping { runner_id: Id, runner_workflow_id: Id },
	GoingAway { runner_id: Id, runner_workflow_id: Id },
	Rescheduling,
}
```

This would make invalid states unrepresentable and clarify the state machine.

Verdict

This PR addresses a real issue and the approach is sound. However, the increasing complexity of boolean state flags is concerning and aligns with the existing TODO comment. The changes work but would benefit from the refactoring mentioned in runtime.rs:20-21 to make the state machine more transparent.

Recommendation: Approve with suggestions for follow-up work on state machine refactoring and test coverage.
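As a follow-up to the ActorRunState suggestion above, here is a minimal sketch of how an explicit transition on that enum could replace the boolean flags — the `Id` alias and method name are placeholders, not the actual Pegboard types:

```rust
// Sketch only: repeats the suggested enum and adds one explicit transition.
// `Id` is a placeholder for the real identifier type.
type Id = u64;

enum ActorRunState {
	Allocated { runner_id: Id, runner_workflow_id: Id },
	Sleeping { will_wake: bool },
	Stopping { runner_id: Id, runner_workflow_id: Id },
	GoingAway { runner_id: Id, runner_workflow_id: Id },
	Rescheduling,
}

impl ActorRunState {
	/// Only an allocated actor can start going away; any other state is left untouched,
	/// which makes a duplicate GoingAway signal a harmless no-op instead of a flag check.
	fn on_runner_going_away(self) -> ActorRunState {
		match self {
			ActorRunState::Allocated { runner_id, runner_workflow_id } => {
				ActorRunState::GoingAway { runner_id, runner_workflow_id }
			}
			other => other,
		}
	}
}
```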
Force-pushed from 44cfde4 to 851e83c
Force-pushed from 87f1c1a to 378e6a2
Code Review

Overview

This PR addresses an important issue where actors were being immediately reallocated when a runner was stopping. The fix introduces a new

Key Changes Summary

Positive Aspects

Architecture & Design

✅ Good separation of concerns: The new
✅ Simplified destroy workflow: Removing the
✅ Better code organization: The

Code Quality

✅ Consistent error handling: Proper use of
✅ Good use of structured logging: Following the tracing patterns correctly (e.g.,
✅ Idiomatic Rust: Proper use of pattern matching, enums, and

Concerns & Suggestions

1. State Management Complexity
Force-pushed from 378e6a2 to f5922d4
Test comment
Pull Request Review: fix(pb): dont immediately reallocate actors on runner stopping

Summary

This PR introduces a new

Code Quality & Best Practices

Strengths

Areas for Improvement

Potential Issues

Critical

None identified - The logic appears sound.

Minor

Performance Considerations

Security Concerns

None identified - The changes do not introduce any new security concerns. The signal-based architecture already handles authorization/routing.

Test Coverage

Missing Test Cases

While I cannot see the test files in this PR, consider adding tests for:

Suggested Test Locations

Based on the CLAUDE.md patterns, you should add tests at:

Architecture Alignment

The changes align well with the codebase patterns:

Additional Notes

Recommendation

✅ Approve with minor suggestions

The core logic is solid and addresses a real issue with unnecessary actor rescheduling during runner shutdown. The suggested improvements are mostly cosmetic (documentation, minor optimizations) and do not block merging.

Before Merging

Great work on improving the actor lifecycle management! The distinction between "lost" and "going away" actors is an important one for system reliability.
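Expanding on the Missing Test Cases note above, a minimal skeleton of one such test — the steps are left as comments because the actual pegboard test harness is not visible from this PR page:

```rust
// Sketch only: the assertions are described in comments rather than real harness calls.
#[cfg(test)]
mod runner_going_away_tests {
	#[tokio::test]
	#[ignore = "sketch only; wire up to the real pegboard test harness"]
	async fn actor_is_not_reallocated_while_runner_is_stopping() {
		// 1. Start a runner and allocate an actor onto it.
		// 2. Signal the runner that it is stopping (going away) gracefully.
		// 3. Assert the actor is marked as going away rather than lost.
		// 4. Assert no new allocation is attempted until the actor actually stops.
	}
}
```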
Force-pushed from f5922d4 to 0bfb5d7
Force-pushed from 47e1dfc to 05fd9f6
Code Review: PR #3413 - fix(pb): dont immediately reallocate actors on runner stopping

Summary

This PR refactors the actor lifecycle management to introduce a new

Positive Changes

1. Better Separation of Concerns
2. Simplified
3. Enhanced State Tracking
4. Improved Code Organization
Issues & Concerns

1. Potential Race Condition in
Code Review: PR #3413 - Don't immediately reallocate actors on runner stopping

Summary

This PR introduces a significant improvement to the Pegboard actor lifecycle management by distinguishing between a runner stopping gracefully (

Positive Changes ✅

1. Better State Modeling
2. Improved Code Organization
3. Bug Fix

The core issue is properly addressed - when a runner goes away gracefully, actors are marked appropriately without forcing immediate reallocation, which should reduce unnecessary churn.

Issues & Concerns
Force-pushed from 0bfb5d7 to 7ab4b5a
Force-pushed from 05fd9f6 to 4b7b6b5
Code Review: PR #3413 - Fix actor reallocation on runner stopping

Summary

This PR introduces a new

Positive Changes

1. Improved State Modeling ✅
2. Cleaner Separation of Concerns ✅
3. Better Error Recovery ✅

Issues and Concerns

1. Missing Idempotency Guards

The

```rust
Main::GoingAway(sig) => {
	// ...
	if !state.going_away {
		// Multiple signal sends could arrive before state is persisted
		state.going_away = true;
		// ... send CommandStopActor signal
	}
}
```

Issue: If multiple

Recommendation: Consider adding a trace log when the signal is ignored due to

2. Generation Mismatch Handling

At

```rust
if sig.generation != state.generation {
	return Ok(Loop::Continue);
}
```

Concern: What happens if a

Recommendation: Add a trace warning for generation mismatches to aid debugging:

```rust
if sig.generation != state.generation {
	tracing::debug!(?sig.generation, current_generation=?state.generation, "ignoring GoingAway signal for different generation");
	return Ok(Loop::Continue);
}
```

3. Comment Accuracy 📝

At

```rust
// We don't know the state of the previous generation of this actor actor if it becomes lost
// ^^^^^^ duplicate word
```

4. State Cleanup Order 🤔

In

```rust
state.gc_timeout_ts = None;
state.going_away = false;
state.stopping = false;
state.runner_id = None;
let old_runner_workflow_id = state.runner_workflow_id.take();
```

Question: Is there a reason to clear these flags before deallocating? If deallocation fails, the state might be inconsistent.

Recommendation: Consider whether cleanup should happen after successful deallocation, or add comments explaining why early cleanup is correct.

5. Missing Test Coverage

No test files were found in the pegboard package for this functionality.

Recommendation: Add integration or unit tests covering:

6. Force Reschedule Logic 🤔

At

```rust
let force_reschedule = match &variant {
	StoppedVariant::Normal { code: protocol::StopCode::Ok } => {
		state.reschedule_state = Default::default();
		false
	}
	StoppedVariant::Lost { force_reschedule } => *force_reschedule,
	_ => false,
};
```

Concern: The pattern matching doesn't explicitly handle all

Recommendation: Make this more explicit:

```rust
let force_reschedule = match &variant {
	StoppedVariant::Normal { code } => {
		if matches!(code, protocol::StopCode::Ok) {
			state.reschedule_state = Default::default();
		}
		false
	}
	StoppedVariant::Lost { force_reschedule } => *force_reschedule,
};
```

7. Potential for Lost CommandStopActor

At

```rust
Main::Destroy(_) => {
	if let Some(runner_workflow_id) = state.runner_workflow_id {
		ctx.signal(crate::workflows::runner::Command {
			inner: protocol::Command::CommandStopActor(protocol::CommandStopActor {
				actor_id: input.actor_id.to_string(),
				generation: state.generation,
			}),
		})
		.to_workflow_id(runner_workflow_id)
		.send()
		.await?;
	}
	// ...
}
```

Concern: If the signal send fails (network partition, runner workflow gone), the actor workflow will error and potentially retry, but the runner might already be stopped.

Recommendation: Consider whether this should use a best-effort send pattern, or if the current error propagation is intentional.

Style and Convention Issues

1. Logging Consistency 📝

At

```rust
tracing::debug!(?variant, "actor stopped");
```

Good adherence to CLAUDE.md guidelines! ✅

2. Import Organization ✅

Imports are properly organized at the top of files, following the style guide.

3. Serialization ✅

Proper use of

Performance Considerations

1. Signal Storm Prevention

When a runner stops, it sends

Consideration: The current implementation is fine for moderate actor counts, but might need batching or rate limiting for very large runners.

2. No New Database Queries ✅

The changes don't introduce additional database roundtrips, which is good for performance.

Security Considerations

1. No Security Issues Identified ✅

The changes don't introduce obvious security vulnerabilities. Actor isolation and generation tracking prevent cross-actor interference.

Recommendations Summary

High Priority:

Medium Priority:

Low Priority:

Overall Assessment

This is a solid improvement to the actor lifecycle management system. The code is well-structured and addresses a real issue with unnecessary reallocations. The main concerns are around edge cases, test coverage, and operational observability. The changes follow the project's style guidelines and maintain backward compatibility well.

Recommendation: Approve with minor changes requested for test coverage and logging improvements.

Generated with Claude Code
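Expanding on the best-effort send recommendation in point 7 above, a minimal sketch of the pattern — `send_stop_command` is a hypothetical stand-in for the real workflow signal call, not the actual Pegboard API:

```rust
// Sketch only: a best-effort send logs the failure and reports it to the caller
// instead of aborting the destroy path with `?`.
async fn send_stop_command() -> anyhow::Result<()> {
	// The real code would send protocol::Command::CommandStopActor to the runner workflow.
	Ok(())
}

async fn stop_runner_best_effort() -> bool {
	match send_stop_command().await {
		Ok(()) => true,
		Err(err) => {
			// The runner may already be gone; destroying the actor should not fail
			// just because the stop command could not be delivered.
			tracing::warn!(?err, "failed to send stop command, continuing destroy");
			false
		}
	}
}
```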
Force-pushed from 7ab4b5a to ca5973c
Force-pushed from 4b7b6b5 to 70e0557
Pull Request Review

Summary

This PR introduces a new

Code Quality & Best Practices

✅ Strengths
PR Review: Fix actor reallocation on runner stopping

Summary

This PR addresses an issue where actors were being immediately reallocated when a runner was stopping, which could lead to unnecessary churn and resource waste. The fix introduces a new

Code Quality & Best Practices ✅

Strengths:

Code style observations:

Architecture & Logic 🎯

Key improvements:

Potential concerns:

Security Concerns 🔒

No security issues identified. The changes maintain proper generation tracking and idempotency checks.

Performance Considerations ⚡

Positive impacts:

Potential Bugs 🐛

Minor issue found:

All other checks passed:

Test Coverage 🧪

Missing from this PR:

Documentation 📝

Suggestions:

Summary & Recommendation

Overall assessment: This is a solid fix that addresses a real problem with actor reallocation during runner shutdown. The code quality is good and follows existing patterns.

Recommendation: ✅ Approve with minor suggestions

Before merging:

Nice to have (follow-up):

Great work on identifying and fixing this issue! 🎉
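To make the fix described above concrete, here is a simplified sketch of the graceful-vs-lost decision — the `StopReason` enum and function are illustrative placeholders, not the actual workflow code:

```rust
// Sketch only: the decision this PR is about, reduced to a single match.
enum StopReason {
	/// The runner reported it is draining; the actor will receive a clean stop.
	GoingAway,
	/// The runner disappeared without a clean stop.
	Lost,
}

/// Whether to reschedule the actor the moment we learn its runner is going down.
fn reallocate_immediately(reason: &StopReason) -> bool {
	match reason {
		// Graceful shutdown: wait for the actor's own stop event instead of
		// immediately grabbing a new runner slot.
		StopReason::GoingAway => false,
		// Lost runner: the actor is already gone, so reschedule right away.
		StopReason::Lost => true,
	}
}
```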
Code Review

This PR addresses an important issue where actors were being immediately reallocated when a runner was stopping, which could cause unnecessary churn. The solution introduces a new

✅ Strengths
🔍 Potential Issues
💡 Suggestions
🧪 Testing Recommendations
🔒 Security & Performance
📝 Minor Code Quality Notes
Summary

This is a solid improvement to the actor lifecycle management. The main concerns are potential duplicate stop commands and ensuring the state machine handles all edge cases correctly. The refactoring simplifies the destroy workflow and adds clearer semantics for different shutdown scenarios.

Recommendation: Approve with minor suggestions. The potential duplicate stop command issue should be verified, but likely won't cause functional problems.
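On the duplicate stop command concern raised in the summary above, a minimal sketch of the kind of guard that prevents re-sending — the struct and function names are illustrative, not the actual workflow state:

```rust
// Sketch only: a guard that turns repeated stop requests into no-ops.
struct LifecycleState {
	stopping: bool,
	going_away: bool,
}

async fn request_stop(state: &mut LifecycleState) {
	// Only the first request sends the stop command; later signals for the same
	// generation are ignored with a debug log for observability.
	if state.stopping || state.going_away {
		tracing::debug!("stop already requested, ignoring duplicate signal");
		return;
	}
	state.stopping = true;
	// ... send CommandStopActor to the runner workflow here ...
}
```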
Force-pushed from ca5973c to 86c0f99
Force-pushed from 70e0557 to 277e2db
Merge activity
PR Review: WebSocket Hibernation & Runner Allocation Fix

Summary

This PR introduces WebSocket hibernation for the gateway and fixes an issue where actors were immediately reallocated when their runner was stopping. The changes span multiple components including Guard, Pegboard Gateway, Pegboard actor lifecycle, and the Gasoline workflow engine.

Code Quality & Best Practices ✅

Strengths:

Code Structure:

Potential Issues & Concerns 🔍

1. Critical: Unreachable Error Check (packages/guard-core/src/proxy_service.rs:1911-1920)

```rust
ensure!(
	!ws_hibernation_close,
	"should not be hibernating again after receiving a close frame during hibernation"
);
```

Issue: The comment says "This should be unreachable" but uses

Recommendation: Either change to

2. Potential Race Condition (packages/pegboard/src/workflows/actor/mod.rs:472-508)

The

```rust
Main::GoingAway(sig) => {
	if !state.going_away {
		let Some(runner_workflow_id) = state.runner_workflow_id else {
			return Ok(Loop::Continue);
		};
		// ... sets state.going_away = true and sends stop command
	}
}
```

Question: What happens if multiple

Recommendation: Add debug logging when ignoring duplicate

```rust
} else {
	tracing::debug!("ignoring duplicate GoingAway signal");
}
```

3. Missing Error Context (packages/pegboard-gateway/src/lib.rs:529-535)

```rust
if let Err(err) = self
	.shared_state
	.send_message(request_id, close_message)
	.await
{
	tracing::error!(?err, "error sending close message");
}
```

Issue: Error is logged but swallowed. Is this intentional? Should this affect the return value?

Recommendation: Add context about why it's safe to ignore this error, or return early if it's critical.

4. Configuration Value Change (packages/gasoline/src/worker.rs:17)

```diff
-const PING_INTERVAL: Duration = Duration::from_secs(20);
+const PING_INTERVAL: Duration = Duration::from_secs(10);
```

Question: Why was ping interval reduced from 20s to 10s? This doubles the ping frequency. Is this related to the hibernation feature or a separate optimization?

Recommendation: Add a comment explaining the rationale or mention it in the PR description.

Performance Considerations 🚀

Positive:

Concerns:

Security Concerns 🔒

Generally Secure:

Minor Concerns:

Test Coverage ✅

Excellent Coverage:

Gaps:

Recommendation: Add tests for:

```rust
#[tokio::test]
async fn test_websocket_hibernation_with_client_close() {
	// Send close during hibernation, verify proper cleanup
}

#[tokio::test]
async fn test_websocket_hibernation_timeout() {
	// Verify timeout handling during hibernation
}
```

Specific File Comments
Pull Request Review

Overview

This PR addresses a critical issue where actors were being immediately reallocated when their runner was stopping, which could lead to unnecessary churn and resource waste. The fix introduces a websocket hibernation mechanism to gracefully handle runner transitions.

Code Quality & Architecture

✅ Strong Points

Potential Issues & Concerns

🔴 Critical

1. Race condition in hibernation logic (

```rust
ensure!(
	!ws_hibernation_close,
	"should not be hibernating again after receiving a close frame during hibernation"
);
```

This

2. Missing timeout on hibernation (

```rust
tokio::time::sleep(HIBERNATION_TIMEOUT).await;
```

The test implementation sleeps for 2 seconds, but there's no timeout enforcement in the actual

3. Unbounded retry loop potential (
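On point 2 above (no timeout enforcement outside the test), a minimal sketch of bounding the hibernation wait with `tokio::time::timeout` — `HIBERNATION_TIMEOUT` and `wait_for_resume` are placeholders, not the actual gateway code:

```rust
use std::time::Duration;

// Placeholder value; the real timeout would come from the gateway configuration.
const HIBERNATION_TIMEOUT: Duration = Duration::from_secs(2);

// `wait_for_resume` stands in for whatever future resolves when the runner returns;
// the sketch only shows how to bound that wait instead of sleeping unconditionally.
async fn hibernate_with_timeout(
	wait_for_resume: impl std::future::Future<Output = ()>,
) -> Result<(), tokio::time::error::Elapsed> {
	tokio::time::timeout(HIBERNATION_TIMEOUT, wait_for_resume).await
}
```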

No description provided.