
Conversation

@MasterPtato
Contributor

No description provided.

@vercel

vercel bot commented Nov 6, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| rivetkit-serverless | Error | Error | | Nov 13, 2025 10:42pm |

3 Skipped Deployments

| Project | Deployment | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| rivet-cloud | Ignored | Ignored | | Nov 13, 2025 10:42pm |
| rivet-inspector | Ignored | Ignored | Preview | Nov 13, 2025 10:42pm |
| rivet-site | Ignored | Ignored | Preview | Nov 13, 2025 10:42pm |

Contributor Author

MasterPtato commented Nov 6, 2025


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@pkg-pr-new

pkg-pr-new bot commented Nov 6, 2025

More templates

@rivetkit/actor

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/actor@3400

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3400

@rivetkit/core

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/core@3400

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3400

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3400

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3400

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3400

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3400

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3400

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3400

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3400

commit: 7239ed4

@claude

claude bot commented Nov 6, 2025

Code Review: WebSocket Hibernation Feature

Summary

This PR implements WebSocket hibernation functionality for the pegboard gateway, allowing WebSocket connections to sleep when actors are unavailable and wake up when they become ready. Overall, the implementation is well-structured with good separation of concerns.

✅ Strengths

  1. Clear Intent: The rename from WebSocketServiceRetry to WebSocketServiceHibernate better communicates the feature's purpose
  2. Good Abstraction: The handle_websocket_hibernation trait method provides a clean interface for different implementations
  3. Proper Error Handling: The code maintains consistent error handling patterns throughout
  4. Documentation: Inline comments explain the hibernation lifecycle well (proxy_service.rs:1908-1912)

🐛 Potential Bugs & Issues

Critical Issues

1. Missing Peek Consumption After Non-Message Events (pegboard-gateway/src/lib.rs:586-611)

When Ping, Pong, or Frame messages are encountered in the hibernate_ws function, they are peeked but never consumed from the stream. This creates an infinite loop - the same message will be peeked repeatedly.

The code currently has:

match msg {
    Message::Binary(_) | Message::Text(_) => return Ok(HibernationResult::Message),
    Message::Close(_) => return Ok(HibernationResult::Close),
    // Ignore rest
    _ => {}  // BUG: Peeked message not consumed!
}

Fix: Call pinned.try_next().await? before continuing the loop for ignored message types.
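
A minimal sketch of that fix, reusing the names from the snippet above (only the ignore arm changes):

match msg {
    Message::Binary(_) | Message::Text(_) => return Ok(HibernationResult::Message),
    Message::Close(_) => return Ok(HibernationResult::Close),
    // Consume Ping/Pong/Frame so the next peek() sees a new message
    _ => {
        pinned.try_next().await?;
    }
}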

2. Unreachable Code Path (pegboard-gateway/src/lib.rs:595-596)

The bail!("should be unreachable") will never execute since try_next().await? already propagates the error. Either remove the bail! or replace it with unreachable!().

Moderate Issues

3. Race Condition in Hibernation (pegboard-gateway/src/lib.rs:561-575)

The tokio::select! will cancel the losing branch. If a message arrives at the exact moment the actor becomes ready, there's potential for edge cases. This may be acceptable behavior, but worth documenting.

4. TypeScript Typo Fixed (rivetkit-typescript/packages/rivetkit/src/actor/config.ts:74)

Good catch fixing canHibernatWebSocket → canHibernateWebSocket. Verify that documentation and examples using this API are also updated if needed.

⚡ Performance Considerations

Lock Contention: The receiver is wrapped in Arc<Mutex<>>, and during hibernation, the mutex will be held for potentially long periods while waiting for messages. This is likely acceptable since WebSocket handling is typically single-threaded per connection.

Busy Loop Potential: Due to bug #1, ignored message types will cause a tight busy loop. Once fixed, the loop will properly block on peek().

🔒 Security Concerns

1. Close Frame Handling During Hibernation

When a client sends a Close frame during hibernation, the connection terminates gracefully. Ensure that:

  • Resources are properly cleaned up when returning HibernationResult::Close
  • The actor lifecycle is notified that the connection was closed
  • No state corruption occurs if the actor becomes ready after the close

2. Timeout Handling

I don't see an explicit timeout for the hibernation period. Consider:

  • What happens if an actor never becomes ready?
  • Should there be a maximum hibernation duration?

🧪 Test Coverage

Missing Test Coverage (concerning):

  • No tests for the new hibernation functionality
  • No tests for hibernate_ws function behavior
  • No tests for the peek/consume logic
  • No tests for TypeScript config changes

Recommendations:

  1. Add unit tests for hibernate_ws with various message types
  2. Add integration tests for:
    • Actor becomes ready during hibernation
    • Client sends message during hibernation
    • Client closes connection during hibernation
    • Multiple hibernation cycles
  3. Test edge cases (actor never becomes ready, rapid hibernate/wake cycles); a skeleton for the peek/consume case is sketched below
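
For the peek/consume logic specifically, here is a hedged sketch of a self-contained unit test, driving the loop with an in-memory stream instead of a real websocket. Msg and the loop body are illustrative stand-ins for tungstenite's Message and hibernate_ws:

use anyhow::Result;
use futures_util::{stream, StreamExt, TryStreamExt};

#[derive(Debug, PartialEq)]
enum Msg {
	Ping,
	Text(&'static str),
}

#[tokio::test]
async fn ignored_frames_are_consumed_before_waking() -> Result<()> {
	let items: Vec<Result<Msg>> = vec![Ok(Msg::Ping), Ok(Msg::Text("wake"))];
	let mut rx = stream::iter(items).peekable();
	let mut pinned = std::pin::Pin::new(&mut rx);

	// Mirror the hibernate_ws loop: consume ignored frames, stop on a data frame.
	loop {
		match pinned.as_mut().peek().await {
			Some(Ok(Msg::Text(_))) | None => break,
			_ => {
				// The Ping was only peeked; consume it so the loop advances.
				pinned.try_next().await?;
			}
		}
	}

	// The data frame was never consumed, so the main handler still sees it.
	assert_eq!(pinned.try_next().await?, Some(Msg::Text("wake")));
	Ok(())
}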

📝 Code Quality & Best Practices

Import Style (guard/src/routing/api_public.rs:4)

The code uses use anyhow::{Context, Result}; but CLAUDE.md states: "Import anyhow using use anyhow::* instead of importing individual types"

Consider changing to use anyhow::*; for consistency.

Formatting & Logging: Good adherence to project standards with hard tabs and structured logging.

🎯 Summary of Action Items

Must Fix (Blocking):

  1. Fix infinite loop bug when ignoring Ping/Pong messages
  2. Remove or fix unreachable code in error handling path

Should Fix (High Priority):

  1. Add comprehensive test coverage for hibernation functionality
  2. Document or handle the race condition in tokio::select!
  3. Consider adding hibernation timeout limits

Nice to Have:

  1. Fix import style to match CLAUDE.md conventions
  2. Add explicit resource cleanup verification for close-during-hibernation
  3. Document the hibernation lifecycle in code comments or docs

Overall Assessment

This is a well-architected feature with good separation of concerns. The main concerns are:

  1. The critical bug in message consumption
  2. Lack of test coverage
  3. Potential edge cases around timing and resource cleanup

Once these issues are addressed, this will be a solid addition to the codebase. Great work on improving the semantics and structure of the WebSocket retry/hibernation logic!

@claude

claude bot commented Nov 6, 2025

PR Review: WebSocket Hibernation Feature

Summary

This PR implements WebSocket hibernation functionality, allowing WebSocket connections to pause when the downstream actor is not ready and resume when it becomes available. This is a significant architectural improvement for handling actor lifecycle management.

✅ Strengths

1. Clean Architecture

  • The new handle_websocket_hibernation trait method provides a clear extension point for custom hibernation behavior
  • Separation of concerns between retry logic and hibernation logic is well-designed
  • The rename from WebSocketServiceRetry to WebSocketServiceHibernate better reflects the actual semantics

2. Smart Peekable Implementation

The use of Peekable stream wrapper (websocket_handle.rs:11-12) is clever:

pub type WebSocketReceiver =
	Peekable<futures_util::stream::SplitStream<WebSocketStream<TokioIo<Upgraded>>>>;

This allows peeking at messages during hibernation without consuming them, which is critical for the hibernation behavior.

3. Proper Error Handling

The hibernate_ws function (pegboard-gateway/src/lib.rs:586-611) correctly handles edge cases:

  • Differentiates between message arrival, close frames, and errors
  • Uses peek to avoid consuming messages that will be processed after hibernation ends

4. TypeScript Integration

The changes to actor/instance.ts:2072-2073 properly implement the hibernatable connection check, enabling actors to sleep when only hibernatable WebSockets are active.


🐛 Potential Issues

1. Logic Bug in hibernate_ws Function ⚠️

Location: pegboard-gateway/src/lib.rs:590-611

The current implementation has an infinite loop issue:

loop {
    if let Some(msg) = pinned.as_mut().peek().await {
        let Ok(msg) = msg else {
            pinned.try_next().await?;
            bail!("should be unreachable");
        };
        match msg {
            Message::Binary(_) | Message::Text(_) => return Ok(HibernationResult::Message),
            Message::Close(_) => return Ok(HibernationResult::Close),
            // Ignore rest
            _ => {}  // ⚠️ This continues the loop without consuming the message!
        }
    } else {
        return Ok(HibernationResult::Close);
    }
}

Problem: When receiving Message::Ping, Message::Pong, or Message::Frame, the code hits the _ => {} branch and loops infinitely on the same peeked message.

Suggested Fix:

loop {
    if let Some(msg) = pinned.as_mut().peek().await {
        let Ok(msg) = msg else {
            pinned.try_next().await?;
            bail!("should be unreachable");
        };
        match msg {
            Message::Binary(_) | Message::Text(_) => return Ok(HibernationResult::Message),
            Message::Close(_) => return Ok(HibernationResult::Close),
            // Consume and ignore ping/pong/frame messages
            _ => {
                pinned.try_next().await?;
            }
        }
    } else {
        return Ok(HibernationResult::Close);
    }
}

2. Typo in TypeScript 🔤

Location: rivetkit-typescript/packages/rivetkit/src/actor/config.ts:74

The option name was corrected from canHibernatWebSocket to canHibernateWebSocket, but this is a breaking change for any existing code using the old spelling. Consider:

  • Adding a deprecation notice
  • Supporting both spellings temporarily with a console warning
  • Documenting this in release notes

3. Missing Test Implementation 🧪

Location: guard-core/tests/custom_serve.rs:106-108

async fn handle_websocket_hibernation(&self, _websocket: WebSocketHandle) -> Result<()> {
    todo!();
}

The test implementation uses todo!(), which means tests will panic if hibernation is actually triggered. This should either:

  • Be implemented with a no-op or test-specific behavior
  • Have tests added that explicitly test hibernation behavior

Recommendation: Add a test that:

  1. Establishes a WebSocket connection
  2. Triggers hibernation (returns WebSocketServiceHibernate error)
  3. Verifies the connection stays open
  4. Sends a message and verifies the connection resumes

🔒 Security Considerations

1. Resource Exhaustion

During hibernation, the WebSocket connection stays open indefinitely while waiting for either:

  • Actor to become ready
  • Client to send a message

Concern: A malicious client could open many WebSocket connections and leave them idle, consuming server resources.

Recommendation: Consider adding a configurable maximum hibernation timeout to prevent indefinite resource holding.
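
A minimal sketch of such a cap, assuming tokio::time::timeout wraps the hibernation future; the constant and the stub are illustrative, not part of this PR:

use std::time::Duration;

use anyhow::Result;
use tokio::time::timeout;

enum HibernationResult {
	Message,
	Close,
}

// Illustrative cap; the PR defines no hibernation timeout today.
const MAX_HIBERNATION: Duration = Duration::from_secs(15 * 60);

// Stand-in for the real hibernate_ws from this PR.
async fn hibernate_ws_stub() -> Result<HibernationResult> {
	Ok(HibernationResult::Message)
}

async fn hibernate_with_cap() -> Result<HibernationResult> {
	match timeout(MAX_HIBERNATION, hibernate_ws_stub()).await {
		Ok(res) => res,
		// Timer elapsed: treat the hibernation as closed so resources are freed.
		Err(_) => Ok(HibernationResult::Close),
	}
}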

2. Actor Ready Event Subscription

let mut ready_sub = self
    .ctx
    .subscribe::<pegboard::workflows::actor::Ready>(("actor_id", self.actor_id))
    .await?;

Question: What happens if the actor was already ready before the subscription? Could this cause a race condition where the hibernation never ends?

Recommendation: Check actor readiness before subscribing to ensure we don't miss the ready event.
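
One way to rule that race out is to subscribe before checking readiness. The sketch below assumes a hypothetical minimal interface (subscribe_ready / is_ready) standing in for the real workflow context:

use anyhow::Result;
use tokio::sync::oneshot;

// Hypothetical stand-in for the real workflow context API.
trait ActorCtx {
	async fn subscribe_ready(&self) -> Result<oneshot::Receiver<()>>;
	async fn is_ready(&self) -> Result<bool>;
}

// Subscribe first, then check: any Ready event published after the
// subscription exists is captured, so an actor that was already ready
// before hibernation began cannot be missed.
async fn wait_for_ready<C: ActorCtx>(ctx: &C) -> Result<()> {
	let ready_sub = ctx.subscribe_ready().await?;
	if ctx.is_ready().await? {
		return Ok(());
	}
	let _ = ready_sub.await;
	Ok(())
}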


⚡ Performance Considerations

1. Lock Contention on WebSocket Receiver

The hibernate_ws function holds the ws_rx lock for the entire hibernation period:

let mut guard = ws_rx.lock().await;

This prevents other parts of the code from accessing the receiver. This appears intentional and correct for this use case, but worth noting.

2. Multiple Route Resolutions

After hibernation, the code re-resolves the route (proxy_service.rs:1955-2014). This is correct behavior, but for high-frequency hibernation scenarios, consider caching route resolution results with a short TTL.


📝 Code Quality

1. Logging Consistency

Good use of structured logging throughout:

tracing::debug!("actor has become ready during hibernation");

Follows the CLAUDE.md guidelines properly.

2. Error Context

Good use of .context() in api_public.rs:

.context("failed to call api-public service")?;

Follows error handling patterns correctly.

3. Comment Quality 👍

Excellent inline documentation:

// After this function returns:
// - the route will be resolved again
// - the websocket will connect to the new downstream target
// - the gateway will continue reading messages from the client ws
//   (starting with the message that caused the hibernation to end)

🧪 Test Coverage

Current: Basic smoke tests exist but hibernation-specific tests are missing.

Recommended Additional Tests:

  1. Test hibernation triggered by actor not ready
  2. Test wake-up on actor ready event
  3. Test wake-up on client message during hibernation
  4. Test close frame during hibernation
  5. Test timeout scenarios (if implemented)
  6. Test concurrent hibernation of multiple WebSockets

📊 Overall Assessment

Code Quality: 8/10
Architecture: 9/10
Security: 7/10 (needs timeout protection)
Test Coverage: 5/10 (needs hibernation-specific tests)
Documentation: 8/10

Verdict: Approve with Minor Changes

This is a well-architected feature with good separation of concerns. The primary issue is the infinite loop bug in hibernate_ws that needs to be fixed before merging. Once that's addressed, this is a solid implementation.


📋 Action Items

Must Fix Before Merge:

  • Fix infinite loop in hibernate_ws for ping/pong messages

Should Address:

  • Implement test hibernation behavior (remove todo!())
  • Consider adding maximum hibernation timeout
  • Add hibernation-specific integration tests

Nice to Have:

  • Document breaking change in TypeScript config option name
  • Add race condition check for actor ready state
  • Consider caching route resolution results

Great work on this feature! The hibernation mechanism is a clever solution to the actor lifecycle problem. 🎉

@claude

claude bot commented Nov 6, 2025

WebSocket Hibernation PR Review

This PR implements WebSocket hibernation, allowing actors to sleep while maintaining WebSocket connections by pausing message processing until the actor becomes ready again. Overall, the implementation is well-structured and follows good practices. Here are my findings:

✅ Strengths

  1. Clean Architecture: The hibernation abstraction is well-designed with the handle_websocket_hibernation trait method, allowing different services to implement their own hibernation logic.

  2. Proper Error Handling: Changed from WebSocketServiceRetry to WebSocketServiceHibernate error provides better semantic clarity about what's happening.

  3. Smart Use of peek(): Using Peekable stream allows inspecting messages without consuming them, which is exactly what's needed for hibernation - check if there's a message to wake up, but don't consume it yet.

  4. Good Logging: Structured logging throughout makes debugging easier (e.g., tracing::debug!(?err, "websocket handler error")).

  5. Configuration Flexibility: The TypeScript side allows both boolean and function-based canHibernateWebSocket configuration for fine-grained control.

  6. Documentation Update: Good catch updating CLAUDE.md to discourage glob imports from anyhow.

🐛 Potential Issues

  1. Grammar in Comment (pegboard-gateway/src/lib.rs:601)

    // We don't care about the close frame because were currently hibernating

    Should be we're, not were.

  2. Unreachable Code Logic (pegboard-gateway/src/lib.rs:595-596)

    pinned.try_next().await?;
    bail!("should be unreachable");

    The bail! is indeed unreachable since try_next().await? will propagate the error. Consider either:

    • Remove the bail (rely on ? operator)
    • Add a comment explaining why you're being defensive
    • Use unreachable!() macro instead
  3. Missing Error Context in Glob Import Check
    The CLAUDE.md update is good, but there are still 5 files using use anyhow::* in the codebase. While not part of this PR's scope, it might be worth addressing in a follow-up.

  4. Test Coverage
    The test in guard-core/tests/custom_serve.rs:106 uses todo!() for handle_websocket_hibernation. While this is acceptable for tests that don't exercise this path, consider adding a comment explaining when this needs proper implementation.

🔍 Code Style & Conventions

Good adherence to repository conventions:

  • ✅ Uses structured logging correctly (e.g., tracing::debug!(?err, ...) not tracing::debug!("foo {err}"))
  • ✅ Proper error handling with anyhow Result types
  • ✅ Imports follow the pattern (individual imports, not glob)
  • ✅ Lowercase log messages
  • ✅ Good variable naming

Minor observations:

  • Consider adding a workspace-level lint to prevent use anyhow::* in the future
  • The refactoring in proxy_service.rs renaming handlers → handler improves clarity

🎯 Performance Considerations

  1. Lock Holding: In hibernate_ws(), the mutex guard is held throughout the entire peek/next loop. This is appropriate since we need exclusive access, but worth noting that this will block any concurrent attempts to read from the WebSocket.

  2. Polling Strategy: The function loops with peek() which could be CPU-intensive if called frequently. However, since this only runs during hibernation (a relatively rare state), this should be fine.

🔒 Security Considerations

No security concerns identified. The hibernation mechanism properly:

  • Handles close frames during hibernation
  • Validates message types before processing
  • Uses proper error propagation

✨ Suggestions for Future Enhancements

  1. Add metrics/telemetry for hibernation events (hibernate start/end, duration, etc.)
  2. Consider timeout for hibernation to prevent indefinite waiting
  3. Document the message ordering guarantees during hibernation handoff

📊 Test Coverage Assessment

The PR modifies critical WebSocket handling logic but doesn't add new tests. Consider adding:

  • Test for hibernation → wake cycle
  • Test for client disconnection during hibernation
  • Test for messages arriving during hibernation

Summary

This is a solid implementation of an important feature. The code is clean, well-structured, and follows repository conventions. The main issues are minor (grammar, unreachable code comment) and easily addressed.

Recommendation: Approve with minor suggested fixes.

Great work! 🚀

@claude

claude bot commented Nov 6, 2025

PR Review: WebSocket Hibernation Implementation

Thank you for implementing WebSocket hibernation! This is a well-structured feature that enables actors to sleep while maintaining WebSocket connections. Here's my detailed review:

✅ Strengths

  1. Clean Error Handling Refactor: The renaming from WebSocketServiceRetry to WebSocketServiceHibernate better reflects the actual behavior and makes the intent clearer.

  2. Smart Design: The hibernation logic properly handles the lifecycle:

    • Uses Peekable stream to inspect messages without consuming them
    • Subscribes to actor Ready events to wake up when the backend is available
    • Properly distinguishes between messages that should wake the connection vs. ignore (ping/pong)
  3. TypeScript Integration: The typo fix (canHibernatWebSocket → canHibernateWebSocket) and proper implementation in the actor instance show good attention to detail.

  4. Code Style Compliance: Changes follow the CLAUDE.md guidelines (structured logging, lowercase log messages, etc.)

🔍 Issues & Concerns

Critical: anyhow Glob Imports Still Present

While CLAUDE.md was updated to prohibit glob imports from anyhow, several files still use use anyhow::*;:

// Found in these files:
engine/packages/guard-core/src/websocket_handle.rs:1
engine/packages/guard-core/src/custom_serve.rs:1  
engine/packages/guard-core/src/server.rs:10

Recommendation: Update these to explicit imports like use anyhow::{Context, Result, bail}; to match the updated style guide and the good example in proxy_service.rs:1.

Potential Race Condition in hibernate_ws

In pegboard-gateway/src/lib.rs:592-614, the hibernation loop peeks at messages and consumes non-data messages:

_ => {
    pinned.try_next().await?;  // Line 607
}

Issue: If the stream is Peekable, calling try_next() consumes the peeked message. However, there's a subtle race: if a Binary/Text message arrives between the peek() check and when we return to the main handler, we might miss it or process it twice.

Recommendation: Consider whether the logic should:

  1. Always consume the peeked message after inspecting it, OR
  2. Document clearly that the main handler will see the same message we peeked at

Missing Test Coverage

The test in guard-core/tests/custom_serve.rs:106 implements the new trait method with todo!():

async fn handle_websocket_hibernation(&self, _websocket: WebSocketHandle) -> Result<bool> {
    todo!();
}

Issue: This will panic if the test path exercises hibernation.

Recommendation: Either:

  1. Implement a proper test stub (return Ok(true) or Ok(false))
  2. Add integration tests that verify the hibernation behavior end-to-end
  3. At minimum, return a reasonable default like Ok(true) with a comment
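
A minimal non-panicking stub along those lines (types are stand-ins; in this revision the trait returns Result<bool>, where true means the websocket should close):

use anyhow::Result;

struct WebSocketHandle; // stand-in for guard-core's handle type

struct TestServe;

impl TestServe {
	// The test path never exercises hibernation; if it ever does, close
	// cleanly instead of panicking via todo!().
	async fn handle_websocket_hibernation(&self, _websocket: WebSocketHandle) -> Result<bool> {
		Ok(true)
	}
}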

TypeScript TODO Comment

In rivetkit-typescript/packages/rivetkit/src/actor/instance.ts:2065, there's a stale TODO:

// TODO: When WS hibernation is ready, update this to only count non-hibernatable websockets

Recommendation: Since hibernation is now implemented (line 2072 checks !conn.isHibernatable), this TODO should be removed or updated to reflect that hibernation IS ready but raw WebSockets still need work.

🎯 Minor Suggestions

  1. Documentation: Consider adding inline comments in handle_websocket_hibernation explaining the lifecycle:

    • What triggers hibernation end (actor ready OR client message)
    • What happens when we return false vs true
    • The comment at proxy_service.rs:1908-1912 is good but could be referenced from the trait definition
  2. Logging Consistency: In pegboard-gateway/src/lib.rs:563, the log uses "has become ready" but line 570 uses "received message during hibernation, reconnecting". Consider making both explicitly mention "hibernation" for easier log filtering:

    tracing::debug!("actor became ready during hibernation, reconnecting");
  3. Error Context: The hibernate_ws function could benefit from more context on errors:

    pinned.try_next().await.context("failed to consume non-data websocket message during hibernation")?;

🔒 Security & Performance

  • No security concerns identified: The hibernation doesn't introduce auth bypasses or data leaks
  • Performance: Using Peekable adds minimal overhead and the hibernation pattern should significantly reduce actor resource usage
  • Resource Management: Properly handles cleanup on close and timeout scenarios

📊 Overall Assessment

This is a solid implementation of a complex feature. The core logic is sound and the integration across Rust and TypeScript layers is well-executed. The main concerns are:

  1. Must fix: Remove glob imports from anyhow per updated style guide
  2. Should fix: Address the todo!() in tests and verify peek/consume behavior
  3. Nice to have: Update stale TODOs and improve documentation

The feature appears production-ready once the anyhow imports are cleaned up and test coverage is verified.

Recommendation: Approve with minor changes requested (fix anyhow imports and test stub).

@claude

claude bot commented Nov 6, 2025

Code Review: WebSocket Hibernation Feature

This PR implements WebSocket hibernation to allow actors to sleep while maintaining client connections. Overall, the implementation is solid with good attention to detail. Here are my findings:

✅ Strengths

  1. Well-structured feature: The hibernation mechanism cleanly separates concerns between detecting hibernation, handling it, and reconnecting.

  2. Good error handling: Proper use of custom error types (WebSocketServiceHibernate) following the project's error conventions in engine/packages/guard-core/src/errors.rs.

  3. Smart use of Peekable: The switch to Peekable<WebSocketReceiver> in engine/packages/guard-core/src/websocket_handle.rs:11-12 allows peeking at messages without consuming them - crucial for hibernation.

  4. Proper lifecycle management: The handle_websocket_hibernation trait method returns bool to indicate if the connection should close, giving implementations full control.

  5. Documentation: Good inline comments explaining the hibernation flow in proxy_service.rs:1908-1912.

🐛 Potential Issues

1. Missing Mutex Lock Semantics (High Priority)

In pegboard-gateway/src/lib.rs:592-614, the hibernate_ws function:

async fn hibernate_ws(ws_rx: Arc<Mutex<WebSocketReceiver>>) -> Result<HibernationResult> {
    let mut guard = ws_rx.lock().await;
    let mut pinned = std::pin::Pin::new(&mut *guard);
    
    loop {
        if let Some(msg) = pinned.as_mut().peek().await {
            // ...

Issue: The mutex lock is held for the entire duration of hibernation (potentially indefinitely). If handle_websocket_hibernation is called again with the same WebSocketHandle, it will deadlock.

Fix: Consider releasing the lock between peek operations or documenting that concurrent calls are not supported.

2. Incomplete Test Coverage (High Priority)

The test file engine/packages/guard-core/tests/custom_serve.rs:105-107 has:

async fn handle_websocket_hibernation(&self, _websocket: WebSocketHandle) -> Result<bool> {
    todo!();
}

Issue: The test implementation uses todo!() which will panic if called.

Recommendation: Either implement a proper test or mark the test as #[should_panic] with documentation explaining this is not testable in this context.

3. Race Condition Potential (Medium Priority)

In pegboard-gateway/src/lib.rs:555-584, the handle_websocket_hibernation uses tokio::select!:

let close = tokio::select! {
    _ = ready_sub.next() => { false }
    hibernation_res = hibernate_ws(client_ws.recv()) => { /* ... */ }
};

Issue: If the actor becomes ready while a message is being peeked, there's a small window where the message could be lost or processed out of order.

Recommendation: Consider whether message ordering guarantees are maintained when waking from hibernation.

4. Attempt Counter Reset Logic (Low Priority)

In proxy_service.rs:1901-1905:

if ws_hibernate {
    attempts = 0;
} else {
    attempts += 1;
}

Observation: Hibernation resets the attempt counter. This is probably correct, but ensure this doesn't allow infinite retries if hibernation repeatedly fails.

🔍 Code Quality

  1. Follows project conventions: ✅

    • Uses structured logging: tracing::debug!(?err, "...")
    • Proper error handling with anyhow::Result
    • Follows naming conventions (snake_case, descriptive names)
  2. Import management: ✅

    • CLAUDE.md update correctly changes from glob import to explicit imports
    • Example in api_public.rs:3: use anyhow::{Context, Result};
  3. Variable naming: ✅

    • Changed handlers → handler (singular) in proxy_service.rs:1831 for clarity

🔒 Security Considerations

  1. No obvious security issues - hibernation doesn't expose sensitive data
  2. Proper close frame handling - connection cleanup appears correct
  3. No DoS vector - hibernation is opt-in via canHibernateWebSocket

⚡ Performance Considerations

  1. Mutex contention: The long-held lock in hibernate_ws could impact performance under high connection counts
  2. Memory usage: Hibernated connections maintain state - ensure this is bounded
  3. Subscription overhead: Each hibernation creates a new subscription (pegboard-gateway/src/lib.rs:556-559) - consider pooling

📝 Minor Suggestions

  1. Typo fix already included: ✅ canHibernatWebSocket → canHibernateWebSocket (good catch!)

  2. Consider adding metrics: Track hibernation events, wake-ups, and duration for observability

  3. Documentation: Add a doc comment to handle_websocket_hibernation explaining the return value semantics more clearly

🧪 Test Coverage

Missing:

  • Unit tests for hibernate_ws function
  • Integration tests for the full hibernation cycle
  • Tests for race conditions (actor ready vs message arrival)
  • Tests for error cases (hibernation timeout, connection drops during hibernation)

Recommendation: Add at least one integration test demonstrating the hibernation flow works end-to-end.

Summary

This is a well-implemented feature that follows the codebase conventions. The main concerns are:

  1. Fix the mutex lock semantics in hibernate_ws to prevent deadlocks
  2. Complete the test implementation or document why it's incomplete
  3. Add integration tests for the hibernation feature
  4. Consider the race condition between actor ready events and message arrival

The code is production-ready after addressing the high-priority issues above.

export const chatRoom = actor({
	options: {
		sleepTimeout: 2000,
		canHibernateWebSocket: true,
Member

not needed – this is for onWebSocket

Contributor Author

@MasterPtato MasterPtato Nov 7, 2025

I removed this line, but I still don't know what you mean.


false
}
hibernation_res = hibernate_ws(client_ws.recv()) => {
Member

won't this consume the message? or does this get written to the buffer + replayed? either way, needs a comment on what happens to the message.

Contributor Author

None of the logic related to websocket messages is present here, and the hibernate_ws fn uses .peek(), which implies it's not consuming the message.


if attempts > max_attempts
|| (!is_retryable_ws_error(&err) && !ws_retry)
if ws_hibernate {
Member

how do we handle failure cases if we fail to re-wake the actor when the client disconnects?

Contributor Author

We don't do anything if the client disconnects. When we add waking, if it fails to wake, the only thing we can do is log the error.

@claude

claude bot commented Nov 13, 2025

PR Review: WebSocket Hibernation Feature

This PR introduces WebSocket hibernation support, allowing WebSocket connections to "sleep" when the downstream service is temporarily unavailable and wake up when it becomes available again. Overall, this is a well-implemented feature with good test coverage. Here's my detailed review:


✅ Strengths

1. Clean Architecture

  • The HibernationResult enum provides a clear abstraction for hibernation outcomes
  • The trait-based design (CustomServeTrait) allows for extensible hibernation behavior
  • Good separation of concerns between the proxy service and hibernation logic

2. Comprehensive Test Coverage

  • New test test_custom_serve_websocket_hibernation validates the hibernation flow
  • Tests verify the complete cycle: hibernate → send message → wake up → receive response
  • Good use of timeouts to prevent hanging tests

3. Error Handling

  • Renamed error from WebSocketServiceRetry → WebSocketServiceHibernate for clarity
  • Appropriate use of ensure! macro for invariant checking in engine/packages/guard-core/src/proxy_service.rs:1914

4. Documentation & Code Quality

  • Good inline comments explaining hibernation behavior
  • Proper use of structured logging (following CLAUDE.md guidelines)
  • TypeScript typo fix (canHibernatWebSocket → canHibernateWebSocket) improves API consistency

🔍 Issues & Concerns

Critical: Potential Message Loss During Hibernation

Location: engine/packages/pegboard-gateway/src/lib.rs:590-612

The hibernate_ws function uses peek() to check for incoming messages, but there's a subtle issue:

loop {
    if let Some(msg) = pinned.as_mut().peek().await {
        match msg {
            Ok(Message::Binary(_)) | Ok(Message::Text(_)) => {
                return Ok(HibernationResult::Continue);
            }
            Ok(Message::Close(_)) => return Ok(HibernationResult::Close),
            // Ignore rest
            _ => {
                pinned.try_next().await?;  // ⚠️ This consumes the message!
            }
        }
    } else {
        return Ok(HibernationResult::Close);
    }
}

Problem: When ignoring non-text/binary messages (like Ping, Pong), the code calls try_next() to consume them. However, after hibernation ends, these messages won't be available for the main WebSocket handler. While ping/pong frames are typically handled at the protocol level, this could cause issues if the application expects to see these frames.

Recommendation: Consider documenting this behavior or ensuring that consuming control frames during hibernation is intentional and safe.


High: Race Condition in Retry Logic

Location: engine/packages/guard-core/src/proxy_service.rs:1974-1988

After hibernation or errors, the code re-resolves the route:

// Retry route resolution
match state
    .resolve_route(
        &req_host,
        &req_path,
        &req_method,
        state.port_type.clone(),
        &req_headers,
        true,
    )
    .await
{
    Ok(ResolveRouteOutput::CustomServe(new_handler)) => {
        handler = new_handler;
        continue;
    }
    // ... error cases
}

Problem: The route resolution happens after the backoff delay (for regular retries) or after hibernation. If the route changes between the hibernation call and route re-resolution, the WebSocket might reconnect to a different handler instance.

Impact: While this might be intentional for hot-reloading scenarios, it could lead to unexpected behavior if handler state is important.

Recommendation: Add a comment explaining whether handler replacement is intentional or if handler identity should be preserved across hibernations.


Medium: Unbounded Hibernation Loop

Location: engine/packages/pegboard-gateway/src/lib.rs:590-612

The hibernate_ws function contains a loop that could theoretically run indefinitely if the stream keeps producing ignorable messages (ping/pong/etc.).

Recommendation: Consider adding a limit to iterations or a timeout to prevent infinite loops.
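
A hedged sketch of one such bound, mirroring the quoted loop; MAX_IGNORED_FRAMES is an illustrative constant, not something this PR defines:

const MAX_IGNORED_FRAMES: usize = 1024;

let mut ignored = 0usize;
loop {
    if let Some(msg) = pinned.as_mut().peek().await {
        match msg {
            Ok(Message::Binary(_)) | Ok(Message::Text(_)) => {
                return Ok(HibernationResult::Continue);
            }
            Ok(Message::Close(_)) => return Ok(HibernationResult::Close),
            _ => {
                ignored += 1;
                // Refuse to spin indefinitely on a client flooding control frames.
                ensure!(ignored <= MAX_IGNORED_FRAMES, "too many ignored frames during hibernation");
                pinned.try_next().await?;
            }
        }
    } else {
        return Ok(HibernationResult::Close);
    }
}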


Medium: Unclear Invariant Check

Location: engine/packages/guard-core/src/proxy_service.rs:1914-1916

ensure!(
    !ws_hibernation_close,
    "should not be hibernating again after receiving a close frame during hibernation"
);

Question: The comment states this "should be unreachable," but the code uses ensure! which suggests it's a runtime check. If it's truly unreachable, consider using debug_assert! or unreachable!() instead.

Recommendation: Clarify whether this is a defensive check or an actual invariant that could be violated under certain race conditions.


Low: Missing Error Context

Location: engine/packages/guard-core/src/proxy_service.rs:1924-1926

let res = handler
    .handle_websocket_hibernation(ws_handle.clone())
    .await?;

Per CLAUDE.md guidelines, errors should be annotated with context. Consider:

let res = handler
    .handle_websocket_hibernation(ws_handle.clone())
    .await
    .context("failed during websocket hibernation")?;

Low: Default Implementation Could Be More Descriptive

Location: engine/packages/guard-core/src/custom_serve.rs:41-44

The default implementation returns a generic error:

async fn handle_websocket_hibernation(
    &self,
    _websocket: WebSocketHandle,
) -> Result<HibernationResult> {
    bail!("service does not support websocket hibernation");
}

Recommendation: Consider using a structured error from the error system for better error handling:

Err(errors::WebSocketHibernationNotSupported.build())

This would provide more structured error information to clients.


🎯 Performance Considerations

  1. Peekable Stream Overhead: Converting to Peekable adds minimal overhead, but it's worth noting for high-throughput scenarios
  2. Mutex Contention: The Arc<Mutex<WebSocketReceiver>> could become a bottleneck under high message rates, though this is likely acceptable for typical WebSocket usage
  3. Route Re-resolution Cost: Re-resolving routes after hibernation adds latency. Consider caching route information if this becomes a bottleneck

🔒 Security Considerations

  1. Resource Exhaustion: No limits on hibernation duration in the test implementation. Production code should have timeouts to prevent resource exhaustion
    • The PegboardGateway implementation handles this better by racing with actor ready events
  2. Close Frame Handling: The handling of close frames during hibernation is safe and prevents unnecessary reconnection attempts

📝 Nitpicks & Style

  1. Import Style Compliance: ✅ Good adherence to CLAUDE.md - no glob imports from anyhow (in fact, this PR fixes that in CLAUDE.md itself!)
  2. Logging Style: ✅ Proper lowercase log messages with structured fields
  3. Error Messages: ✅ Clear and descriptive
  4. Test Organization: ✅ Well-structured with good naming

📊 Test Coverage Assessment

Strengths:

  • Core hibernation flow is tested
  • Message delivery after hibernation is validated
  • Multiple concurrent connections are tested

Gaps:

  • No test for hibernation timeout behavior
  • No test for close frame during hibernation
  • No test for error cases during route re-resolution
  • Missing test for the ws_hibernation_close flag logic

Recommendation: Add tests for:

#[tokio::test]
async fn test_websocket_hibernation_with_close_during_hibernation() { /* ... */ }

#[tokio::test]
async fn test_websocket_hibernation_timeout() { /* ... */ }

#[tokio::test]
async fn test_websocket_route_changes_during_hibernation() { /* ... */ }

🎬 Conclusion

This is a solid implementation of WebSocket hibernation with good architectural design. The main concerns are around edge cases (message loss, race conditions) and missing test coverage for error scenarios. I'd recommend:

  1. Before merge: Address the message consumption behavior in hibernate_ws and add clarifying comments
  2. Before merge: Add error context to the hibernation call
  3. Post-merge (nice to have): Add tests for edge cases mentioned above
  4. Post-merge (nice to have): Consider adding hibernation duration metrics for observability

Overall: Approve with minor changes requested

Great work on this feature! The hibernation mechanism will significantly improve connection resilience for temporarily unavailable services.

@claude

claude bot commented Nov 13, 2025

PR Review: WebSocket Hibernation Feature

Thanks for this PR! This is a well-structured implementation of WebSocket hibernation. I've reviewed the changes and have the following feedback:

✅ Strengths

  1. Clean Architecture: The abstraction through CustomServeTrait with handle_websocket_hibernation method is well-designed and maintainable.

  2. Proper Error Handling: The transition from WebSocketServiceRetry to WebSocketServiceHibernate makes the intent clearer and the error handling follows the custom error system properly.

  3. Good Test Coverage: The test in custom_serve.rs demonstrates the hibernation flow effectively, including timeout handling.

  4. Documentation Fix: Good catch on updating CLAUDE.md to avoid glob imports from anyhow - this aligns with best practices.

  5. TypeScript Fix: The typo fix canHibernatWebSocket → canHibernateWebSocket is important for API consistency.

🔍 Potential Issues

1. Critical: Mutex Deadlock Risk in hibernate_ws (packages/pegboard-gateway/src/lib.rs:590)

The hibernate_ws function holds a lock on the WebSocket receiver for the entire hibernation duration:

async fn hibernate_ws(ws_rx: Arc<Mutex<WebSocketReceiver>>) -> Result<HibernationResult> {
    let mut guard = ws_rx.lock().await;  // Lock held for entire hibernation
    let mut pinned = std::pin::Pin::new(&mut *guard);
    
    loop {
        if let Some(msg) = pinned.as_mut().peek().await {
            // ...
        }
    }
}

Issue: If any other code path tries to access ws_rx during hibernation, it will block indefinitely. While this may be intentional, it should be explicitly documented.

Recommendation: Add a comment explaining the locking strategy, or consider if there's a safer pattern.
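
If the long-held lock is intentional, a doc comment on the function would make that contract explicit; the wording here is illustrative:

/// Holds the `ws_rx` lock for the entire hibernation period. Callers must
/// not read from the receiver concurrently: a second call with the same
/// handle will block (and could deadlock) until hibernation ends.
async fn hibernate_ws(ws_rx: Arc<Mutex<WebSocketReceiver>>) -> Result<HibernationResult> {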

2. Logic Issue: ws_hibernation_close flag unused after setting (packages/guard-core/src/proxy_service.rs:1843, 1931)

The flag ws_hibernation_close is set to true when receiving a close frame during hibernation:

if let HibernationResult::Close = res {
    tracing::debug!("starting hibernating websocket close");
    ws_hibernation_close = true;
}

However, the comment says "we are going to reconnect to the actor so that it knows the connection has closed," but there's no code that actually uses this flag or treats the reconnection differently. The loop just continues normally.

Questions:

  • Should the handler be notified that this is a "close" reconnection?
  • Should a close frame be sent to the downstream after reconnection?
  • Is this TODO/future work?

3. Potential Infinite Loop in hibernate_ws (packages/pegboard-gateway/src/lib.rs:594)

The loop continues when receiving non-text/binary/close messages:

loop {
    if let Some(msg) = pinned.as_mut().peek().await {
        match msg {
            Ok(Message::Binary(_)) | Ok(Message::Text(_)) => return Ok(HibernationResult::Continue),
            Ok(Message::Close(_)) => return Ok(HibernationResult::Close),
            _ => {
                pinned.try_next().await?;  // Consume message
            }
        }
    } else {
        return Ok(HibernationResult::Close);
    }
}

Issue: If a client continuously sends Ping/Pong/Frame messages, the loop could spin consuming them. While unlikely to be a real issue, consider:

  • Adding a counter to limit iterations
  • Logging when ignoring messages
  • Or documenting this is expected behavior

4. Typo in Error Check Comment (packages/guard-core/src/proxy_service.rs:1922)

Comment says "should not be hibernating again after receiving a close frame during hibernation", but the check uses ensure!, which raises a runtime error if this invariant is violated in production builds.

Recommendation: If this is truly unreachable, consider using unreachable!() or debug_assert! instead. If it's a defensive check, the error handling should be more graceful.

🎯 Performance Considerations

  1. Peekable Stream Overhead: Wrapping the WebSocket receiver in Peekable adds a small buffering overhead, but this is acceptable for the hibernation use case.

  2. Lock Contention: The mutex on ws_rx during hibernation could impact performance if there are concurrent access patterns, but given the design this seems intentional.

🔒 Security Considerations

  1. No Obvious Security Issues: The hibernation mechanism doesn't introduce new attack vectors.

  2. Resource Exhaustion: The HIBERNATION_TIMEOUT in tests is 2 seconds, but I don't see a timeout configured in production code for handle_websocket_hibernation. Consider adding a maximum hibernation duration to prevent resource exhaustion attacks.

📋 Test Coverage

The test coverage is good:

  • ✅ Basic hibernation flow tested
  • ✅ Message handling during and after hibernation
  • ✅ Timeout handling

Missing test cases:

  • ❌ What happens when actor never becomes ready during hibernation?
  • ❌ Close frame received during hibernation
  • ❌ Multiple rapid hibernation cycles
  • ❌ Error handling when subscription fails

📝 Style & Convention Notes

  1. Follows CLAUDE.md conventions: ✅ Hard tabs, proper imports, structured logging
  2. Error naming: ✅ Uses proper past-tense naming conventions
  3. Logging: ✅ Uses structured logging with tracing

🚀 Suggestions

  1. Add inline documentation for the hibernation flow in proxy_service.rs explaining the state machine.
  2. Consider extracting the retry/hibernation logic into a separate function for clarity.
  3. Add documentation to HibernationResult enum explaining when each variant should be used.

📊 Summary

This is a solid implementation with good architecture and test coverage. The main concerns are:

  • Critical: Clarify the mutex locking strategy in hibernate_ws
  • Important: Implement or document the intended behavior of ws_hibernation_close flag
  • Nice-to-have: Add production timeout for hibernation and additional edge case tests

Overall: Approve with suggestions - The code is production-ready with minor improvements needed.


Review generated by Claude Code

@claude

claude bot commented Nov 13, 2025

PR Review: WebSocket Hibernation Feature

This PR implements WebSocket hibernation functionality for the gateway, allowing WebSocket connections to "pause" when the downstream actor is unavailable and resume when it becomes ready again. Overall, this is a solid implementation with good test coverage. Below are my findings:

✅ Strengths

  1. Clean Architecture: The separation of concerns between CustomServeTrait::handle_websocket_hibernation() and the hibernation logic in pegboard-gateway is well-designed.

  2. Good Test Coverage: The test in engine/packages/guard-core/tests/custom_serve.rs:260-320 validates the hibernation flow end-to-end.

  3. Proper Error Handling: Using a custom error type (WebSocketServiceHibernate) to signal hibernation is a good pattern - it's type-safe and distinguishable from actual errors.

  4. Documentation: The code includes helpful comments explaining the hibernation flow (e.g., lines 1911-1923 in proxy_service.rs).

  5. Backwards Compatibility: Default implementations in the trait ensure non-hibernating services continue to work without changes.

🔍 Code Quality Issues

1. Missing Variable Usage Check (Minor)

Location: engine/packages/guard-core/src/proxy_service.rs:1843

The ws_hibernation_close variable is set but its behavior during the retry loop needs clarification:

let mut ws_hibernation_close = false;
// ... later ...
if let HibernationResult::Close = res {
    tracing::debug!("starting hibernating websocket close");
    ws_hibernation_close = true;
}

After setting ws_hibernation_close = true, the code continues to retry route resolution and reconnect. The comment says "we are going to reconnect to the actor so that it knows the connection has closed" (line 1928-1930), but it's not clear:

  • Does the next handle_websocket call properly handle this case?
  • Should there be a check to prevent further retries after this flag is set?
  • Is the ensure check at line 1914-1917 sufficient?

Recommendation: Add a comment or validation to clarify the intended behavior when ws_hibernation_close = true.

2. Potential Race Condition in Peek/Next Pattern (Medium)

Location: engine/packages/pegboard-gateway/src/lib.rs:590-606

async fn hibernate_ws(ws_rx: Arc<Mutex<WebSocketReceiver>>) -> Result<HibernationResult> {
    let mut guard = ws_rx.lock().await;
    let mut pinned = std::pin::Pin::new(&mut *guard);

    loop {
        if let Some(msg) = pinned.as_mut().peek().await {
            match msg {
                Ok(Message::Binary(_)) | Ok(Message::Text(_)) => {
                    return Ok(HibernationResult::Continue);
                }
                Ok(Message::Close(_)) => return Ok(HibernationResult::Close),
                _ => {
                    pinned.try_next().await?;  // ⚠️ Here
                }
            }
        } else {
            return Ok(HibernationResult::Close);
        }
    }
}

The function uses peek() to check messages without consuming them, which is correct for Binary/Text/Close messages. However, for other message types (Ping, Pong, Frame), it calls try_next() to consume them.

Issue: After this function returns HibernationResult::Continue, the message is still on the stream (because we only peeked). But the ws_rx mutex is dropped when this function returns. If another part of the code tries to consume the message concurrently, there could be ordering issues.

Question: Is the WebSocketReceiver guaranteed to be accessed only sequentially? The Arc<Mutex<>> suggests potential concurrent access.

Recommendation: Add documentation about the locking guarantees, or consider consuming the message before returning Continue.

3. Documentation Comment Accuracy (Minor)

Location: engine/packages/guard-core/src/custom_serve.rs:41

/// Returns true if the websocket should close.
async fn handle_websocket_hibernation(

The comment says "Returns true if the websocket should close" but the function returns Result<HibernationResult>, not a boolean.

Fix: Update to: /// Returns HibernationResult indicating whether to continue or close the websocket.

4. Typo in TypeScript (Minor - already fixed)

Location: rivetkit-typescript/packages/rivetkit/src/actor/config.ts:81

Good catch fixing the typo from canHibernatWebSocket to canHibernateWebSocket.

🔒 Security Considerations

  1. Resource Exhaustion: During hibernation, the WebSocket connection remains open while waiting for the actor to become ready. If many connections hibernate simultaneously, this could consume server resources.

    • Mitigation: Consider adding a maximum hibernation timeout or a limit on concurrent hibernating connections (doesn't appear to be implemented).
  2. Message Ordering: The peek-based approach ensures messages aren't lost during hibernation, which is good for data integrity.

⚡ Performance Considerations

  1. Lock Contention: hibernate_ws() holds the ws_rx mutex for the entire hibernation period. If hibernation takes a long time, this could block other operations on the WebSocket.

    • Current Impact: Likely acceptable since each WebSocket has its own receiver, but worth monitoring.
  2. Subscription Overhead: Each hibernation creates a new subscription to the actor Ready event (line 559-562). Ensure these subscriptions are properly cleaned up.

🧪 Testing Recommendations

The current test coverage is good but could be enhanced:

  1. Add test for Close during hibernation: Verify that HibernationResult::Close is properly handled when a client closes during hibernation.

  2. Add test for hibernation timeout: What happens if the actor never becomes ready?

  3. Add test for rapid hibernation cycles: Send "hibernate" message multiple times in succession to verify the ensure check at line 1914 works correctly.

  4. Add test for non-text message types: Verify Ping/Pong/Frame messages are properly consumed during hibernation.

📝 Minor Style Issues

  1. Log message capitalization: Following CLAUDE.md conventions, log messages should be lowercase.

    • Line 566: "actor became ready during hibernation" ✅ Good
    • Line 575: "received message during hibernation" ✅ Good
    • Line 578: "websocket stream closed during hibernation" ✅ Good
  2. CLAUDE.md compliance: The PR updates CLAUDE.md to discourage glob imports from anyhow, and the code correctly uses use anyhow::{Result, bail, Context, ensure}. ✅ Good

🎯 Final Recommendation

Verdict: Approve with minor suggestions

This is a well-implemented feature that adds valuable functionality. The main concerns are:

  1. Clarify the ws_hibernation_close behavior
  2. Document or validate the peek/consume pattern in hibernate_ws()
  3. Fix the doc comment for handle_websocket_hibernation()
  4. Consider adding hibernation timeout limits for production safety

The issues identified are relatively minor and don't block merging, but addressing them would improve robustness and maintainability.

Great work on the hibernation feature! 🎉

@claude

claude bot commented Nov 13, 2025

PR Review: WebSocket Hibernation Feature

Overview

This PR implements WebSocket hibernation for the Pegboard Gateway, allowing WebSocket connections to persist during actor unavailability. The implementation is well-architected with good separation of concerns.

Code Quality & Architecture

✅ Strengths

  1. Clean Architecture: The hibernation logic is properly abstracted through the CustomServeTrait with the new handle_websocket_hibernation method and HibernationResult enum.

  2. Error Handling: Proper error type renaming from WebSocketServiceRetry to WebSocketServiceHibernate better reflects the semantic meaning of the operation.

  3. Good Separation: The hibernate_ws function (pegboard-gateway/src/lib.rs:590-612) properly separates concerns using peek() to avoid consuming messages prematurely.

  4. Test Coverage: The test in guard-core/tests/custom_serve.rs demonstrates the hibernation flow with proper timeout handling.

  5. TypeScript Consistency: Typo fix in RivetKit (canHibernatWebSocket → canHibernateWebSocket) ensures consistency.

🔍 Potential Issues & Recommendations

1. Race Condition in Hibernation Check (High Priority)

Location: proxy_service.rs:1914-1917

ensure!(
    !ws_hibernation_close,
    "should not be hibernating again after receiving a close frame during hibernation"
);

Issue: While the code has a safeguard, the logic flow could still allow edge cases where:

  • Actor becomes ready
  • Client sends close frame
  • Actor hibernates again before close is processed

Recommendation: Consider adding state machine documentation or more explicit state transitions to prevent unexpected hibernation states.

2. Mutex Lock Duration (Medium Priority)

Location: pegboard-gateway/src/lib.rs:590-612

async fn hibernate_ws(ws_rx: Arc<Mutex<WebSocketReceiver>>) -> Result<HibernationResult> {
    let mut guard = ws_rx.lock().await;
    let mut pinned = std::pin::Pin::new(&mut *guard);
    
    loop {
        if let Some(msg) = pinned.as_mut().peek().await {
            // ... holding lock during peek()

Issue: The mutex lock is held for the entire duration of peek().await, which could block other operations trying to access the receiver. However, since this is during hibernation and the WebSocket is in a special state, this may be intentional.

Question: Is there a scenario where another task needs to access ws_rx during hibernation? If so, consider restructuring to minimize lock duration.

3. Subscription Leak Potential (Low Priority)

Location: pegboard-gateway/src/lib.rs:559-562

let mut ready_sub = self
    .ctx
    .subscribe::<pegboard::workflows::actor::Ready>(("actor_id", self.actor_id))
    .await?;

Issue: If the subscription isn't properly cleaned up in error paths, it could lead to resource leaks.

Recommendation: Verify that the subscription is properly dropped in all code paths, including error cases.

4. Test Coverage Gaps (Medium Priority)

The current test covers the happy path but could benefit from additional scenarios:

  • Client closes connection during hibernation (partially covered)
  • Multiple hibernation cycles
  • Actor never becomes ready (timeout scenarios)
  • Concurrent hibernation attempts
  • Error handling in handle_websocket_hibernation

Recommendation: Add edge case tests for robustness.

5. WebSocket Message Processing in Loop (Low Priority)

Location: pegboard-gateway/src/lib.rs:594-611

loop {
    if let Some(msg) = pinned.as_mut().peek().await {
        match msg {
            Ok(Message::Binary(_)) | Ok(Message::Text(_)) => {
                return Ok(HibernationResult::Continue);
            }
            Ok(Message::Close(_)) => return Ok(HibernationResult::Close),
            // Ignore rest
            _ => {
                pinned.try_next().await?;
            }
        }
    } else {
        return Ok(HibernationResult::Close);
    }
}

Issue: Ping/Pong frames are consumed and discarded during hibernation. This is likely intentional but worth documenting.

Recommendation: Add a comment explaining why Ping/Pong frames are ignored during hibernation.
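
A sketch of what that documentation could look like, restating the PR's loop over a generic message stream; the enum, bounds, and function name are assumptions made so the example stands alone, and it assumes the WebSocket layer auto-answers pings (as tokio-tungstenite does while the stream is polled):

use anyhow::Result;
use futures_util::stream::{Stream, StreamExt, TryStreamExt};
use tokio_tungstenite::tungstenite::{Error as WsError, Message};

enum HibernationResult {
    Continue,
    Close,
}

async fn hibernate_sketch<S>(rx: S) -> Result<HibernationResult>
where
    S: Stream<Item = Result<Message, WsError>> + Unpin,
{
    let mut rx = rx.peekable();
    let mut pinned = std::pin::Pin::new(&mut rx);
    loop {
        if let Some(msg) = pinned.as_mut().peek().await {
            match msg {
                // Data frames end hibernation; leave them in the stream so
                // the resumed handler, not this loop, consumes them.
                Ok(Message::Binary(_)) | Ok(Message::Text(_)) => {
                    return Ok(HibernationResult::Continue);
                }
                Ok(Message::Close(_)) => return Ok(HibernationResult::Close),
                // Control frames (Ping/Pong/Frame) carry nothing for the
                // actor and pings are answered at the protocol layer, so
                // consume and discard them to let peek() make progress.
                _ => {
                    pinned.try_next().await?;
                }
            }
        } else {
            // Stream ended without a close frame; treat it as closed.
            return Ok(HibernationResult::Close);
        }
    }
}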

Security Considerations

✅ Good Practices

  • No obvious injection vectors introduced
  • Proper error boundaries maintained
  • Close frames handled securely

⚠️ Considerations

  • DoS Potential: An attacker could potentially trigger repeated hibernation cycles. Verify that rate limiting or actor lifecycle management prevents abuse.
  • Resource Management: Ensure hibernation doesn't allow indefinite resource holding. The test uses a 2-second timeout, but verify production has appropriate limits.

Performance Considerations

Positive

  • Reduces connection churn by maintaining WebSocket connections during brief actor unavailability
  • Proper use of peek() avoids unnecessary message deserialization

Concerns

  • Holding mutex locks during async operations (mentioned above)
  • Subscription overhead for every hibernating connection

Documentation

Missing Documentation

  1. The HibernationResult enum could benefit from doc comments explaining when to use Continue vs Close (see the sketch after this list)
  2. The comment at proxy_service.rs:1911-1913 mentions "unreachable" but doesn't explain why the ensure! is still needed
  3. The hibernation flow in the main loop (proxy_service.rs:1910-1935) would benefit from a high-level comment explaining the state machine
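
For the first point, the doc comments might read as follows (a sketch; the variant meanings are inferred from how the PR uses them):

/// Outcome of a hibernation period.
enum HibernationResult {
    /// A data frame arrived (or the actor became ready): resume proxying
    /// over the existing connection.
    Continue,
    /// The client closed the connection or the stream ended: tear the
    /// connection down instead of retrying.
    Close,
}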

Minor Issues

  1. CLAUDE.md Update: Good catch updating the anyhow import guidance from glob imports to explicit imports.

  2. Consistent Naming: The rename from handlers to handler in proxy_service.rs:1833 is cleaner since it's a single instance.

Overall Assessment

This is a solid implementation of WebSocket hibernation with good architecture and error handling. The main concerns are around:

  1. Potential race conditions in edge cases
  2. Test coverage for failure scenarios
  3. Documentation of the hibernation state machine

Recommendation: Approve with Minor Changes

Consider addressing:

  • Add documentation for the hibernation state machine
  • Add doc comments to HibernationResult
  • Consider additional edge case tests
  • Verify DoS protection exists at the actor lifecycle level

Great work on this feature! The abstraction through CustomServeTrait makes it easy to extend and maintain.

@claude
Copy link

claude bot commented Nov 13, 2025

PR Review: WebSocket Hibernation Feature

This PR introduces WebSocket hibernation support for the gateway, allowing WebSocket connections to persist while actors are not ready/available. Overall, this is a well-implemented feature with good test coverage. Here are my findings:

✅ Strengths

  1. Clean Architecture: The introduction of HibernationResult enum and the handle_websocket_hibernation trait method provides a clear separation of concerns.

  2. Good Error Handling: The rename from WebSocketServiceRetry to WebSocketServiceHibernate better reflects the semantic meaning of the operation.

  3. Comprehensive Testing: The test in custom_serve.rs:260 demonstrates the hibernation flow well, including verifying that messages are queued and delivered after hibernation ends.

  4. Proper Use of Peek Pattern: Using a Peekable stream wrapper allows checking for incoming messages during hibernation without consuming them, which is the correct approach.

  5. TypeScript Consistency: Fixed the typo canHibernatWebSocket → canHibernateWebSocket throughout the TypeScript codebase.

🔍 Potential Issues & Concerns

1. Race Condition in Hibernation Logic (proxy_service.rs:1914-1917)

ensure!(
    !ws_hibernation_close,
    "should not be hibernating again after receiving a close frame during hibernation"
);

This ensure! will fail with a hard error if hibernation is triggered twice after a close. While the comment at line 1911-1913 says this "should be unreachable," using ensure! for states believed unreachable can still take down live connections in production. Consider:

  • Using debug_assert! if this is truly unreachable
  • Or handling this gracefully by breaking the loop with an error message (sketched after this list)
  • Adding a test case that verifies this scenario cannot occur
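
A sketch of the graceful variant (ws_hibernation_close mirrors the PR's flag; the function is scaffolding so the example compiles on its own):

/// Decides whether another hibernation cycle is allowed. Returning false
/// closes the connection, which is the graceful alternative to failing
/// the whole task with ensure! when the "unreachable" state shows up.
fn may_hibernate_again(ws_hibernation_close: bool) -> bool {
    if ws_hibernation_close {
        tracing::warn!("hibernation requested after close frame; closing connection instead");
        return false;
    }
    true
}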

2. Mutex Lock Holding Duration (lib.rs:590-610)

async fn hibernate_ws(ws_rx: Arc<Mutex<WebSocketReceiver>>) -> Result<HibernationResult> {
    let mut guard = ws_rx.lock().await;
    // ... holds lock for entire hibernation duration

The mutex guard is held for the entire hibernation period, which could potentially block other operations that need access to the receiver. However, since the WebSocketReceiver appears to be single-consumer by design, this might be intentional. Consider documenting this behavior or verifying no other code paths need concurrent access.

3. Error Handling in Peek Loop (lib.rs:604-606)

// Ignore rest
_ => {
    pinned.try_next().await?;
}

This silently consumes non-binary/text/close messages (like Ping, Pong) during hibernation. While probably correct, consider:

  • Adding a log statement for visibility: tracing::trace!("ignoring websocket message during hibernation")
  • Documenting which message types are ignored and why

4. Missing Timeout on Hibernation (lib.rs:555-587)

The handle_websocket_hibernation function waits indefinitely for either:

  • Actor ready event
  • Client message/close

If an actor never becomes ready and the client never sends a message, this could hang forever. Consider adding a timeout:

tokio::select! {
    _ = ready_sub.next() => { ... }
    hibernation_res = hibernate_ws(client_ws.recv()) => { ... }
    _ = tokio::time::sleep(MAX_HIBERNATION_DURATION) => {
        return Err(errors::WebSocketServiceTimeout.build());
    }
}

5. Backoff Logic Regression (proxy_service.rs:1959-1971)

After the refactoring, the backoff is now applied even for ws_hibernate errors in some paths. The logic changed from:

if !ws_retry {
    let backoff = ...;
    tokio::time::sleep(backoff).await;
}

To unconditional backoff at line 1971. Verify this is intentional - hibernation should probably not apply exponential backoff since it's not a failure.

UPDATE: Looking more carefully, the backoff is in the else branch (line 1959) which only executes when !ws_hibernate, so this is actually correct. The indentation made it harder to see.

6. Documentation Comment (custom_serve.rs:41)

/// Returns true if the websocket should close.

This comment is outdated - the function returns Result<HibernationResult>, not a boolean. Should be:

/// Handles WebSocket hibernation when the downstream target is unavailable.
/// Returns whether to continue the connection or close it after hibernation ends.

7. CLAUDE.md Update

Good catch updating the anyhow import guidance! However, verify this change is consistently applied across the codebase - are there existing files that still use use anyhow::*; that should be updated in a separate PR?
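
For reference, the explicit style the updated guidance calls for is a one-line change at each import site:

use anyhow::{bail, ensure, Context, Result}; // instead of: use anyhow::*;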

🎯 Minor Suggestions

  1. Test Coverage: Consider adding tests for:

    • Hibernation timeout scenario (if timeout is added)
    • Multiple hibernation cycles in sequence
    • Hibernation with close frame received (the ws_hibernation_close path)
  2. Metrics: Consider adding metrics for:

    • Number of hibernation events
    • Duration of hibernation periods
    • Success/failure rates of reconnection after hibernation
  3. Type Safety: The hibernate_ws function could potentially use StreamExt::peekable() directly on the parameter rather than requiring the caller to wrap it. This would make the API harder to misuse.

🔒 Security Considerations

No significant security issues identified. The hibernation mechanism properly validates state transitions and maintains connection integrity.

⚡ Performance Considerations

  1. The peekable stream wrapper adds minimal overhead (just an Option<Item> field)
  2. Holding the mutex lock during hibernation is acceptable for single-consumer pattern
  3. The event subscription for actor ready events is efficient

📝 Summary

This is a solid implementation of an important feature. The main concerns are:

  1. The ensure! that could panic (high priority - should be addressed)
  2. Missing hibernation timeout (medium priority - could lead to resource leaks)
  3. Documentation accuracy (low priority)

Recommend addressing items #1 and #4 before merging. The rest can be follow-ups.

Great work overall! The feature is well-architected and tested. 🚀

@claude
Copy link

claude bot commented Nov 13, 2025

Pull Request Review: WebSocket Hibernation Feature

I've reviewed this PR which implements WebSocket hibernation functionality for the Rivet gateway. Overall, this is a well-structured implementation with good error handling and testing. Here are my findings:

✅ Strengths

  1. Good Architectural Design: The hibernation mechanism using HibernationResult enum is clean and well-thought-out, allowing services to opt-in to hibernation behavior.

  2. Comprehensive Test Coverage: The new test test_custom_serve_websocket_hibernation properly validates the hibernation functionality with timeout handling.

  3. Error Handling: Proper use of custom error types (WebSocketServiceHibernate) and error propagation throughout the stack.

  4. Code Cleanup: Good adherence to CLAUDE.md guidelines:

    • Fixed anyhow import pattern (removed glob imports)
    • Proper structured logging with tracing
    • Consistent error handling patterns
  5. Documentation: Good inline comments explaining complex behavior, especially around hibernation close handling.

🔍 Potential Issues

1. Type Safety: Missing Clone Trait Bounds

In pegboard-gateway/src/lib.rs:590, the hibernate_ws function uses Peekable<WebSocketReceiver> which is wrapped in Arc<Mutex<>>. However, the code doesn't verify that all required traits are properly bounded. Consider adding explicit trait bounds if needed.

2. Race Condition Risk in Hibernation

pegboard-gateway/src/lib.rs:555-587: The handle_websocket_hibernation function uses tokio::select! to race between actor ready signal and incoming WebSocket messages. While this is likely correct, consider:

  • What happens if the actor becomes ready exactly when a close frame arrives?
  • The code assumes the Ready workflow signal will trigger, but there's no timeout. Should there be a maximum hibernation duration?

Suggestion: Add a timeout to prevent indefinite hibernation:

const MAX_HIBERNATION_DURATION: Duration = Duration::from_secs(300); // 5 minutes

let close = tokio::select! {
    _ = ready_sub.next() => { /* ... */ }
    hibernation_res = hibernate_ws(client_ws.recv()) => { /* ... */ }
    _ = tokio::time::sleep(MAX_HIBERNATION_DURATION) => {
        tracing::warn!("max hibernation duration exceeded");
        HibernationResult::Close
    }
};

3. Incomplete Error Recovery Path

guard-core/src/proxy_service.rs:1910-1936: The hibernation logic handles ws_hibernation_close with an ensure! that should be "unreachable." However, if this invariant is violated, the error will bubble up and potentially crash the task. Consider:

  • Is this truly unreachable, or is there an edge case we're missing?
  • Should this be a logged warning with graceful handling instead of a hard error?

4. Insufficient Timing Validation in Test

guard-core/tests/custom_serve.rs:257-321: The test waits for the hibernation to complete but doesn't verify the timing. Since HIBERNATION_TIMEOUT is 2 seconds, the test should ensure the message is received after hibernation completes, not before. The current 100ms sleep may not be sufficient validation.

Suggestion: Add timing validation:

let start = std::time::Instant::now();
// ... send hibernate and test message ...
let response = tokio::time::timeout(HIBERNATION_TIMEOUT * 2, ws_stream.next())
    .await
    .expect("timed out waiting for message from hibernating WebSocket");
let elapsed = start.elapsed();
assert!(elapsed >= HIBERNATION_TIMEOUT, "message received too early, hibernation may not have occurred");

5. Typo Fix Verification

rivetkit-typescript/packages/rivetkit/src/actor/config.ts:81: Good catch fixing canHibernatWebSocket → canHibernateWebSocket. However, this is a breaking change for any existing configs using the old name. Ensure:

  • Migration path is documented
  • Backward compatibility is considered (or explicitly not supported)
  • Users are notified in release notes

🎯 Code Quality

  1. Follows CLAUDE.md conventions: ✅

    • Hard tabs used
    • Proper error handling with anyhow
    • Structured logging
    • Timestamps as i64 where applicable
  2. Consistent naming: ✅

    • handle_websocket_hibernation follows naming patterns
    • Error types follow conventions
  3. Import organization: ✅

    • Imports are at the top of files
    • No glob imports from anyhow

🔐 Security Considerations

  1. Resource Exhaustion: The hibernation mechanism could potentially be abused to keep connections open indefinitely. The missing timeout mentioned in issue #2 above should be addressed.

  2. State Management: The ws_hibernation_close flag logic is complex. Ensure this cannot lead to connection leaks or zombie connections.

🚀 Performance Considerations

  1. Lock Contention: hibernate_ws function holds a mutex lock on WebSocketReceiver while peeking. This is fine for single-threaded access, but verify this doesn't cause contention during normal operation.

  2. Message Buffering: The pending message resend logic in pegboard-gateway looks correct, but verify memory usage doesn't grow unbounded if an actor takes a long time to become ready.
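
One way to keep that growth bounded, sketched with hypothetical names and limits:

use std::collections::VecDeque;

use anyhow::{ensure, Result};

const MAX_PENDING_MSGS: usize = 1024; // illustrative cap, not from the PR

struct PendingBuffer<T> {
    msgs: VecDeque<T>,
}

impl<T> PendingBuffer<T> {
    fn new() -> Self {
        Self { msgs: VecDeque::new() }
    }

    /// Rejects new messages once the cap is hit, so a slow actor cannot
    /// make the gateway's buffer grow without bound.
    fn push(&mut self, msg: T) -> Result<()> {
        ensure!(self.msgs.len() < MAX_PENDING_MSGS, "pending message buffer full");
        self.msgs.push_back(msg);
        Ok(())
    }
}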

📝 Recommendations

  1. High Priority: Add maximum hibernation timeout (issue #2)
  2. Medium Priority: Add timing validation to test (issue #4)
  3. Medium Priority: Document the breaking change in rivetkit config (issue #5)
  4. Low Priority: Consider whether the ensure! at line 1914 should be a softer failure
  5. Low Priority: Consider adding metrics/observability for hibernation events

✅ Approval Status

This is a solid implementation with good testing. The issues I've identified are mostly edge cases and hardening opportunities. Once the maximum hibernation timeout is added (critical for preventing resource exhaustion), this would be ready to merge.

Recommended Action: Address the timeout issue, then merge.

Great work on this feature! The hibernation mechanism will be a valuable addition for managing WebSocket connections efficiently.


Reviewed using Claude Code with reference to CLAUDE.md conventions

@NathanFlurry NathanFlurry changed the base branch from 11-12-chore_rivetkit_fix_type_checks to graphite-base/3400 November 13, 2025 22:39
@NathanFlurry NathanFlurry force-pushed the 11-05-feat_gateway_websocket_hibernation branch from 2564d3e to 657d42e on November 13, 2025 22:42
@NathanFlurry NathanFlurry changed the base branch from graphite-base/3400 to main November 13, 2025 22:42
@graphite-app
Copy link
Contributor

graphite-app bot commented Nov 13, 2025

Merge activity

@NathanFlurry NathanFlurry merged commit 4e60a30 into main Nov 13, 2025
8 of 16 checks passed
@NathanFlurry NathanFlurry deleted the 11-05-feat_gateway_websocket_hibernation branch November 13, 2025 22:44
@claude
Copy link

claude bot commented Nov 13, 2025

PR Review: WebSocket Hibernation Feature

Overview

This PR introduces WebSocket hibernation functionality to the gateway, allowing WebSocket connections to pause when the downstream service becomes unavailable and resume when it becomes ready again.

Code Quality & Best Practices

Strengths

  1. Good Error Handling Pattern: The rename from WebSocketServiceRetry to WebSocketServiceHibernate better expresses the intent
  2. Clean Abstraction: The HibernationResult enum provides a clean API for hibernation outcomes
  3. Default Implementation: Providing default implementations for handle_websocket and handle_websocket_hibernation is good
  4. Test Coverage: Good addition of test_custom_serve_websocket_hibernation
  5. CLAUDE.md Update: Good catch updating the import style guidance from glob imports to explicit imports for anyhow

Potential Issues

1. Race Condition in Hibernation Logic (proxy_service.rs:1910-1935)

The ws_hibernation_close flag is set after hibernation returns Close, but the next iteration could potentially enter hibernation again before the connection is properly closed. The ensure! will fail with a hard error if this happens. Consider breaking out of the retry loop immediately when ws_hibernation_close is set.

2. Ping/Pong Handling During Hibernation (pegboard-gateway/src/lib.rs:590-612)

The hibernate_ws function doesn't explicitly handle Message::Ping. While they fall into the default case and get consumed, the WebSocket might need to respond to pings during hibernation to keep the connection alive.

3. TypeScript Breaking Change (rivetkit-typescript/packages/rivetkit/src/actor/config.ts:81)

Good catch fixing the typo from canHibernatWebSocket to canHibernateWebSocket. However, this is a breaking change. Consider deprecating the old name first and supporting both for a transition period.

Security Concerns

  1. DoS via Hibernation: An attacker could potentially trigger repeated hibernation cycles to keep connections open indefinitely. Consider adding a maximum hibernation duration and limits on hibernation cycles per connection.
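
A sketch of such guardrails; the constants and names are illustrative, not from the PR:

use std::time::Duration;

const MAX_HIBERNATION_CYCLES: u32 = 16;
const MAX_TOTAL_HIBERNATION: Duration = Duration::from_secs(15 * 60);

/// Tracks how much hibernation a single connection has consumed.
struct HibernationBudget {
    cycles: u32,
    total: Duration,
}

impl HibernationBudget {
    fn new() -> Self {
        Self { cycles: 0, total: Duration::ZERO }
    }

    /// Records one hibernation period; returns false once the budget is
    /// exhausted and the connection should be closed instead of parked.
    fn record(&mut self, spent: Duration) -> bool {
        self.cycles += 1;
        self.total += spent;
        self.cycles <= MAX_HIBERNATION_CYCLES && self.total <= MAX_TOTAL_HIBERNATION
    }
}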

Test Coverage

Good basic coverage, but missing test cases for:

  • Close during hibernation
  • Multiple hibernation cycles
  • Timeout during hibernation
  • Error during hibernation

Recommendations

High Priority

  • Address the potential race condition with ws_hibernation_close flag
  • Add test coverage for close-during-hibernation scenario
  • Consider ping/pong handling during hibernation for connection keepalive

Medium Priority

  • Add documentation/comments explaining the hibernation flow
  • Consider limits on hibernation duration or cycles to prevent DoS
  • Handle TypeScript breaking change gracefully (deprecation period)

Summary

This is a well-implemented feature with good abstractions and test coverage. The main concerns are around edge cases (particularly the close-during-hibernation flow) and potential race conditions. The code follows Rivet conventions well.

Approval Status: Approved with suggestions for improvements

Great work on this feature!
