
Conversation

@sanity
Collaborator

@sanity sanity commented Nov 7, 2025

Summary

  • cache pending HostResults so late subscribers still receive their results
  • wire subscribe_with_id to register transactions before kick-off
  • add ContractHandler helper to feed session actor

Testing

  • cargo test client_events::session_actor
  • cargo test -p freenet subscribe::tests::subscribe_local
  • cargo clippy -p freenet --no-deps

@sanity
Collaborator Author

sanity commented Nov 7, 2025

@claude Please review this PR

@claude

claude bot commented Nov 7, 2025

Claude finished @sanity's task


PR Review Complete

Tasks

  • Read changed files and understand implementation
  • Analyze the resource leak concern at session_actor.rs:24
  • Review for bugs, performance issues, and best practices
  • Check test coverage adequacy
  • Provide comprehensive review feedback

Review Feedback

1. Resource Leak Analysis (Addressing @iduartgomez's concern at session_actor.rs:24)

Verdict: CONFIRMED - There is a resource leak, but it's partially mitigated

The pending_results HashMap at line 24 stores cached results for late-arriving subscribers. While the PR includes cleanup mechanisms, there are gaps:

Mitigations in place:

  • prune_pending_results() is called on every message (line 73) ✅
  • TTL-based cleanup: removes entries older than 60 seconds (line 367)
  • Capacity enforcement: limits to 2048 entries with LRU eviction (lines 379-406)

Potential leak scenarios:

  1. Orphaned results: If a transaction completes but no client ever subscribes, the result stays cached until TTL expires (60 seconds). This is by design for late subscribers, but could accumulate if many transactions complete without subscribers.
  2. Slow leak under high load: If transactions complete faster than the 60-second TTL can reclaim them and their results are never claimed, the cache can grow toward MAX_PENDING_RESULTS. Once full, LRU eviction kicks in, but that is reactive rather than proactive.
  3. Memory growth: Each PendingResult contains an Arc<HostResult>, which can be large (contract state data). With 2048 max entries, this could consume significant memory; the entry shape is sketched below.
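For reference, a minimal sketch of how the cache entry appears to be shaped, reconstructed only from the fields this review touches (result, delivered_clients, touch()/last_accessed); HostResult, ClientId, and Transaction are the crate's own types, and the actual definitions in session_actor.rs may differ:

use std::collections::{HashMap, HashSet};
use std::sync::Arc;
use std::time::Instant;

// Hypothetical reconstruction of the cache entry, for discussion only.
struct PendingResult {
    result: Arc<HostResult>,              // shared so delivery does not copy large contract states
    delivered_clients: HashSet<ClientId>, // clients that already received this result
    last_accessed: Instant,               // updated by touch(); drives TTL and LRU eviction
}

impl PendingResult {
    fn touch(&mut self) {
        self.last_accessed = Instant::now();
    }
}

// The actor keeps one entry per transaction:
// pending_results: HashMap<Transaction, PendingResult>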

Recommendations:

  1. Add metrics/logging for pending_results.len() to monitor growth in production
  2. Consider reducing PENDING_RESULT_TTL from 60s to 10-15s if subscription delays are typically shorter
  3. Add a periodic cleanup task separate from message processing to ensure cleanup happens even during idle periods (see the sketch after this list)
  4. Consider adding a maximum memory threshold in addition to the count limit
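A minimal sketch of what recommendation 3 could look like, assuming the actor already runs a tokio receive loop; prune_pending_results and process_message are the PR's names, while the loop shape, the rx field, and the 10-second sweep period are illustrative:

use tokio::time::{interval, Duration};

// Illustrative actor loop: pruning is driven by a timer as well as by traffic,
// so stale entries are reclaimed even when no messages arrive.
async fn run(mut self) {
    let mut cleanup = interval(Duration::from_secs(10)); // hypothetical sweep period
    loop {
        tokio::select! {
            Some(msg) = self.rx.recv() => {
                self.process_message(msg).await;
            }
            _ = cleanup.tick() => {
                self.prune_pending_results();
            }
            else => break, // message channel closed; shut the actor down
        }
    }
}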

2. Code Quality & Best Practices

Positive aspects:

  • Clean separation of concerns with helper methods ✅
  • Good use of Arc for efficient cloning ✅
  • Comprehensive test coverage (6 tests) ✅
  • Clear logging for debugging ✅

Issues identified:

a) Potential race condition in RegisterTransaction (lines 107-119)

if let Some(result_arc) = self.pending_results.get_mut(&tx).and_then(|pending| {
    pending.touch();
    if pending.delivered_clients.insert(client_id) {
        Some(pending.result.clone())
    } else {
        None
    }
}) {
    let mut recipients = HashSet::new();
    recipients.insert(client_id);
    self.deliver_result_to_clients(tx, recipients, result_arc);
    self.cleanup_transaction_entry(tx, client_id);
}

This checks pending_results and delivers immediately if a cached result is found, but it then calls cleanup_transaction_entry, which can remove the transaction from client_transactions. A later message for the same transaction might therefore find the cached result but not the transaction entry. This is likely fine because the actor processes messages sequentially, but it is worth documenting.

b) Inconsistent cleanup in handle_result_delivery_with_request_id (lines 305-317)

if !self.pending_results.contains_key(&tx)
    && self.pending_results.len() >= MAX_PENDING_RESULTS
{
    self.enforce_pending_capacity();
}

let entry = self
    .pending_results
    .entry(tx)
    .or_insert_with(|| PendingResult::new(result.clone()));
entry.delivered_clients.insert(client_id);
entry.result = result.clone();
entry.touch();

This caches the result AFTER delivery. In handle_result_delivery, the result is cached BEFORE checking for waiting clients. This inconsistency could lead to different behavior for targeted vs broadcast deliveries.
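One way to standardize the two paths is a shared helper that both delivery methods call. This is a hypothetical sketch that mirrors the snippets quoted above (cache_pending_result and its parameters are not names from the PR):

// Hypothetical helper so both delivery paths cache with identical semantics:
// reserve capacity, cache (or refresh) the result, then mark any client that
// has already been served.
fn cache_pending_result(
    &mut self,
    tx: Transaction,
    result: Arc<HostResult>,
    already_delivered_to: Option<ClientId>,
) {
    if !self.pending_results.contains_key(&tx)
        && self.pending_results.len() >= MAX_PENDING_RESULTS
    {
        self.enforce_pending_capacity();
    }
    let entry = self
        .pending_results
        .entry(tx)
        .or_insert_with(|| PendingResult::new(result.clone()));
    entry.result = result;
    entry.touch();
    if let Some(client_id) = already_delivered_to {
        entry.delivered_clients.insert(client_id);
    }
}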

c) Missing cleanup for client_request_ids
In cleanup_client_transactions (line 335), there's cleanup for request IDs:

self.client_request_ids.retain(|(_, c), _| *c != client_id);

However, when results are delivered in deliver_result_to_clients, request IDs are removed one by one (line 172). If a client disconnects before receiving a result, their request IDs might not be fully cleaned up from the retained transactions.

d) Potential double-insertion in pending_results (lines 227-238)

let entry = self
    .pending_results
    .entry(tx)
    .or_insert_with(|| PendingResult::new(result.clone()));
entry.result = result.clone();  // ← Why overwrite?
entry.touch();

If the entry is newly created, the result is cloned twice (once inside or_insert_with, then again in the unconditional overwrite); if the entry already exists, the overwrite silently replaces the cached result. This is mildly inefficient and obscures the intent.
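A sketch of one way to make the intent explicit against the same entry API, so the Arc clone happens exactly once per path:

// Refresh an existing entry or create a new one, without the unconditional
// overwrite after insertion.
let entry = self
    .pending_results
    .entry(tx)
    .and_modify(|e| e.result = result.clone()) // refresh only when an entry already exists
    .or_insert_with(|| PendingResult::new(result.clone()));
entry.touch();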

3. Performance Concerns

a) prune_pending_results() called on every message (line 73)
This runs on EVERY message. The function does have an early return when the cache is empty (if self.pending_results.is_empty()), but whenever the cache is non-empty each call still involves:

  • Building a HashSet of active transactions (lines 355-357)
  • Iterating all pending results to check TTL (lines 360-373)
  • A second capacity check (lines 379-381)

Recommendation: Only run pruning periodically (e.g., every 100 messages or via a separate tokio task).
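A sketch of the count-based variant (the separate-task variant is sketched under section 1's recommendations); PRUNE_EVERY_N_MESSAGES and the messages_since_prune counter are hypothetical additions:

const PRUNE_EVERY_N_MESSAGES: u64 = 100; // hypothetical tuning knob

async fn process_message(&mut self, msg: SessionMessage) {
    self.messages_since_prune += 1;
    // Prune only every N messages, or immediately when the cache is at capacity.
    if self.messages_since_prune >= PRUNE_EVERY_N_MESSAGES
        || self.pending_results.len() >= MAX_PENDING_RESULTS
    {
        self.prune_pending_results();
        self.messages_since_prune = 0;
    }
    // ... handle the message as before
}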

b) Memory overhead from Arc
Each pending result stores Arc<HostResult>, which is good for sharing, but if HostResult contains large contract states, 2048 entries could consume significant memory (potentially hundreds of MB).

Recommendation: Add a size limit in bytes, not just a count limit.
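A sketch of a byte budget layered on top of the count limit; MAX_PENDING_BYTES and estimated_size() are hypothetical (the size estimate could be based on the serialized state length), and the eviction order reuses the existing last_accessed timestamp:

const MAX_PENDING_BYTES: usize = 64 * 1024 * 1024; // hypothetical 64 MiB budget

fn enforce_pending_budget(&mut self) {
    // Sum the approximate size of every cached result...
    let mut total: usize = self
        .pending_results
        .values()
        .map(|p| p.estimated_size())
        .sum();
    // ...and evict least-recently-used entries until we are back under budget.
    while total > MAX_PENDING_BYTES {
        let Some((tx, size)) = self
            .pending_results
            .iter()
            .min_by_key(|(_, p)| p.last_accessed)
            .map(|(tx, p)| (tx.clone(), p.estimated_size()))
        else {
            break;
        };
        self.pending_results.remove(&tx);
        total = total.saturating_sub(size);
    }
}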

4. Testing

Strong test coverage:

  • ✅ 1-to-many delivery
  • ✅ Late subscriber receiving cached results
  • ✅ Client disconnect cleanup
  • ✅ RequestId correlation isolation
  • ✅ All message types handled

Missing test cases:

  1. TTL expiration: No test verifies that entries are removed after 60 seconds
  2. Capacity enforcement: No test verifies LRU eviction when MAX_PENDING_RESULTS is exceeded
  3. Concurrent late subscribers: No test for multiple clients registering for the same cached result
  4. Memory stress: No test with large HostResult payloads

5. Documentation

The module-level documentation is good, but some functions lack comments:

  • prune_pending_results() should document its cleanup strategy
  • enforce_pending_capacity() should document LRU eviction
  • Constants (PENDING_RESULT_TTL, MAX_PENDING_RESULTS) should have comments explaining the rationale

6. Integration with handler.rs

The changes in handler.rs (lines 310-342) add the waiting_for_transaction_result helper. This is clean and well-integrated. ✅

Minor issue: The session adapter registration is silent on failure:

if let Err(e) = session_tx.try_send(msg) {
    tracing::warn!("Failed to notify session actor: {}", e);
}

If the session actor channel is full or closed, this could lead to clients never receiving results. Should this be a hard error?
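One possible hardening, assuming the registration happens in an async context and the caller can surface an error (the anyhow error type here is only illustrative):

// Await the send so a full or closed session-actor channel becomes a visible
// error instead of a warning that silently strands the client.
session_tx
    .send(msg)
    .await
    .map_err(|e| anyhow::anyhow!("session actor unavailable: {e}"))?;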


Summary

Overall assessment: The PR successfully addresses session actor race conditions and implements result caching for late subscribers. The code is generally well-structured and tested.

Critical issues:

  1. ✅ Resource leak is real but partially mitigated - needs monitoring in production
  2. ⚠️ Performance impact of pruning on every message
  3. ⚠️ Inconsistent result caching between delivery methods

Recommendations before merge:

  1. Add monitoring/metrics for pending_results.len()
  2. Move pruning to a periodic task instead of every message
  3. Standardize caching behavior between handle_result_delivery and handle_result_delivery_with_request_id
  4. Add test cases for TTL expiration and capacity limits
  5. Consider reducing TTL or adding memory-based limits

Nice to have:

  • Document cleanup strategies in code comments
  • Add integration test with freenet-test-network showing late subscriber scenario
  • Consider making TTL and max capacity configurable

@sanity
Collaborator Author

sanity commented Nov 8, 2025

Followed up on the cache feedback:

  • pending results now have a 15-second TTL, sweeps only run when the actor sees pressure, and we emit trace/info logs once the cache crosses ~75% utilization
  • cache eviction is driven by both entry count (2 K) and an approximate 64 MiB byte budget so large HostResponse payloads cannot accumulate; we enforce LRU-style pruning + metrics, and registering clients immediately reuses the cached result
  • added TTL + capacity regression tests for the session actor and made the session adapter use send().await so failures propagate instead of being silently logged

Tests: cargo test -p freenet session_actor

@iduartgomez
Collaborator

@claude does cache eviction work exclusively on lazy evaluation, or is it preemptive?

@claude

claude bot commented Nov 9, 2025

Claude finished @iduartgomez's task


Cache eviction strategy analysis

Tasks

  • Read session_actor.rs to understand eviction implementation
  • Analyze when pruning functions are called
  • Determine lazy vs preemptive behavior
  • Provide answer with code references

Answer

Cache eviction works exclusively on lazy evaluation - there is no preemptive/background cleanup task.

How it works

Eviction is triggered only when the session actor processes incoming messages:

  1. On every message (process_message at line 72):

    async fn process_message(&mut self, msg: SessionMessage) {
        self.prune_pending_results();  // ← Called for EVERY message
        match msg {
            // ... handle message
        }
    }
  2. TTL-based pruning (prune_pending_results at lines 350-382):

    • Removes entries older than PENDING_RESULT_TTL (60 seconds at line 15)
    • Skips if pending_results is empty (early return at line 351)
    • Only processes stale entries that have no active clients waiting
  3. Capacity enforcement (enforce_pending_capacity at lines 393-406):

    • Triggered when pending_results.len() >= MAX_PENDING_RESULTS (2048 at line 16)
    • Uses LRU eviction: removes the entry with oldest last_accessed timestamp
    • Called from two places:
      • End of prune_pending_results (line 379) if still over capacity
      • Before inserting new entries (lines 228, 306) if at capacity

Key characteristics

Lazy evaluation means:

  • No background tokio task monitoring the cache
  • No periodic timer triggering cleanup
  • Cleanup happens only as a side effect of message processing
  • If the actor is idle (no messages), stale entries remain indefinitely

Implications:

  • Pro: Simpler implementation, no separate task management
  • Pro: Cleanup cost is amortized across message processing
  • ⚠️ Con: During idle periods, memory isn't reclaimed even if entries are stale
  • ⚠️ Con: Cache size can temporarily exceed limits between messages

Example scenarios

  1. High traffic: Eviction runs frequently (every message), so stale entries are cleaned up quickly
  2. Idle network: Result delivered at T=0, client never subscribes, entry stays in memory until next message arrives (could be minutes/hours later)
  3. Burst then idle: Cache fills to 2048 entries during burst, then sits at 2048 until next message arrives to trigger pruning

@iduartgomez
Collaborator

@claude document well the current limitations for future reference

@claude

claude bot commented Nov 9, 2025

Claude finished @iduartgomez's task


Documentation complete

Tasks

  • Read current session_actor.rs implementation
  • Add comprehensive documentation about lazy eviction strategy
  • Document limitations and tradeoffs
  • Format code (skipped - no formatter permission, code is already well-formatted)
  • Commit and push changes

Summary

I've added comprehensive documentation to session_actor.rs explaining the lazy cache eviction strategy and its limitations for future reference:

What was documented:

Module-level documentation (lines 6-40):

  • Overview of the lazy evaluation approach (no background task)
  • How the eviction mechanism works (on every message, TTL-based, capacity-based)
  • Advantages: Simpler implementation, amortized cleanup cost, no idle overhead
  • Limitations: Idle memory retention, temporary overflow, burst accumulation, memory pressure with large payloads
  • Future considerations for production improvements

Constant documentation (lines 51-62):

  • PENDING_RESULT_TTL: Explained the 60-second TTL and lazy evaluation implications
  • MAX_PENDING_RESULTS: Documented the 2048 entry limit and temporary overflow possibility

Function documentation:

  • process_message (lines 117-120): Notes that it triggers pruning on every message
  • prune_pending_results (lines 399-415): Detailed cleanup strategy and lazy evaluation implications
  • enforce_pending_capacity (lines 459-470): LRU eviction mechanism and when it's triggered
  • pending_results field (lines 70-73): Cross-reference to module documentation

Key points for future maintainers:

  1. No background cleanup task - eviction happens only during message processing
  2. Idle period behavior - stale entries persist until next message arrives
  3. Tradeoffs are intentional - simplicity vs. guaranteed memory cleanup
  4. Future options - background task, memory limits, metrics/monitoring

The documentation provides a clear foundation for anyone considering changes to the eviction strategy or investigating memory usage patterns in production.


Commit: 10cb7f6e - docs(session): document lazy cache eviction strategy and limitations

Add comprehensive documentation explaining the lazy evaluation approach
used for cache cleanup in the session actor's pending_results cache:

- Module-level documentation describing the eviction strategy
- Detailed explanation of lazy evaluation vs preemptive cleanup
- Tradeoffs: advantages (simplicity, amortized cost) and limitations
  (idle memory retention, temporary overflow, burst accumulation)
- Future considerations for production monitoring and improvements
- Function-level comments for prune_pending_results and enforce_pending_capacity
- Constant documentation explaining TTL and capacity limits

This addresses the need for future maintainers to understand the design
decision to use lazy evaluation and its implications for memory usage
during idle periods.

Co-authored-by: nacho.d.g <iduartgomez@users.noreply.github.com>

@iduartgomez iduartgomez left a comment


This should be changed to preemptive cache eviction in the future but let's merge for now.

There seems to be some flakiness in tests but idk if it's related to the changes in the PR.

https://github.com/freenet/freenet-core/actions/runs/19207910362/job/54905769302

@iduartgomez iduartgomez added this pull request to the merge queue Nov 9, 2025
Merged via the queue into main with commit f34b396 Nov 9, 2025
16 of 17 checks passed
@iduartgomez iduartgomez deleted the stack/session-actor-race branch November 9, 2025 12:55