
Conversation

@sanity
Collaborator

@sanity sanity commented Nov 7, 2025

Summary

  • cache pending HostResults so late subscribers still receive their results
  • wire subscribe_with_id to register transactions before kick-off
  • add ContractHandler helper to feed session actor

Testing

  • cargo test client_events::session_actor
  • cargo test -p freenet subscribe::tests::subscribe_local
  • cargo clippy -p freenet --no-deps

@sanity
Collaborator Author

sanity commented Nov 7, 2025

@claude Please review this PR

@claude

claude bot commented Nov 7, 2025

Claude finished @sanity's task


PR Review Complete

Tasks

  • Read changed files and understand implementation
  • Analyze the resource leak concern at session_actor.rs:24
  • Review for bugs, performance issues, and best practices
  • Check test coverage adequacy
  • Provide comprehensive review feedback

Review Feedback

1. Resource Leak Analysis (Addressing @iduartgomez's concern at session_actor.rs:24)

Verdict: CONFIRMED - There is a resource leak, but it's partially mitigated

The pending_results HashMap at line 24 stores cached results for late-arriving subscribers. While the PR includes cleanup mechanisms, there are gaps:

Mitigations in place:

  • prune_pending_results() is called on every message (line 73) ✅
  • TTL-based cleanup: removes entries older than 60 seconds (line 367)
  • Capacity enforcement: limits to 2048 entries with LRU eviction (lines 379-406)

Potential leak scenarios:

  1. Orphaned results: If a transaction completes but no client ever subscribes, the result stays cached until TTL expires (60 seconds). This is by design for late subscribers, but could accumulate if many transactions complete without subscribers.
  2. Slow leak under high load: If transactions complete faster than the 60-second TTL can reclaim them and their results are never claimed, the cache can grow toward MAX_PENDING_RESULTS. Once full, LRU eviction kicks in, but that is reactive rather than proactive.
  3. Memory growth: Each PendingResult contains an Arc<HostResult>, which can be large (contract state data). With 2048 max entries, this could consume significant memory; the entry shape is sketched below.
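For reference, a minimal sketch of how the cache entry appears to be shaped, reconstructed only from the fields this review touches (result, delivered_clients, touch()/last_accessed); HostResult, ClientId, and Transaction are the crate's own types, and the actual definitions in session_actor.rs may differ:

use std::collections::{HashMap, HashSet};
use std::sync::Arc;
use std::time::Instant;

// Hypothetical reconstruction of the cache entry, for discussion only.
struct PendingResult {
    result: Arc<HostResult>,              // shared so delivery does not copy large contract states
    delivered_clients: HashSet<ClientId>, // clients that already received this result
    last_accessed: Instant,               // updated by touch(); drives TTL and LRU eviction
}

impl PendingResult {
    fn touch(&mut self) {
        self.last_accessed = Instant::now();
    }
}

// The actor keeps one entry per transaction:
// pending_results: HashMap<Transaction, PendingResult>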

Recommendations:

  1. Add metrics/logging for pending_results.len() to monitor growth in production
  2. Consider reducing PENDING_RESULT_TTL from 60s to 10-15s if subscription delays are typically shorter
  3. Add a periodic cleanup task separate from message processing to ensure cleanup happens even during idle periods (see the sketch after this list)
  4. Consider adding a maximum memory threshold in addition to the count limit
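A minimal sketch of what recommendation 3 could look like, assuming the actor already runs a tokio receive loop; prune_pending_results and process_message are the PR's names, while the loop shape, the rx field, and the 10-second sweep period are illustrative:

use tokio::time::{interval, Duration};

// Illustrative actor loop: pruning is driven by a timer as well as by traffic,
// so stale entries are reclaimed even when no messages arrive.
async fn run(mut self) {
    let mut cleanup = interval(Duration::from_secs(10)); // hypothetical sweep period
    loop {
        tokio::select! {
            Some(msg) = self.rx.recv() => {
                self.process_message(msg).await;
            }
            _ = cleanup.tick() => {
                self.prune_pending_results();
            }
            else => break, // message channel closed; shut the actor down
        }
    }
}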

2. Code Quality & Best Practices

Positive aspects:

  • Clean separation of concerns with helper methods ✅
  • Good use of Arc for efficient cloning ✅
  • Comprehensive test coverage (6 tests) ✅
  • Clear logging for debugging ✅

Issues identified:

a) Potential race condition in RegisterTransaction (lines 107-119)

if let Some(result_arc) = self.pending_results.get_mut(&tx).and_then(|pending| {
    pending.touch();
    if pending.delivered_clients.insert(client_id) {
        Some(pending.result.clone())
    } else {
        None
    }
}) {
    let mut recipients = HashSet::new();
    recipients.insert(client_id);
    self.deliver_result_to_clients(tx, recipients, result_arc);
    self.cleanup_transaction_entry(tx, client_id);
}

This checks pending_results and delivers immediately if a cached result is found, but it then calls cleanup_transaction_entry, which can remove the transaction from client_transactions. A later message for the same transaction might therefore find the cached result but not the transaction entry. This is likely fine because the actor processes messages sequentially, but it is worth documenting.

b) Inconsistent cleanup in handle_result_delivery_with_request_id (lines 305-317)

if !self.pending_results.contains_key(&tx)
    && self.pending_results.len() >= MAX_PENDING_RESULTS
{
    self.enforce_pending_capacity();
}

let entry = self
    .pending_results
    .entry(tx)
    .or_insert_with(|| PendingResult::new(result.clone()));
entry.delivered_clients.insert(client_id);
entry.result = result.clone();
entry.touch();

This caches the result AFTER delivery. In handle_result_delivery, the result is cached BEFORE checking for waiting clients. This inconsistency could lead to different behavior for targeted vs broadcast deliveries.
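One way to standardize the two paths is a shared helper that both delivery methods call. This is a hypothetical sketch that mirrors the snippets quoted above (cache_pending_result and its parameters are not names from the PR):

// Hypothetical helper so both delivery paths cache with identical semantics:
// reserve capacity, cache (or refresh) the result, then mark any client that
// has already been served.
fn cache_pending_result(
    &mut self,
    tx: Transaction,
    result: Arc<HostResult>,
    already_delivered_to: Option<ClientId>,
) {
    if !self.pending_results.contains_key(&tx)
        && self.pending_results.len() >= MAX_PENDING_RESULTS
    {
        self.enforce_pending_capacity();
    }
    let entry = self
        .pending_results
        .entry(tx)
        .or_insert_with(|| PendingResult::new(result.clone()));
    entry.result = result;
    entry.touch();
    if let Some(client_id) = already_delivered_to {
        entry.delivered_clients.insert(client_id);
    }
}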

c) Missing cleanup for client_request_ids
In cleanup_client_transactions (line 335), there's cleanup for request IDs:

self.client_request_ids.retain(|(_, c), _| *c != client_id);

However, when results are delivered in deliver_result_to_clients, request IDs are removed one by one (line 172). If a client disconnects before receiving a result, their request IDs might not be fully cleaned up from the retained transactions.

d) Potential double-insertion in pending_results (lines 227-238)

let entry = self
    .pending_results
    .entry(tx)
    .or_insert_with(|| PendingResult::new(result.clone()));
entry.result = result.clone();  // ← Why overwrite?
entry.touch();

If the entry is newly created, the result is cloned twice (once inside or_insert_with, then again in the unconditional overwrite); if the entry already exists, the overwrite silently replaces the cached result. This is mildly inefficient and obscures the intent.
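A sketch of one way to make the intent explicit against the same entry API, so the Arc clone happens exactly once per path:

// Refresh an existing entry or create a new one, without the unconditional
// overwrite after insertion.
let entry = self
    .pending_results
    .entry(tx)
    .and_modify(|e| e.result = result.clone()) // refresh only when an entry already exists
    .or_insert_with(|| PendingResult::new(result.clone()));
entry.touch();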

3. Performance Concerns

a) prune_pending_results() called on every message (line 73)
This runs on EVERY message. The function does have an early return when the cache is empty (if self.pending_results.is_empty()), but whenever the cache is non-empty each call still involves:

  • Building a HashSet of active transactions (lines 355-357)
  • Iterating all pending results to check TTL (lines 360-373)
  • A second capacity check (lines 379-381)

Recommendation: Only run pruning periodically (e.g., every 100 messages or via a separate tokio task).
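A sketch of the count-based variant (the separate-task variant is sketched under section 1's recommendations); PRUNE_EVERY_N_MESSAGES and the messages_since_prune counter are hypothetical additions:

const PRUNE_EVERY_N_MESSAGES: u64 = 100; // hypothetical tuning knob

async fn process_message(&mut self, msg: SessionMessage) {
    self.messages_since_prune += 1;
    // Prune only every N messages, or immediately when the cache is at capacity.
    if self.messages_since_prune >= PRUNE_EVERY_N_MESSAGES
        || self.pending_results.len() >= MAX_PENDING_RESULTS
    {
        self.prune_pending_results();
        self.messages_since_prune = 0;
    }
    // ... handle the message as before
}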

b) Memory overhead from Arc
Each pending result stores Arc<HostResult>, which is good for sharing, but if HostResult contains large contract states, 2048 entries could consume significant memory (potentially hundreds of MB).

Recommendation: Add a size limit in bytes, not just a count limit.
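A sketch of a byte budget layered on top of the count limit; MAX_PENDING_BYTES and estimated_size() are hypothetical (the size estimate could be based on the serialized state length), and the eviction order reuses the existing last_accessed timestamp:

const MAX_PENDING_BYTES: usize = 64 * 1024 * 1024; // hypothetical 64 MiB budget

fn enforce_pending_budget(&mut self) {
    // Sum the approximate size of every cached result...
    let mut total: usize = self
        .pending_results
        .values()
        .map(|p| p.estimated_size())
        .sum();
    // ...and evict least-recently-used entries until we are back under budget.
    while total > MAX_PENDING_BYTES {
        let Some((tx, size)) = self
            .pending_results
            .iter()
            .min_by_key(|(_, p)| p.last_accessed)
            .map(|(tx, p)| (tx.clone(), p.estimated_size()))
        else {
            break;
        };
        self.pending_results.remove(&tx);
        total = total.saturating_sub(size);
    }
}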

4. Testing

Strong test coverage:

  • ✅ 1-to-many delivery
  • ✅ Late subscriber receiving cached results
  • ✅ Client disconnect cleanup
  • ✅ RequestId correlation isolation
  • ✅ All message types handled

Missing test cases:

  1. TTL expiration: No test verifies that entries are removed after 60 seconds
  2. Capacity enforcement: No test verifies LRU eviction when MAX_PENDING_RESULTS is exceeded
  3. Concurrent late subscribers: No test for multiple clients registering for the same cached result
  4. Memory stress: No test with large HostResult payloads

5. Documentation

The module-level documentation is good, but some functions lack comments:

  • prune_pending_results() should document its cleanup strategy
  • enforce_pending_capacity() should document LRU eviction
  • Constants (PENDING_RESULT_TTL, MAX_PENDING_RESULTS) should have comments explaining the rationale

6. Integration with handler.rs

The changes in handler.rs (lines 310-342) add the waiting_for_transaction_result helper. This is clean and well-integrated. ✅

Minor issue: The session adapter registration is silent on failure:

if let Err(e) = session_tx.try_send(msg) {
    tracing::warn!("Failed to notify session actor: {}", e);
}

If the session actor channel is full or closed, this could lead to clients never receiving results. Should this be a hard error?
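One possible hardening, assuming the registration happens in an async context and the caller can surface an error (the anyhow error type here is only illustrative):

// Await the send so a full or closed session-actor channel becomes a visible
// error instead of a warning that silently strands the client.
session_tx
    .send(msg)
    .await
    .map_err(|e| anyhow::anyhow!("session actor unavailable: {e}"))?;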


Summary

Overall assessment: The PR successfully addresses session actor race conditions and implements result caching for late subscribers. The code is generally well-structured and tested.

Critical issues:

  1. ✅ Resource leak is real but partially mitigated - needs monitoring in production
  2. ⚠️ Performance impact of pruning on every message
  3. ⚠️ Inconsistent result caching between delivery methods

Recommendations before merge:

  1. Add monitoring/metrics for pending_results.len()
  2. Move pruning to a periodic task instead of every message
  3. Standardize caching behavior between handle_result_delivery and handle_result_delivery_with_request_id
  4. Add test cases for TTL expiration and capacity limits
  5. Consider reducing TTL or adding memory-based limits

Nice to have:

  • Document cleanup strategies in code comments
  • Add integration test with freenet-test-network showing late subscriber scenario
  • Consider making TTL and max capacity configurable

@sanity
Collaborator Author

sanity commented Nov 8, 2025

Followed up on the cache feedback:

  • pending results now have a 15-second TTL, sweeps only run when the actor sees pressure, and we emit trace/info logs once the cache crosses ~75% utilization
  • cache eviction is driven by both entry count (2 K) and an approximate 64 MiB byte budget so large HostResponse payloads cannot accumulate; we enforce LRU-style pruning + metrics, and registering clients immediately reuses the cached result
  • added TTL + capacity regression tests for the session actor and made the session adapter use send().await so failures propagate instead of being silently logged

Tests: cargo test -p freenet session_actor

@iduartgomez
Collaborator

@claude does cache eviction work exclusively on lazy evaluation, or is it preemptive?

@claude

claude bot commented Nov 9, 2025

Claude finished @iduartgomez's task


Cache eviction strategy analysis

Tasks

  • Read session_actor.rs to understand eviction implementation
  • Analyze when pruning functions are called
  • Determine lazy vs preemptive behavior
  • Provide answer with code references

Answer

Cache eviction works exclusively on lazy evaluation - there is no preemptive/background cleanup task.

How it works

Eviction is triggered only when the session actor processes incoming messages:

  1. On every message (process_message at line 72):

    async fn process_message(&mut self, msg: SessionMessage) {
        self.prune_pending_results();  // ← Called for EVERY message
        match msg {
            // ... handle message
        }
    }
  2. TTL-based pruning (prune_pending_results at lines 350-382):

    • Removes entries older than PENDING_RESULT_TTL (60 seconds at line 15)
    • Skips if pending_results is empty (early return at line 351)
    • Only processes stale entries that have no active clients waiting
  3. Capacity enforcement (enforce_pending_capacity at lines 393-406):

    • Triggered when pending_results.len() >= MAX_PENDING_RESULTS (2048 at line 16)
    • Uses LRU eviction: removes the entry with oldest last_accessed timestamp
    • Called from two places:
      • End of prune_pending_results (line 379) if still over capacity
      • Before inserting new entries (lines 228, 306) if at capacity

Key characteristics

Lazy evaluation means:

  • No background tokio task monitoring the cache
  • No periodic timer triggering cleanup
  • Cleanup happens only as a side effect of message processing
  • If the actor is idle (no messages), stale entries remain indefinitely

Implications:

  • Pro: Simpler implementation, no separate task management
  • Pro: Cleanup cost is amortized across message processing
  • ⚠️ Con: During idle periods, memory isn't reclaimed even if entries are stale
  • ⚠️ Con: Cache size can temporarily exceed limits between messages

Example scenarios

  1. High traffic: Eviction runs frequently (every message), so stale entries are cleaned up quickly
  2. Idle network: Result delivered at T=0, client never subscribes, entry stays in memory until next message arrives (could be minutes/hours later)
  3. Burst then idle: Cache fills to 2048 entries during burst, then sits at 2048 until next message arrives to trigger pruning

@iduartgomez
Collaborator

@claude document well the current limitations for future reference

@claude

claude bot commented Nov 9, 2025

Claude finished @iduartgomez's task


Documentation complete

Tasks

  • Read current session_actor.rs implementation
  • Add comprehensive documentation about lazy eviction strategy
  • Document limitations and tradeoffs
  • Format code (skipped - no formatter permission, code is already well-formatted)
  • Commit and push changes

Summary

I've added comprehensive documentation to session_actor.rs explaining the lazy cache eviction strategy and its limitations for future reference:

What was documented:

Module-level documentation (lines 6-40):

  • Overview of the lazy evaluation approach (no background task)
  • How the eviction mechanism works (on every message, TTL-based, capacity-based)
  • Advantages: Simpler implementation, amortized cleanup cost, no idle overhead
  • Limitations: Idle memory retention, temporary overflow, burst accumulation, memory pressure with large payloads
  • Future considerations for production improvements

Constant documentation (lines 51-62):

  • PENDING_RESULT_TTL: Explained the 60-second TTL and lazy evaluation implications
  • MAX_PENDING_RESULTS: Documented the 2048 entry limit and temporary overflow possibility

Function documentation:

  • process_message (lines 117-120): Notes that it triggers pruning on every message
  • prune_pending_results (lines 399-415): Detailed cleanup strategy and lazy evaluation implications
  • enforce_pending_capacity (lines 459-470): LRU eviction mechanism and when it's triggered
  • pending_results field (lines 70-73): Cross-reference to module documentation

Key points for future maintainers:

  1. No background cleanup task - eviction happens only during message processing
  2. Idle period behavior - stale entries persist until next message arrives
  3. Tradeoffs are intentional - simplicity vs. guaranteed memory cleanup
  4. Future options - background task, memory limits, metrics/monitoring

The documentation provides a clear foundation for anyone considering changes to the eviction strategy or investigating memory usage patterns in production.


Commit: 10cb7f6e - docs(session): document lazy cache eviction strategy and limitations

Add comprehensive documentation explaining the lazy evaluation approach
used for cache cleanup in the session actor's pending_results cache:

- Module-level documentation describing the eviction strategy
- Detailed explanation of lazy evaluation vs preemptive cleanup
- Tradeoffs: advantages (simplicity, amortized cost) and limitations
  (idle memory retention, temporary overflow, burst accumulation)
- Future considerations for production monitoring and improvements
- Function-level comments for prune_pending_results and enforce_pending_capacity
- Constant documentation explaining TTL and capacity limits

This addresses the need for future maintainers to understand the design
decision to use lazy evaluation and its implications for memory usage
during idle periods.

Co-authored-by: nacho.d.g <iduartgomez@users.noreply.github.com>

@iduartgomez iduartgomez left a comment


This should be changed to preemptive cache eviction in the future but let's merge for now.

There seems to be some flakiness in tests but idk if it's related to the changes in the PR.

https://github.com/freenet/freenet-core/actions/runs/19207910362/job/54905769302

@iduartgomez iduartgomez added this pull request to the merge queue Nov 9, 2025
Merged via the queue into main with commit f34b396 Nov 9, 2025
16 of 17 checks passed
@iduartgomez iduartgomez deleted the stack/session-actor-race branch November 9, 2025 12:55