Skip to content

Conversation

@sanity
Copy link
Collaborator

@sanity sanity commented Nov 7, 2025

Summary

  • ensure subscribes fetch contract code and log candidate peers
  • bump stdlib + freenet-ping deps and update gateway script
  • adjust router distance defaults and request density tracing

Testing

  • cargo test operations::subscribe
  • cargo test --test operations subscribe_path
  • cargo test connectivity::test_gateway_reconnection

@sanity
Copy link
Collaborator Author

sanity commented Nov 7, 2025

@claude Please review PR

@claude
Copy link

claude bot commented Nov 7, 2025

Claude finished @sanity's task —— View job


PR Review Complete

Todo List

  • Read all changed files
  • Review main subscription routing changes (subscribe.rs)
  • Review router distance and tracing changes
  • Check for bugs, security issues, and performance problems
  • Provide comprehensive feedback

Summary

This PR significantly enhances subscription routing reliability by adding contract code fetching, improved candidate peer logging, and router distance defaults adjustments. The changes are generally well-structured and improve the robustness of subscription operations.

Key Improvements

  1. Contract Code Fetching: Added fetch_contract_if_missing() to ensure contract code is available before completing subscriptions
  2. Enhanced Logging: Extensive tracing additions for debugging subscription routing
  3. Router Distance Handling: Fixed edge cases where peers without locations could cause issues
  4. Test Output Cleanup: Replaced println! with debug! for cleaner test output

Detailed Findings

1. Subscription Routing Logic (subscribe.rs)

Strengths:

  • Excellent addition of fetch_contract_if_missing() to ensure contract code is present (lines 54-72)
  • Good use of wait_for_local_contract() with timeout/polling for async contract availability (lines 38-52)
  • Improved handling of local subscription scenarios with complete_local_subscription() helper (lines 258-288)
  • Enhanced skip list management prevents routing loops
  • Better HTL (hops-to-live) validation and bounds checking (lines 520-521, 536-543)

Issues & Suggestions:

🔴 High Priority - Potential Infinite Recursion
crates/core/src/operations/subscribe.rs:564-580
When no candidates are found for forwarding, the code attempts to fetch the contract locally using GET operation. However, if the GET also fails to find the contract, this could potentially lead to cascading failures. Consider:

// Current approach might retry indefinitely
let get_op = get::start_op(*key, true, false);
if let Err(fetch_err) = get::request_get(op_manager, get_op, HashSet::new()).await

Recommendation: Add a retry limit or context flag to prevent GET operations from triggering additional subscribe operations.

🟡 Medium Priority - Error Handling Inconsistency
crates/core/src/operations/subscribe.rs:771
The fetch_contract_if_missing() call uses ? operator which will propagate errors up, but there's no logging at this level:

fetch_contract_if_missing(op_manager, *key).await?;

Recommendation: Add error logging before propagation:

if let Err(e) = fetch_contract_if_missing(op_manager, *key).await {
    tracing::warn!(tx = %id, %key, error = %e, "Failed to fetch contract code");
    return Err(e);
}

🟡 Medium Priority - Magic Numbers
crates/core/src/operations/subscribe.rs:23-24
The timeout and polling interval constants lack documentation:

const LOCAL_FETCH_TIMEOUT_MS: u64 = 1_500;
const LOCAL_FETCH_POLL_INTERVAL_MS: u64 = 25;

Recommendation: Add documentation explaining the rationale for these values:

/// Maximum time to wait for a contract to become available locally after a GET request
const LOCAL_FETCH_TIMEOUT_MS: u64 = 1_500;
/// Interval between checks when polling for contract availability
const LOCAL_FETCH_POLL_INTERVAL_MS: u64 = 25;

🟢 Low Priority - Verbose Logging
crates/core/src/operations/subscribe.rs:649-679
The subscribers_snapshot() calls before/after subscriber registration create significant log volume. While useful for debugging, this might be overwhelming in production.
Recommendation: Consider using debug! instead of info! for the before/after snapshots, or add a feature flag for verbose subscription logging.

🟢 Low Priority - Code Duplication
crates/core/src/operations/subscribe.rs:405-445 and crates/core/src/operations/subscribe.rs:649-679
Similar subscriber registration logic appears multiple times with slight variations.
Recommendation: Consider extracting into a helper function like register_subscriber_with_logging() to reduce duplication and improve maintainability.

2. Router Distance Defaults (router/mod.rs)

Strengths:

  • Good fix for handling peers without location data (lines 165-169)
  • Using Distance::new(0.5) as default is reasonable (middle of the ring)

Issues & Suggestions:

🟡 Medium Priority - Inconsistent Handling
crates/core/src/router/mod.rs:165-169 vs crates/core/src/router/mod.rs:206-212
Two different approaches for handling missing locations:

// In select_closest_peers: unwrap_or with default distance
.map(|peer| {
    let distance = peer.location
        .map(|loc| target_location.distance(loc))
        .unwrap_or_else(|| Distance::new(0.5));
    (peer, distance)
})

// In select_k_best_peers: filter_map that excludes peers without location
.filter_map(|peer| {
    peer.location.map(|loc| {
        let distance = target_location.distance(loc);
        (peer, distance)
    })
})

Recommendation: Decide on a consistent strategy. If peers without locations should be included with default distance, apply this uniformly. If they should be excluded, apply that uniformly. The current mixed approach could lead to unexpected behavior.

3. PUT Operation Changes (operations/put.rs)

Strengths:

  • Good fix for preserving upstream reference in broadcast scenarios (lines 888-896)

Issues & Suggestions:

🟢 Low Priority - Logic Complexity
crates/core/src/operations/put.rs:888-896
The preserved_upstream logic with or_else(|| Some(upstream.clone())) is somewhat convoluted:

let upstream_for_completion = preserved_upstream
    .clone()
    .or_else(|| Some(upstream.clone()));

Recommendation: Simplify to:

let upstream_for_completion = preserved_upstream.or(Some(upstream));

(No need for clone() after or_else if you use or)

4. Test Cleanup

Strengths:

  • Good practice replacing println! with debug! in tests
  • Maintains debuggability while keeping test output clean

Testing Recommendations

The PR description mentions three test commands. Please also consider:

  1. Stress Testing: Run subscription operations under high load with multiple concurrent subscribes
  2. Network Partition Tests: Test behavior when contract is unavailable and network is partitioned
  3. HTL Boundary Tests: Verify behavior at HTL=0 and HTL=1 edge cases
  4. GET-Subscribe Interaction: Test scenarios where GET operations are triggered from within subscribe operations

Minor Observations

  • Dependency bumps (stdlib 0.1.14 → 0.1.22) look reasonable
  • Script changes are minor cleanup improvements
  • Overall code quality is high with good error handling

Conclusion

This is a solid PR that meaningfully improves subscription routing robustness. The main concerns are:

  1. Potential for cascading failures in GET fallback logic
  2. Inconsistent peer location handling in router
  3. Some opportunities for code simplification

The extensive logging additions are valuable for debugging but consider tunability for production environments.

Recommendation: Approve with minor revisions suggested above.


@sanity
Copy link
Collaborator Author

sanity commented Nov 8, 2025

Took another pass over Claude’s feedback:

  • Documented the local wait constants and replaced the repetitive info! snapshots with register_subscriber_with_logging, which keeps the before/after diagnostics but at debug! level and avoids duplicating the logic (crates/core/src/operations/subscribe.rs).
  • Added warn-level context around fetch_contract_if_missing failures so we don’t lose the underlying error when a contract can’t be hydrated (crates/core/src/operations/subscribe.rs:803).
  • Introduced peer_distance_or_default so both the “closest peers” and “k best peers” paths treat peers without a recorded location the same way instead of filtering them out only in one branch (crates/core/src/router/mod.rs).
  • Simplified the preserved-upstream hand-off in the PUT completion path using Option::or (crates/core/src/operations/put.rs:983).

On the potential recursion: when a seek request can’t find any candidates we attempt a single GET via get::start_op(*key, true, false) (crates/core/src/operations/subscribe.rs:615). That final false means the GET op never turns around and issues a subscription, so there’s no recursive subscribe→GET→subscribe loop. If the GET fails or the contract still isn’t present after wait_for_local_contract, we immediately return return_not_subbed() (crates/core/src/operations/subscribe.rs:615-707), so we never retry indefinitely.

Testing:

  • cargo test operations::subscribe
  • cargo test --test operations subscribe_path
  • `cargo test connectivity::test_gateway_reconnection

@sanity sanity requested a review from iduartgomez November 8, 2025 17:18
chrono = { workspace = true, features = ["default"] }
clap = { version = "4.5", features = ["derive"] }
freenet-stdlib = { version = "0.1.24", features = ["net"] }
freenet-stdlib = { version = "0.1.22", features = ["net"] }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a downgrade (!), why?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Codex] Rebased the stack and restored this dependency to 0.1.24 so the app stays aligned with the workspace and the newer stdlib APIs that the upper PRs expect.

[workspace.dependencies]
# freenet-stdlib = { path = "./../../stdlib/rust", features = ["contract"] }
freenet-stdlib = { version = "0.1.24" }
freenet-stdlib = { version = "0.1.14" }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a downgrade (!), why?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Codex] Same here—reset to 0.1.24 so the workspace definition matches the app crate and the rest of the stack. This keeps the stdlib bump intact for the dependent PRs.

.filter_map(|peer| {
peer.location
.map(|loc| (peer, target_location.distance(loc)))
.map(|peer| {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a dejavuh with this, already commented in a previous PR, let's not make upa. distance, filter_map will filter on collect peers which dont have a distance setup anyway, which in any case is not gonna happen in practice, unless is some sort of transient connection

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Codex] Thanks for flagging. I dropped the synthetic distance helper and restored the original filter_map behaviour so we only consider peers with a recorded location—matching the prior implementation and avoiding any made-up distances.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we trying to route an update while the node has not joined the network? that sounds wrong.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iduartgomez You mean if the node is not part of the subscription tree for that contract? The idea was that the update would be routed towards the contract in the hope that it would hit a subscribed node. Is that incorrect?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During bootstrap we still pass PeerKeyLocations without a real location (e.g. from connection_manager.own_location() before the node has joined), so dropping them leaves us with no routing candidates.

If the node has no real location is because it hasn't joined yet no? There is no other way.

Copy link
Collaborator Author

@sanity sanity Nov 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Codex] Right—the only time we lack a location is before the join completes, and we do issue ops during that pre-join window while connection_manager.own_location() is PeerKeyLocation { location: None }. When I switched back to filter_map to drop those entries, test_put_contract_three_hop_returns_response started timing out because routing never had a candidate (CI hit the 120‑second put timeout). The midpoint fallback just keeps those provisional peers eligible until update_location(Some(Location::from_address(&addr))) runs; after that they carry the real location and get ranked normally.

@sanity sanity force-pushed the stack/subscription-routing branch from 117dac9 to f4ad769 Compare November 9, 2025 16:56
@iduartgomez iduartgomez force-pushed the stack/update-fallback branch from c3fdf36 to e161690 Compare November 9, 2025 18:36
@iduartgomez iduartgomez force-pushed the stack/subscription-routing branch from 4103ba5 to e1f507b Compare November 9, 2025 19:20
Base automatically changed from stack/update-fallback to main November 9, 2025 19:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants