Skip to content

Conversation

@sanity
Copy link
Collaborator

@sanity sanity commented Nov 7, 2025

Summary

  • replace the legacy connect/handshake path with the trimmed two-message flow
  • consolidate ring + transport plumbing (connection manager counters, packet handling, tracing)
  • add GitHub workflow that runs the river six-peer regression on the freenet-128gb runner

Testing

  • cargo test -p freenet connect::tests
  • cargo test -p freenet node::network_bridge::priority_select::tests
  • cargo test connectivity::test_gateway_reconnection
  • FREENET_CORE_PATH=/home/ian/code/freenet/freenet-core/courtesy-connection-budget cargo test --test message_flow river_message_flow_over_freenet_six_peers_five_rounds -- --ignored --exact

@sanity sanity force-pushed the stack/connect-transport-rewrite branch from 9e98c1c to 967d4d9 Compare November 7, 2025 18:47
@sanity
Copy link
Collaborator Author

sanity commented Nov 7, 2025

@claude Please review PR

@claude
Copy link

claude bot commented Nov 7, 2025

Claude finished @sanity's task —— View job


PR Review: Transport/Handshake Pipeline Refactor

I've completed a comprehensive review of this PR. This is a significant refactoring that simplifies the connect/handshake flow and consolidates connection management logic. Overall, the changes are well-structured and improve code maintainability.

Summary of Changes

This PR accomplishes:

  • Replaces legacy multi-stage connect handshake with a streamlined two-message flow
  • Consolidates ring + transport connection management
  • Removes ~1500 lines of complex handshake logic
  • Adds CI workflow for six-peer regression testing
  • Net reduction of ~1300 lines of code

Key Findings

1. Handshake Simplification (✅ Major Improvement)

File: crates/core/src/node/network_bridge/handshake.rs

The new handshake handler is dramatically simpler:

  • Before: 1732 lines of complex state machine logic
  • After: 225 lines of lightweight connection adapter
  • Removes 651 lines of handshake tests (deleted file: handshake/tests.rs)

Strengths:

  • Much cleaner separation of concerns
  • Handshake driver now just wires transport events to/from event loop
  • Higher-level routing decisions moved into ConnectOp where they belong

Concerns:

  • ⚠️ Loss of test coverage: The deleted handshake/tests.rs file contained 651 lines of tests. While the old code was complex, these tests validated important handshake scenarios. The PR description mentions some replacement tests (cargo test -p freenet connect::tests), but it's unclear if all scenarios are covered.

Recommendation: Consider adding comments or documentation explaining which test scenarios from the old handshake tests are now covered by the new connect::tests and which (if any) are intentionally not covered because they're obsolete.

2. Connect Operation Refactor (✅ Good Design)

File: crates/core/src/operations/connect.rs

The new two-message connect flow is elegant:

  • ConnectMsg::Request: Joiner sends request toward target with TTL and visited list
  • ConnectMsg::Response: Acceptors respond back along discovered path
  • ConnectMsg::ObservedAddress: Peers notify joiners of their observed public address

Strengths:

  • Clear state separation: JoinerState vs RelayState
  • TTL-based forwarding prevents infinite loops
  • Visited list prevents trivial routing cycles
  • Observable address discovery for NAT traversal

Code Quality:

  • Lines 189-254: RelayState::handle_request() is well-structured with clear action outputs
  • Lines 317-334: JoinerState::register_acceptance() has good progress tracking
  • Lines 766-834: join_ring_request() has proper error handling and backoff logic

Minor Issues:

  • Line 299: courtesy_hint() always returns self.op_manager.ring.open_connections() == 0 - this seems simplistic. Is there documentation explaining the courtesy connection strategy?
  • Lines 836-944: initial_join_procedure() spawns an infinite loop that checks connection health. Consider adding a way to gracefully shut down this task.

3. Connection Manager Changes (✅ Solid)

File: crates/core/src/ring/connection_manager.rs

Improvements:

  • Lines 122-266: should_accept() has excellent logging and atomic counter management
  • Lines 203-219: Gateway-specific direct-accept limit (GATEWAY_DIRECT_ACCEPT_LIMIT = 2) is a good safeguard
  • Lines 268-293: New record_pending_location() method properly tracks peers during handshake
  • Lines 386-428: update_peer_identity() handles provisional → final peer identity transitions

Potential Issues:

  • Lines 151-178: The reserved_connections CAS loop could theoretically retry forever if under extreme contention. Consider adding a retry limit.
  • Lines 345-351: prune_in_transit_connection() and prune_alive_connection() both call the same prune_connection() with a boolean flag - this is fine, but the naming could be clearer (e.g., was_alive instead of is_alive).

4. Network Bridge Changes (⚠️ Complex but Necessary)

File: crates/core/src/node/network_bridge/p2p_protoc.rs

This file grew from ~900 to ~2000 lines, absorbing handshake coordination logic.

Concerns:

  • Lines 347-360: Bug check for self-targeting messages logs as "BUG" and converts to inbound processing. This is good defensive programming, but the error message says "This indicates a routing logic error". If this shouldn't happen, should it panic in debug builds?
  • Lines 367-388: TOCTOU race condition mitigation is well-documented (line 367 comment), but the fallback logic at lines 374-388 (resolving by public key when address is unspecified) feels fragile. What happens if multiple connections exist with different addresses but same key?
  • Lines 449-497: Connection establishment is spawned into a background task with 20-second timeout. If this fails, the original operation may hang waiting for a connection that never completes. Is there a cleanup mechanism?

Strengths:

  • Lines 1319-1454: handle_handshake_action() consolidates handshake event processing cleanly
  • Lines 1498-1593: handle_successful_connection() properly coordinates connection manager updates
  • Lines 1605-1716: handle_peer_connection_msg() handles provisional peer identity updates well (lines 1608-1646)

5. Transport Layer (✅ Minor Cleanup)

File: crates/core/src/transport/peer_connection.rs

  • Removed 78 lines of unused code
  • File crates/core/src/transport/mod.rs removed 7 lines of dead code

This is good cleanup.

6. CI/Testing (✅ Good Addition)

File: .github/workflows/six-peer-regression.yml

  • New workflow runs River six-peer regression test
  • 120-minute timeout is generous
  • Tests on both stack/connect-transport-rewrite and main branches

Suggestion: Consider adding a fast-fail basic connectivity test that runs on every PR before the expensive 2-hour test.

7. Test Updates (✅ Updated Appropriately)

Multiple test files updated to match new connection patterns:

  • crates/core/tests/connectivity.rs: Updated gateway reconnection test
  • crates/core/tests/operations.rs: Updated for new connect message structure
  • crates/core/tests/error_notification.rs: Updated error scenarios

Issues Found

🔴 Critical

None found - the code appears functional.

🟡 Medium Priority

  1. Test Coverage Gap (handshake/tests.rs deleted):

    • The PR removes 651 lines of handshake tests
    • Need to verify all important scenarios are covered by new tests
    • Location: crates/core/src/node/network_bridge/handshake/tests.rs (deleted)
  2. Infinite Loop Task Management (connect.rs:836-944):

    • initial_join_procedure() spawns infinite loop with no shutdown mechanism
    • Could leak tasks on node restart
    • Location: crates/core/src/operations/connect.rs:836-944
  3. Potential Connection Leak (p2p_protoc.rs:449-497):

    • Background task spawned to wait for connection with 20s timeout
    • No obvious cleanup if original requester goes away
    • Location: crates/core/src/node/network_bridge/p2p_protoc.rs:449-497

🟢 Low Priority (Style/Documentation)

  1. Courtesy Connection Strategy (connect.rs:299):

    • Logic is simple (open_connections() == 0) but not documented
    • Location: crates/core/src/operations/connect.rs:299
  2. Reserved Connections CAS Loop (connection_manager.rs:151-178):

    • Could add retry limit for extreme contention scenarios
    • Location: crates/core/src/ring/connection_manager.rs:151-178
  3. Self-Routing Bug Detection (p2p_protoc.rs:347-360):

    • Logs as "BUG" but continues execution - consider panic in debug mode?
    • Location: crates/core/src/node/network_bridge/p2p_protoc.rs:347-360

Recommendations

  1. Before merge:

    • Add comments mapping old handshake test scenarios to new test coverage
    • Document the courtesy connection strategy (why open_connections() == 0?)
    • Review connection cleanup in background tasks (line 449 in p2p_protoc.rs)
  2. Follow-up PRs:

    • Add graceful shutdown for initial_join_procedure() infinite loop
    • Consider retry limits on CAS loops in connection_manager
    • Add metrics/observability for connection state transitions

Conclusion

This is a solid refactoring that significantly simplifies the codebase while preserving functionality. The new two-message connect flow is more maintainable than the legacy multi-stage approach. The main concern is ensuring test coverage remains comprehensive after removing the old handshake tests.

Recommendation: ✅ Approve with minor follow-up items

The code quality is good, and the concerns raised are mostly about documentation and future robustness rather than correctness issues. The reduction of ~1300 lines while maintaining functionality is impressive.


Review completed by Claude

@sanity sanity requested a review from iduartgomez November 8, 2025 01:40
@sanity
Copy link
Collaborator Author

sanity commented Nov 8, 2025

Line 299: courtesy_hint() always returns self.op_manager.ring.open_connections() == 0 - this seems simplistic. Is there documentation explaining the courtesy connection strategy?

Courtesy connections still need work, I have a local branch that removes courtesy connections - currently they'll stick around like normal connections which isn't ideal.

@sanity
Copy link
Collaborator Author

sanity commented Nov 8, 2025

[Codex] Addressed the outstanding feedback:

  • documented why the current courtesy hint is intentionally simple and tied it back to the follow-up branch that will finish the feature.
  • taught initial_join_procedure to return a JoinHandle and made the node runtime/testing harness keep and abort it, so the watchdog loop can be shut down cleanly when the node exits.

Tests: cargo test -p freenet connect::tests:: -- --nocapture

@sanity sanity force-pushed the stack/connect-transport-rewrite branch from ead6ba4 to d3318b5 Compare November 8, 2025 02:09
@sanity
Copy link
Collaborator Author

sanity commented Nov 8, 2025

[Codex] Rebased onto the latest stack/subscription-routing to clear the GitHub conflicts (the only manual resolution was in crates/core/src/operations/put.rs, making sure the new connect-origin fields coexist with the TTL/skip-list tweaks). Re-ran cargo test -p freenet connect::tests:: -- --nocapture after the rebase and pushed with --force-with-lease.

@sanity sanity force-pushed the stack/subscription-routing branch from 117dac9 to f4ad769 Compare November 9, 2025 16:56
@sanity sanity force-pushed the stack/connect-transport-rewrite branch 2 times, most recently from d3602a7 to 3cfdb91 Compare November 9, 2025 18:29
@sanity
Copy link
Collaborator Author

sanity commented Nov 9, 2025

[Codex] Rebased onto the latest stack/subscription-routing so the PR is conflict-free again. Only automated picks this time—no manual edits were required beyond what was already on the branch. Re-ran cargo test -p freenet connect::tests:: -- --nocapture after the rebase and force-pushed with lease.

@iduartgomez iduartgomez force-pushed the stack/subscription-routing branch from 4103ba5 to e1f507b Compare November 9, 2025 19:20
@sanity sanity force-pushed the stack/connect-transport-rewrite branch from 3cfdb91 to 9b7d958 Compare November 9, 2025 21:34
@sanity
Copy link
Collaborator Author

sanity commented Nov 9, 2025

[Codex] Rebased onto the latest stack/subscription-routing (which force-pushed earlier today). The only conflicts were the two session-actor commits that upstream already absorbed, so I kept the newer upstream implementation while replaying the stack. Targeted tests (cargo test -p freenet connect::tests:: -- --nocapture) pass, and the branch has been force-pushed with lease.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants