
Commit 60c5129

shayne-fletcher authored and meta-codesync[bot] committed
system: audit tests (meta-pytorch#2005)
Summary:
Pull Request resolved: meta-pytorch#2005

this completes the V0→V1 audit for system.rs. all remaining tests here are V0-specific and do not have V1 equivalents because the centralized `SystemActor` model (join, snapshot, system-wide stop, selective world stop, address-update routing, etc.) does not exist in the mesh-based architecture. each test is now annotated accordingly so it's clear which parts of the file are legacy behavior tied to the old system-wide coordinator.

the only behavioral change in this diff is the removal of `test_channel_dial_count`. that test was originally written as a debugging aid to understand when dials occur in the V0 forwarding path; it has never been a correctness test and has no meaningful counterpart in V1 (which uses static addressing and structured proc/host meshes). dropping it reduces noise in the V0 test corpus and clarifies which tests matter going forward.

as with the previous files, the intent of this stack is audit and classification, not behavior change: V0 tests stay in place until `hyperactor_multiprocess` is retired, but they are now explicitly marked as V0-only.

Reviewed By: dulinriley

Differential Revision: D87929235

fbshipit-source-id: ef74d84d5b6403385e22674e55a4c26cf2598f91
1 parent a019901 commit 60c5129

1 file changed: +36 -77 lines changed

hyperactor_multiprocess/src/system.rs

Lines changed: 36 additions & 77 deletions
@@ -177,7 +177,6 @@ mod tests {
     use hyperactor::WorldId;
     use hyperactor::channel::ChannelAddr;
     use hyperactor::channel::ChannelTransport;
-    use hyperactor::channel::TcpMode;
     use hyperactor::clock::Clock;
     use hyperactor::clock::RealClock;
     use hyperactor_telemetry::env::execution_id;
@@ -199,6 +198,12 @@ mod tests {
     use crate::system_actor::WorldSnapshotProcInfo;
     use crate::system_actor::WorldStatus;
 
+    // V0-specific test - no V1 equivalent. Tests System::attach()
+    // which creates ephemeral client instances that can join a
+    // running system and communicate through the centralized
+    // SystemActor. V1's mesh-based architecture pre-allocates procs in
+    // ProcMesh - there's no centralized SystemActor or concept of
+    // ad-hoc ephemeral clients dynamically joining a running system.
     #[tokio::test]
     async fn test_join() {
         for transport in ChannelTransport::all() {
@@ -230,6 +235,15 @@ mod tests {
         }
     }
 
+    // V0-specific test - no V1 equivalent. Tests
+    // SystemActor::snapshot() which queries the centralized
+    // SystemActor for the global state of all worlds, procs, and
+    // their statuses. Also tests SystemSnapshotFilter for querying
+    // subsets by world ID, world labels, or proc labels. V1 has no
+    // centralized SystemActor or global snapshot API - each
+    // ProcMesh/HostMesh independently tracks its own state via
+    // StatusMesh for spawn/stop operations, but there's no
+    // system-wide snapshot query mechanism.
     #[tokio::test]
     async fn test_system_snapshot() {
         let system_handle = System::serve(
@@ -536,11 +550,17 @@ mod tests {
         }
     }
 
-    // The test consists of 2 steps:
-    // 1. spawn a system with 2 host procs, and 8 worker procs. For each worker
-    //    proc, spawn a root actor with a children tree.
-    // 2. Send a Stop message to system actor, and verify everything will be
-    //    shut down.
+    // V0-specific test - no V1 equivalent. Tests SystemActor::stop()
+    // which provides a single API call to shut down the entire system
+    // (all worlds, all procs, all actors) that the SystemActor
+    // manages. Creates a complex multi-world setup (worker worlds
+    // from host procs, directly-joined proc worlds), calls one
+    // SystemActor::stop(), and verifies everything stops cleanly. V1
+    // has hierarchical shutdown (ProcMesh::stop() cascades to all
+    // procs/actors in that mesh), but no centralized registry
+    // tracking all meshes - you must stop each top-level mesh
+    // (ProcMesh/HostMesh) you created. V1 has no single-call
+    // system-wide shutdown like SystemActor::stop().
     #[tracing_test::traced_test]
     #[async_timed_test(timeout_secs = 60)]
     async fn test_system_shutdown() {
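
The hierarchical shutdown described in the comment above can be pictured with a toy model. This is a minimal sketch, not the hyperactor API: `ToyProc`, `ToyMesh`, and `stop_all` are hypothetical names invented for illustration. It assumes only what the comment states, namely that stopping a mesh cascades to the procs it owns and that the caller, not a central registry, has to remember every top-level mesh it created.

```rust
/// Hypothetical stand-in for a proc; not a hyperactor type.
struct ToyProc {
    running: bool,
}

/// Hypothetical stand-in for a top-level mesh that owns its procs.
struct ToyMesh {
    procs: Vec<ToyProc>,
}

impl ToyMesh {
    /// Create a mesh with `n` running procs.
    fn new(n: usize) -> Self {
        ToyMesh {
            procs: (0..n).map(|_| ToyProc { running: true }).collect(),
        }
    }

    /// Stopping a mesh cascades to every proc it owns.
    fn stop(&mut self) {
        for proc in &mut self.procs {
            proc.running = false;
        }
    }

    /// Number of procs in this mesh that are still running.
    fn running(&self) -> usize {
        self.procs.iter().filter(|p| p.running).count()
    }
}

/// There is no registry in this model: the caller passes in the meshes it
/// created. Returns how many procs are still running afterwards.
fn stop_all(meshes: &mut [ToyMesh]) -> usize {
    for mesh in meshes.iter_mut() {
        mesh.stop();
    }
    meshes.iter().map(ToyMesh::running).sum()
}
```

In this model, `stop_all(&mut meshes)` returns 0 for whatever slice the caller hands in. A mesh the caller forgot to include would keep running, which is the trade-off the comment describes relative to a single system-wide stop.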
@@ -661,6 +681,16 @@ mod tests {
         }
     }
 
+    // V0-specific test - no V1 equivalent. Tests SystemActor::stop()
+    // with a world filter to selectively shut down specific worlds
+    // while keeping the system and other worlds running. Creates
+    // worker_world (16 procs) and foo_world (2 procs), calls
+    // SystemActor::stop(Some([foo_world]), ...) to stop only
+    // foo_world, and verifies foo_world procs stopped while
+    // worker_world and SystemActor remain running. V1 has no
+    // centralized world registry or selective shutdown - you just
+    // call stop() on individual meshes you want to stop, with no
+    // notion of "worlds" or a coordinator that continues running.
     #[async_timed_test(timeout_secs = 60)]
     async fn test_single_world_shutdown() {
         let system_handle = System::serve(
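
For readers unfamiliar with the V0 model, the selective stop that this test exercises can be sketched as a filter over a central world registry. The code below is an illustrative toy, not the `SystemActor` implementation: `ToyCoordinator`, `add_world`, and `running_procs` are hypothetical names, and the behavior for `None` (stop every world and the coordinator itself) is an assumption drawn from the comments in this diff.

```rust
use std::collections::HashMap;

/// Hypothetical coordinator that tracks every world centrally.
/// Each world is modeled as a list of "is this proc running" flags.
struct ToyCoordinator {
    worlds: HashMap<String, Vec<bool>>,
    running: bool,
}

impl ToyCoordinator {
    fn new() -> Self {
        ToyCoordinator {
            worlds: HashMap::new(),
            running: true,
        }
    }

    /// Register a world with `procs` running procs.
    fn add_world(&mut self, name: &str, procs: usize) {
        self.worlds.insert(name.to_string(), vec![true; procs]);
    }

    /// With a filter, stop only the named worlds and keep the coordinator
    /// running. Without one, stop everything (assumed behavior).
    fn stop(&mut self, filter: Option<&[&str]>) {
        match filter {
            Some(names) => {
                for name in names {
                    if let Some(procs) = self.worlds.get_mut(*name) {
                        procs.iter_mut().for_each(|running| *running = false);
                    }
                }
            }
            None => {
                for procs in self.worlds.values_mut() {
                    procs.iter_mut().for_each(|running| *running = false);
                }
                self.running = false;
            }
        }
    }

    /// Number of procs still running in `world` (0 if the world is unknown).
    fn running_procs(&self, world: &str) -> usize {
        self.worlds
            .get(world)
            .map_or(0, |procs| procs.iter().filter(|r| **r).count())
    }
}
```

With `worker_world` (16 procs) and `foo_world` (2 procs) registered, calling `stop(Some(&["foo_world"][..]))` leaves all 16 worker procs and the coordinator running in this model, mirroring the scenario in the comment.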
@@ -820,75 +850,4 @@ mod tests {
             );
         }
     }
-
-    // Test our understanding of when & where channel addresses are
-    // dialed.
-    #[tracing_test::traced_test]
-    #[tokio::test]
-    async fn test_channel_dial_count() {
-        let system_handle = System::serve(
-            ChannelAddr::any(ChannelTransport::Tcp(TcpMode::Hostname)),
-            Duration::from_secs(10),
-            Duration::from_secs(10),
-        )
-        .await
-        .unwrap();
-
-        let system_addr = system_handle.local_addr();
-        let mut system = System::new(system_addr.clone());
-        // `system.attach()` calls `system.send()` which
-        // `channel::dial()`s the system address for a `MailboxClient`
-        // for the `EnvelopingMailboxSender` to be the forwarding
-        // sender for `client1`'s proc (+1 dial).
-        //
-        // The forwarding sender will be used to send a join message
-        // to the system actor that uses the `NetTx` just dialed so no
-        // new `channel::dial()` for that (+0 dial). However, the
-        // system actor will respond to the join message by using the
-        // proc address (given in the join message) for the new proc
-        // when it sends from its `DialMailboxRouter` so we expect to
-        // see a `channel::dial()` there (+1 dial).
-        let client1 = system.attach().await.unwrap();
-
-        // `system.attach()` calls `system.send()` which
-        // `channel::dial()`s the system address for a `MailboxClient`
-        // for the `EnvelopingMailboxSender` to be the forwarding
-        // sender for `client2`'s proc (+1 dial).
-        //
-        // The forwarding sender will be used to send a join message
-        // to the system actor that uses the `NetTx` just dialed so no
-        // new `channel::dial()` for that (+0 dial). However, the
-        // system actor will respond to the join message by using the
-        // proc address (given in the join message) for the new proc
-        // when it sends from its `DialMailboxRouter` so we expect to
-        // see a `channel::dial()` there (+1 dial).
-        let client2 = system.attach().await.unwrap();
-
-        // Send a message to `client2` from `client1`. This will
-        // involve forwarding to the system actor using `client1`'s
-        // proc's forwarder's already dialed `NetTx` (+0 dial). The
-        // system actor will relay to `client2`'s proc. The `NetTx` to
-        // that proc was cached in the system actor's
-        // `DialMailboxRouter` when responding to `client2`'s join (+0
-        // dial).
-        let (port, mut port_rx) = client2.open_port();
-        port.bind().send(&client1, 123u64).unwrap();
-        assert_eq!(port_rx.recv().await.unwrap(), 123u64);
-
-        // In summary we expect to see 4 dials.
-        logs_assert(|logs| {
-            let dial_count = logs
-                .iter()
-                .filter(|log| log.contains("dialing channel tcp"))
-                .count();
-            if dial_count == 4 {
-                Ok(())
-            } else {
-                Err(format!("unexpected tcp channel dial count: {}", dial_count))
-            }
-        });
-
-        system_handle.stop().await.unwrap();
-        system_handle.await;
-    }
 }
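
The dial arithmetic in the removed test's comments (+1 and +1 per attach, +0 for the relayed message, 4 in total) follows from one rule: a sender dials an address the first time it sends there and reuses the cached connection afterwards. The sketch below replays that accounting with a toy router. It is illustrative only; `CachingRouter` and `simulate_dial_count` are hypothetical names and do not model `DialMailboxRouter`, `MailboxClient`, or `NetTx` beyond the "dial once, then cache" behavior those comments describe.

```rust
use std::collections::HashSet;

/// Hypothetical router that dials an address on first use and then
/// reuses the cached connection.
#[derive(Default)]
struct CachingRouter {
    dialed: HashSet<String>,
    dials: usize,
}

impl CachingRouter {
    /// Send to `addr`, counting a dial only if no connection is cached.
    fn send(&mut self, addr: &str) {
        // `HashSet::insert` returns true only when the value was absent.
        if self.dialed.insert(addr.to_string()) {
            self.dials += 1;
        }
    }
}

/// Replay the scenario from the removed test's comments and return the
/// total number of dials across all three routers.
fn simulate_dial_count() -> usize {
    let mut system_router = CachingRouter::default();
    let mut client1_forwarder = CachingRouter::default();
    let mut client2_forwarder = CachingRouter::default();

    // client1 attaches: its forwarder dials the system and sends the
    // join over that connection (+1), then the system replies to
    // client1's proc address (+1).
    client1_forwarder.send("system");
    system_router.send("client1");

    // client2 attaches: same pattern (+1, +1).
    client2_forwarder.send("system");
    system_router.send("client2");

    // client1 sends to client2 via the system: both hops reuse cached
    // connections (+0, +0).
    client1_forwarder.send("system");
    system_router.send("client2");

    system_router.dials + client1_forwarder.dials + client2_forwarder.dials
}
```

Under these assumptions `simulate_dial_count()` returns 4, matching the count the removed test asserted from the logs.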
