
Conversation

@fegin (Contributor) commented Nov 18, 2025

Stack from ghstack (oldest at bottom):

Replaces the dry_run.py implementation with fake PG mode for DRY_RUN configuration validation. This PR also adds support for local tensor mode to provide deeper validation coverage.

Note: Currently returns early before init_weights() when using local tensor mode, due to a limitation of local tensor that will be fixed by pytorch/pytorch#166540.
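
For reference, a minimal sketch (not the code in this PR) of what "fake PG mode" means here: a single process initializes torch.distributed with a fake backend and pretends to be N ranks, so setup code can run without GPUs or peer processes. The FakeStore import and "fake" backend registration below are assumed to come from PyTorch's internal testing utilities.

```python
# Minimal sketch, assuming torch.testing._internal's FakeStore registers the
# "fake" backend on import. One process claims world_size=N; collectives
# become no-ops, so no other processes or GPUs are needed.
import torch.distributed as dist
from torch.testing._internal.distributed.fake_pg import FakeStore

def init_fake_pg(world_size: int, rank: int = 0) -> None:
    dist.init_process_group(
        backend="fake", store=FakeStore(), rank=rank, world_size=world_size
    )
```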

[ghstack-poisoned]
fegin added a commit that referenced this pull request Nov 18, 2025
Replaces `dry_run.py` implementation with local tensor mode for DRY_RUN configuration validation. Local tensor mode provides deeper validation coverage, including `ParallelDims` creation, which the previous implementation could not verify.

**Note:** Currently returns early before `init_weights()` due to a known limitation in local tensor mode. This still validates more of the pipeline than the previous approach.


ghstack-source-id: b53ea9f
Pull-Request: #2057
meta-cla bot added the CLA Signed label Nov 18, 2025
[ghstack-poisoned]
fegin added a commit that referenced this pull request Nov 18, 2025

ghstack-source-id: c37e849
Pull-Request: #2057
# TODO(local_tensor): Remove this special case once LocalTensor supports
# init_weights(). In local tensor mode, skip training/checkpointing as the
# model is not fully initialized
if config.comm.local_tensor_mode:
Contributor

Probably a naive question: what's the advantage of LocalTensor over FakeTensor if we're just running the parallelization setup code?

LocalTensor would do the actual compute and give correct results, but it would run (more slowly) on a single GPU since it simulates each rank's operations. FakeTensor would skip all the compute and run (more quickly, I think) on a single CPU.

Do we intend to run with numerics, or just smoke-test that we don't hit API errors along the way to setup?

@fegin (Contributor Author) Nov 18, 2025

> Do we intend to run with numerics, or just smoke-test that we don't hit API errors along the way to setup?

For dry run mode that validates the configurations, the fake backend should be enough. But I also want to use this infra to enable some tests that do not require GPUs, running one step and verifying the output. We are putting all the different parallelism tests into integration tests, which require an H100 machine, and the queuing time is going to become longer.

Also, DeviceMesh uses tensor operations; if we want to verify DeviceMesh operations on all ranks, we will need LocalTensor. The fake backend only allows you to verify rank0's DeviceMesh behavior, though this should not be a big deal as we mostly do SPMD.
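
A rough illustration of the rank0-only validation the fake backend gives you (the mesh shape and dim names here are made up for the example, and it assumes a fake PG with world_size=8 has already been initialized as in the sketch above):

```python
# With the fake backend, only this one process exists, but DeviceMesh
# construction and slicing can still be exercised for rank0's view.
from torch.distributed.device_mesh import init_device_mesh

mesh = init_device_mesh("cpu", (2, 4), mesh_dim_names=("dp", "tp"))
tp_mesh = mesh["tp"]  # sub-mesh slicing only needs rank metadata, no real comms
```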

@fegin (Contributor Author) Nov 19, 2025

I re-thought your comment. I think I should not couple dry run mode with local tensor mode; my intention for local tensor mode is to use it for lightweight integration tests.

I added another option to enable ONLY fake PG mode. Local tensor mode depends on fake PG mode, and dry run only requires fake PG mode.
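
A hypothetical sketch of the dependency between the two knobs (field names are taken from this thread, not necessarily the final config schema):

```python
from dataclasses import dataclass

@dataclass
class Comm:
    fake_backend: bool = False       # dry run only needs this
    local_tensor_mode: bool = False  # builds on top of the fake PG

    def __post_init__(self) -> None:
        # local_tensor_mode implies fake_backend; reject inconsistent configs early
        if self.local_tensor_mode and not self.fake_backend:
            raise ValueError("local_tensor_mode requires fake_backend=True")
```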

Contributor

OK, that makes sense! I like the idea of using local tensor mode for running actual numerics validation. It just seemed overkill if used for dry run.

Contributor

I think local tensor mode might help to some extent to debug numerics; however, I have met numerics issues caused by DTensor or missing communication. So we would recommend users debug in the following order: dry run mode -> local tensor mode -> real parallelism, right?

Contributor Author

IMO, dry run mode is purely for debugging the setup phase of the trainer. The setup means the trainer configurations, DeviceMesh setup, and parallelism configurations. For an end user of TorchTitan, dry run mode is mostly useful to detect configuration errors before launching a large-scale training run. For TorchTitan developers like us, dry run mode is useful as an early debugging signal when developing the trainer, components, and parallelisms (e.g., parallelize.py).

LocalTensor, on the other hand, is useful to actually check what happens during the forward and backward passes. While fake tensor can also help, it doesn't actually incur the communication and computation, which may hide issues, and when the computation involves a data-dependent op, it will fail. For example, you won't be able to debug CP load balancing with fake tensor.
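
As a concrete illustration of the data-dependent-op limitation (my own toy example, not from this PR; the exact exception type depends on the PyTorch version):

```python
import torch
from torch._subclasses.fake_tensor import FakeTensorMode

with FakeTensorMode():
    x = torch.randn(8, 8)
    y = x @ x                       # fine: only shapes/dtypes are propagated
    try:
        idx = torch.nonzero(x > 0)  # output shape depends on data, not representable
    except Exception as e:
        print(type(e).__name__)     # fails under fake tensor
```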

Contributor

I'm confused by the options here.

IIUC, fake backend doesn't mean fake tensor, right? What happens if local_tensor_mode=False but fake_backend=True?

Since this is user facing, I feel it might be clearer to organize the options based on user intent, e.g. comm_mode = "dry" / "local", instead of providing multiple knobs which only function when combined properly.

Contributor Author

fake_backend=True just means that we use the fake backend for the communication. It doesn't have to be used with comm_mode if we don't care about accuracy. The computation will be done locally on rank0, but the collectives will not be. For local_tensor_mode, all the computation (rank0 to rankN-1) will be done on rank0, and the collectives will be simulated as well.

I'm okay with combining the two.
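
To make the fake_backend semantics concrete, here is a toy example of what "the collectives will not be [done]" looks like (assuming a fake PG was initialized as in the earlier sketch):

```python
import torch
import torch.distributed as dist

t = torch.ones(4)
dist.all_reduce(t)  # no peers exist; under the fake backend this is effectively a no-op
print(t)            # still all ones: numerics are wrong, but the API path is exercised
```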

Contributor

With fake_backend=True, is rank0 the only device that participates in the computation? Or does each rank compute its own stuff without actual comms?

Contributor Author

@tianyu-l All ranks will perform the computation locally, but it actually doesn't matter because the ranks don't talk to each other. For dry run mode, we always launch with only one rank (but fake it as if there are N ranks).

[ghstack-poisoned]
fegin added a commit that referenced this pull request Nov 19, 2025

ghstack-source-id: b024f8f
Pull-Request: #2057
@fegin changed the title from "[Local Tensor] Replace dry_run.py with local tensor mode implementation" to "[Local Tensor] Replace dry_run.py with fake mode implementation" on Nov 19, 2025
[ghstack-poisoned]
fegin added a commit that referenced this pull request Nov 20, 2025

ghstack-source-id: 27b8bad
Pull-Request: #2057
@fegin requested review from wconstab and wwwjn, November 20, 2025 02:16
self.loss_fn, self.gradient_accumulation_steps
)

# TODO(local_tensor): Remove this early return once LocalTensor supports
Contributor

pytorch/pytorch#166540 is merged, shall we remove this early return?

Contributor Author

There are still some gaps. I updated the comment.

try:
trainer = trainer_class(config)

# TODO(local_tensor): Remove this special case once LocalTensor supports
Contributor

similarly, can we remove this now?

Contributor Author

There are still some gaps. I updated the comment.

[ghstack-poisoned]
fegin added a commit that referenced this pull request Nov 20, 2025

ghstack-source-id: 5ea1d46
Pull-Request: #2057