
Conversation

@fegin (Contributor) commented Nov 18, 2025

Stack from ghstack (oldest at bottom):

Replaces the dry_run.py implementation with fake PG mode for DRY_RUN configuration validation. This PR also adds support for local tensor mode to provide deeper validation coverage.

Note: Currently returns early before init_weights() when using local tensor mode, due to a limitation of local tensor that will be fixed by pytorch/pytorch#166540.
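
For reference, a minimal sketch (not the code in this PR) of what "fake PG mode" means here: a single process initializes torch.distributed with a fake backend and pretends to be N ranks, so setup code can run without GPUs or peer processes. The FakeStore import and "fake" backend registration below are assumed to come from PyTorch's internal testing utilities.

```python
# Minimal sketch, assuming torch.testing._internal's FakeStore registers the
# "fake" backend on import. One process claims world_size=N; collectives
# become no-ops, so no other processes or GPUs are needed.
import torch.distributed as dist
from torch.testing._internal.distributed.fake_pg import FakeStore

def init_fake_pg(world_size: int, rank: int = 0) -> None:
    dist.init_process_group(
        backend="fake", store=FakeStore(), rank=rank, world_size=world_size
    )
```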

[ghstack-poisoned]
fegin added a commit that referenced this pull request Nov 18, 2025
Replaces `dry_run.py` implementation with local tensor mode for DRY_RUN configuration validation. Local tensor mode provides deeper validation coverage, including `ParallelDims` creation, which the previous implementation could not verify.

**Note:** Currently returns early before `init_weights()` due to a known limitation in local tensor mode. This still validates more of the pipeline than the previous approach.


ghstack-source-id: b53ea9f
Pull-Request: #2057
meta-cla bot added the CLA Signed label Nov 18, 2025
[ghstack-poisoned]
fegin added a commit that referenced this pull request Nov 18, 2025

ghstack-source-id: c37e849
Pull-Request: #2057
# TODO(local_tensor): Remove this special case once LocalTensor supports
# init_weights(). In local tensor mode, skip training/checkpointing as the
# model is not fully initialized
if config.comm.local_tensor_mode:
Contributor

Probably a naive question: what's the advantage of LocalTensor over FakeTensor if we're just running the parallelization setup code?

LocalTensor would do the actual compute and give correct results, but it would run (more slowly) on a single GPU since it simulates each rank's operations. FakeTensor would skip all the compute and run (more quickly, I think) on a single CPU.

Do we intend to run with numerics, or just smoke-test that we don't hit API errors along the way to setup?

@fegin (Contributor Author) Nov 18, 2025

> Do we intend to run with numerics, or just smoke-test that we don't hit API errors along the way to setup?

For dry run mode that validates the configurations, the fake backend should be enough. But I also want to use this infra to enable some tests that do not require GPUs, running one step and verifying the output. We are putting all the different parallelism tests into integration tests, which require an H100 machine, and the queuing time is going to become longer.

Also, DeviceMesh uses tensor operations; if we want to verify DeviceMesh operations on all ranks, we will need LocalTensor. The fake backend only allows you to verify rank0's DeviceMesh behavior, though this should not be a big deal as we mostly do SPMD.
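
A rough illustration of the rank0-only validation the fake backend gives you (the mesh shape and dim names here are made up for the example, and it assumes a fake PG with world_size=8 has already been initialized as in the sketch above):

```python
# With the fake backend, only this one process exists, but DeviceMesh
# construction and slicing can still be exercised for rank0's view.
from torch.distributed.device_mesh import init_device_mesh

mesh = init_device_mesh("cpu", (2, 4), mesh_dim_names=("dp", "tp"))
tp_mesh = mesh["tp"]  # sub-mesh slicing only needs rank metadata, no real comms
```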

@fegin (Contributor Author) Nov 19, 2025

I re-thought your comment. I think I should not couple dry run mode with local tensor mode; my intention for local tensor mode is to use it for lightweight integration tests.

I added another option to enable ONLY fake PG mode. Local tensor mode depends on fake PG mode, and dry run only requires fake PG mode.
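
A hypothetical sketch of the dependency between the two knobs (field names are taken from this thread, not necessarily the final config schema):

```python
from dataclasses import dataclass

@dataclass
class Comm:
    fake_backend: bool = False       # dry run only needs this
    local_tensor_mode: bool = False  # builds on top of the fake PG

    def __post_init__(self) -> None:
        # local_tensor_mode implies fake_backend; reject inconsistent configs early
        if self.local_tensor_mode and not self.fake_backend:
            raise ValueError("local_tensor_mode requires fake_backend=True")
```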

Contributor

OK, that makes sense! I like the idea of using local tensor mode for running actual numerics validation. It just seemed overkill if used for dry run.

Contributor

I think local tensor mode might help to some extent to debug numerics; however, I have met numerics issues caused by DTensor or missing communication. So we would recommend users debug in the following order: dry run mode -> local tensor mode -> real parallelism, right?

Contributor Author

IMO, dry run mode is purely for debugging the setup phase of the trainer. The setup means the trainer configurations, DeviceMesh setup, and parallelism configurations. For an end user of TorchTitan, dry run mode is mostly useful to detect configuration errors before launching a large-scale training run. For TorchTitan developers like us, dry run mode is useful as an early debugging signal when developing the trainer, components, and parallelisms (e.g., parallelize.py).

LocalTensor, on the other hand, is useful to actually check what happens during the forward and backward passes. While fake tensor can also help, it doesn't actually incur the communication and computation, which may hide issues, and when the computation involves a data-dependent op, it will fail. For example, you won't be able to debug CP load balancing with fake tensor.
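
As a concrete illustration of the data-dependent-op limitation (my own toy example, not from this PR; the exact exception type depends on the PyTorch version):

```python
import torch
from torch._subclasses.fake_tensor import FakeTensorMode

with FakeTensorMode():
    x = torch.randn(8, 8)
    y = x @ x                       # fine: only shapes/dtypes are propagated
    try:
        idx = torch.nonzero(x > 0)  # output shape depends on data, not representable
    except Exception as e:
        print(type(e).__name__)     # fails under fake tensor
```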

Contributor

I'm confused by the options here.

IIUC, fake backend doesn't mean fake tensor, right? What happens if local_tensor_mode=False but fake_backend=True?

Since this is user facing, I feel it might be clearer to organize the options based on user intent, e.g. comm_mode = "dry" / "local", instead of providing multiple knobs which only function when combined properly.

Contributor Author

fake_backend=True just means that we use the fake backend for the communication. It doesn't have to be used with comm_mode if we don't care about accuracy. The computation will be done locally on rank0, but the collectives will not be. For local_tensor_mode, all the computation (rank0 to rankN-1) will be done on rank0, and the collectives will be simulated as well.

I'm okay with combining the two.
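
To make the fake_backend semantics concrete, here is a toy example of what "the collectives will not be [done]" looks like (assuming a fake PG was initialized as in the earlier sketch):

```python
import torch
import torch.distributed as dist

t = torch.ones(4)
dist.all_reduce(t)  # no peers exist; under the fake backend this is effectively a no-op
print(t)            # still all ones: numerics are wrong, but the API path is exercised
```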

Contributor

With fake_backend=True, is rank0 the only device that participates in the computation? Or does each rank compute its own stuff without actual comms?

Contributor Author

@tianyu-l All ranks will perform the computation locally, but it actually doesn't matter because the ranks don't talk to each other. For dry run mode, we always launch with only one rank (but fake it as if there are N ranks).

[ghstack-poisoned]
fegin added a commit that referenced this pull request Nov 19, 2025

ghstack-source-id: b024f8f
Pull-Request: #2057
@fegin changed the title from "[Local Tensor] Replace dry_run.py with local tensor mode implementation" to "[Local Tensor] Replace dry_run.py with fake mode implementation" on Nov 19, 2025
[ghstack-poisoned]
fegin added a commit that referenced this pull request Nov 20, 2025

ghstack-source-id: 27b8bad
Pull-Request: #2057
@fegin requested review from wconstab and wwwjn, November 20, 2025 02:16
self.loss_fn, self.gradient_accumulation_steps
)

# TODO(local_tensor): Remove this early return once LocalTensor supports
Contributor

pytorch/pytorch#166540 is merged, shall we remove this early return?

Contributor Author

There are still some gaps. I updated the comment.

try:
trainer = trainer_class(config)

# TODO(local_tensor): Remove this special case once LocalTensor supports
Contributor

similarly, can we remove this now?

Contributor Author

There are still some gaps. I updated the comment.

[ghstack-poisoned]
fegin added a commit that referenced this pull request Nov 20, 2025

ghstack-source-id: 5ea1d46
Pull-Request: #2057