[Local Tensor] Replace dry_run.py with fake mode implementation #2057
base: gh/fegin/44/base
Conversation
Replaces `dry_run.py` implementation with local tensor mode for DRY_RUN configuration validation. Local tensor mode provides deeper validation coverage, including `ParallelDims` creation, which the previous implementation could not verify. **Note:** Currently returns early before `init_weights()` due to a known limitation in local tensor mode. This still validates more of the pipeline than the previous approach. ghstack-source-id: b53ea9f Pull-Request: #2057
torchtitan/train.py
Outdated
# TODO(local_tensor): Remove this special case once LocalTensor supports
# init_weights(). In local tensor mode, skip training/checkpointing as the
# model is not fully initialized
if config.comm.local_tensor_mode:
Probably a naive question: what's the advantage of LocalTensor over FakeTensor if we're just running the parallelization setup code?
LocalTensor would do the actual compute and give correct results, but it would run (more slowly) on a single GPU since it simulates each rank's operations. FakeTensor would skip all the compute and run (more quickly, I think) on a single CPU.
Do we intend to run with numerics, or just smoke-test that we don't get API errors along the way to setup?
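The trade-off in this question can be sketched with a toy model (plain Python, not the real PyTorch `FakeTensorMode`/`LocalTensor` APIs): a "fake" evaluation only checks shapes and does no arithmetic, while a "local" evaluation actually computes every simulated rank's contribution on one host, paying roughly a world_size slowdown for correct numerics.

```python
# Toy contrast between shape-only ("fake") and fully simulated ("local")
# evaluation of an all-reduce. Purely illustrative; not PyTorch APIs.

def fake_allreduce(shapes):
    """Shape-only check: verifies every rank contributes the same shape.
    No values exist, so no arithmetic is performed."""
    assert all(s == shapes[0] for s in shapes), "shape mismatch across ranks"
    return shapes[0]  # only the result shape is known

def local_allreduce(per_rank_values):
    """Simulates all ranks on one process: real arithmetic, real result,
    but O(world_size) more compute than a single real rank."""
    return [sum(vals) for vals in zip(*per_rank_values)]

# 4 simulated ranks, each holding a length-3 "tensor"
ranks = [[1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3]]
print(fake_allreduce([len(r) for r in ranks]))  # validates shapes only: 3
print(local_allreduce(ranks))                   # real numerics: [4, 8, 12]
```

The fake path is enough to catch setup/API errors; only the local path could validate numerics.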
Do we intend to run with numerics or just smoke-test that we don't get API errors along the way to setup?
For DryRun mode that validates the configurations, the fake backend should be enough. But I also want to use this infra to enable some tests that do not require GPUs: running one step and verifying the output. We are putting all the different parallelism tests into integration tests, which require an H100 machine; the queuing time is going to become longer.
Also, DeviceMesh uses tensor operations, so if we want to verify DeviceMesh operations on all ranks, we will need LocalTensor. The fake backend only allows you to verify rank0's DeviceMesh behavior, though this should not be a big deal as we mostly do SPMD.
I rethought your comment. I think I should not couple dry run mode with local tensor mode; my intention was to use the latter for lightweight integration tests.
I added another option to enable ONLY fake PG mode. Local tensor mode depends on fake PG mode, and dry run only requires fake PG mode.
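As a sketch of how the two knobs relate after this change (field names are assumptions inferred from the discussion, not verified against the merged config):

```toml
[comm]
# Dry run only needs this: a fake process group, so collectives are not
# actually performed and no multi-GPU launch is required.
fake_backend = true

# Deeper simulation: every rank's compute and collectives are simulated on
# one device via LocalTensor. Depends on fake_backend being enabled.
local_tensor_mode = false
```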
OK, that makes sense! I like the idea of using LocalTensor mode for running actual numerics validation. It just seemed like overkill if used for dry run.
I think local tensor mode might help to some extent to debug numerics; however, I have hit numeric issues caused by DTensor or by missing communication. So we would recommend users debug in the following order: dry_run mode -> local tensor mode -> parallelism, right?
IMO, dry run mode is purely for debugging the setup phase of the trainer. The setup means the trainer configurations, the DeviceMesh setup, and the parallelism configurations. For an end user of TorchTitan, dry run mode is mostly useful to detect configuration errors before launching a large-scale training. For TorchTitan developers like us, dry run mode is useful as an early debugging signal when developing the trainer, components, and parallelisms (e.g., parallelize.py).
LocalTensor, on the other hand, is useful to actually check what happens during the forward and backward passes. While fake tensor can also help, it doesn't actually incur the communication and computation, which may hide issues, and when the computation involves a data-dependent op, it will fail. For example, you won't be able to debug CP load balancing with fake tensor.
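The data-dependent-op limitation can be shown with a toy sketch (plain Python, not PyTorch): an op like `nonzero()` has an output shape that depends on tensor *values*, so a shape-only "fake" evaluation has nothing to compute it from, while a "local" evaluation with real values does.

```python
# Toy sketch of why shape-only ("fake") evaluation fails on
# data-dependent ops. Illustrative only; not PyTorch classes.

class FakeVec:
    """Knows its length, not its contents."""
    def __init__(self, n):
        self.n = n
    def nonzero_count(self):
        # Output depends on values the fake tensor doesn't have.
        raise RuntimeError("data-dependent output shape: need real values")

class LocalVec:
    """Holds real values, so data-dependent shapes are computable."""
    def __init__(self, data):
        self.data = data
    def nonzero_count(self):
        return sum(1 for x in self.data if x != 0)

print("local mode:", LocalVec([0, 5, 0, 7]).nonzero_count())  # 2
try:
    FakeVec(4).nonzero_count()
except RuntimeError as e:
    print("fake mode fails:", e)
```

CP load balancing falls in this category because the work split depends on actual data, which is why it can't be debugged with fake tensor.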
I'm confused by the options here.
IIUC, fake backend doesn't mean fake tensor, right? What happens if local_tensor_mode=False but fake_backend=True?
Since this is user facing, I feel it might be clearer to organize options based on user intent, e.g., comm_mode = "dry" / "local", instead of providing multiple knobs which only function when combined properly.
fake_backend=True just means that we use the fake backend for communication. It doesn't have to be combined with local tensor mode if we don't care about accuracy: the computation will be done locally on rank0, but the collectives will not. For local_tensor_mode, all the computation (rank0 through rankN-1) will be done on rank0, and the collectives will be simulated as well.
I'm okay with combining the two together.
With fake_backend=True, is rank0 the only device which participates in computation? Or does each rank compute its own stuff without actual comms?
@tianyu-l All ranks will perform the computation locally, but it actually doesn't matter because the ranks don't talk to each other. For dry run mode, we always launch with one rank only (but fake it as if there were N ranks).
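That "one real process pretending to be a world of N" setup can be sketched as a toy (plain Python, not the real fake-process-group API): a single process records a rank and world_size, mesh-shape validation works because it needs only those two numbers, and collectives degenerate to returning the local value untouched.

```python
# Toy sketch of dry-run launching: one process, fake world of N.
# Illustrative only; not torch.distributed's fake backend.

class FakePG:
    def __init__(self, rank, world_size):
        self.rank, self.world_size = rank, world_size

    def all_reduce(self, x):
        # No peers actually exist; the local contribution comes back as-is.
        return x

pg = FakePG(rank=0, world_size=8)

# Configuration validation works: it only needs rank/world_size,
# e.g. checking that dp=2 x tp=4 tiles the world exactly.
assert pg.world_size % 4 == 0

# ...but numerics are wrong: a real 8-rank sum of 1.0 would give 8.0.
print(pg.all_reduce(1.0))  # 1.0
```

This is why the fake backend is sufficient for validating setup but not for verifying results.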
torchtitan/train.py
Outdated
        self.loss_fn, self.gradient_accumulation_steps
    )

# TODO(local_tensor): Remove this early return once LocalTensor supports
pytorch/pytorch#166540 is merged, shall we remove this early return?
There are still some gaps. I updated the comment.
try:
    trainer = trainer_class(config)

# TODO(local_tensor): Remove this special case once LocalTensor supports
similarly, can we remove this now?
There are still some gaps. I updated the comment.
Stack from ghstack (oldest at bottom):
Replaces the `dry_run.py` implementation with fake PG mode for DRY_RUN configuration validation. This PR also adds support for local tensor mode to provide deeper validation coverage. Note: it currently returns early before `init_weights()` if using local tensor mode, due to some limitations of local tensor which will be fixed by pytorch/pytorch#166540.