First checkpoint not being saved #19002
Unanswered
jwliu36 asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Replies: 0 comments
Hi team,
I am trying to save every checkpoint as well as `last.ckpt` by setting `save_last=True` and `save_top_k=-1`. I am also using `AsyncCheckpointIO` to upload checkpoints asynchronously to local and S3 file paths.

However, I am running into an issue where the first `checkpoint-{epoch}-{step}.ckpt` is not saved; only `last.ckpt` is created. As the training job goes on, all subsequent `checkpoint-{epoch}-{step}.ckpt` files are saved into the same directory.

Can you point me to the method in the `ModelCheckpoint` class that I may need to override? Would it be `_save_last_checkpoint` (code ref) or `_should_skip_saving_checkpoint` (code ref)? If it is another method, please point me to the reference. Thank you!