fix: use mtime by default in Trainer._rotate_checkpoints with automatic fallback #37260
+12
−3
What does this PR do?
This PR fixes an issue with checkpoint rotation in `transformers.Trainer`. When training with checkpoints saved every 100 steps and a maximum limit of 3 checkpoints, starting a new training session in the same output directory can cause a newly created checkpoint (e.g., `checkpoint-100`) to be mistakenly identified as the oldest and immediately deleted.

Detailed Problem Description
Training uses `Trainer` with `save_steps=100` and `save_total_limit=3`. The checkpoints `checkpoint-500`, `checkpoint-600`, and `checkpoint-700` are produced. When a new training run starts in the same output directory, its first checkpoint (`checkpoint-100`) has the lowest step number, so it is mistakenly identified as the oldest checkpoint and immediately deleted.

Proposed Solution
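The scenario can be reproduced outside `Trainer` with a small standalone sketch that contrasts ordering by step number with ordering by mtime. It is not the actual `_rotate_checkpoints` code: the helper names `sort_by_step`, `sort_by_mtime`, and `build_scenario` are hypothetical, the modification times are set artificially with `os.utime`, and the fallback behaviour of this PR is not modelled.

```python
import os
import re
import tempfile
from pathlib import Path

# Matches directory names such as "checkpoint-500".
CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def sort_by_step(output_dir):
    """Hypothetical helper: order checkpoints by the step number in the name."""
    found = []
    for path in Path(output_dir).iterdir():
        match = CHECKPOINT_RE.match(path.name)
        if match and path.is_dir():
            found.append((int(match.group(1)), path.name))
    return [name for _, name in sorted(found)]


def sort_by_mtime(output_dir):
    """Hypothetical helper: order checkpoints by directory modification time."""
    found = []
    for path in Path(output_dir).iterdir():
        if CHECKPOINT_RE.match(path.name) and path.is_dir():
            found.append((path.stat().st_mtime, path.name))
    return [name for _, name in sorted(found)]


def build_scenario():
    """Create the four checkpoints from the description and return both orderings."""
    with tempfile.TemporaryDirectory() as output_dir:
        # Earlier run: checkpoint-500/600/700. New run: checkpoint-100, written last.
        for step, mtime in [(500, 1000), (600, 2000), (700, 3000), (100, 4000)]:
            ckpt = Path(output_dir) / f"checkpoint-{step}"
            ckpt.mkdir()
            os.utime(ckpt, (mtime, mtime))  # artificial timestamps for the demo
        return sort_by_step(output_dir), sort_by_mtime(output_dir)


by_step, by_mtime = build_scenario()
# With a limit of 3, rotation removes the first entry of the ordered list.
print("deleted when ordering by step: ", by_step[0])
print("deleted when ordering by mtime:", by_mtime[0])
```

Ordering by step number puts the freshly written `checkpoint-100` first, so it is the one rotated out. Ordering by mtime puts `checkpoint-500` first instead.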
Update `_rotate_checkpoints` to use file modification time (mtime) by default for ordering checkpoints. This approach better reflects the actual creation order.

Related Issue
This PR is related to #26961 and #28862. In those issues, users observed that using `use_mtime=True` sometimes resulted in the unintended deletion of newer checkpoints. Although setting `use_mtime=False` could avoid these issues on certain filesystems, our solution defaults to `use_mtime=True` to accurately reflect checkpoint creation order, with an automatic fallback mechanism to ensure robustness when mtime is unreliable.

Before submitting