Description
Hi, I'm a new DVC user, and I'm struggling to find a reliable way to make sure I haven't forgotten to track any of my data. This has been very confusing, both on the CLI and in the VS Code plugin.
The Core Problem
I follow the standard practice of adding my data/ directory to .gitignore so that Git doesn't track my large data files.
My problem is when I add new files to this data/ directory and forget to track them with DVC.
Here is my "forgotten file" scenario:
- I have data/ listed in my .gitignore.
- I have a file data/good_dataset.csv which is correctly tracked by DVC (it's in data.dvc).
- Later, I download a new file, data/new_file.csv.
- I forget to run dvc add data/new_file.csv.
Now, data/new_file.csv is an "orphan" file. It's not tracked by Git (because of .gitignore) and it's not tracked by DVC (because I forgot).
What I Tried on the CLI
I thought I could find this "orphan" file using a status command:
- git status doesn't show it (because of .gitignore). This is expected.
- dvc data status --untracked-files also doesn't show it. It seems to respect my .gitignore file, so it doesn't even look inside the data/ folder for untracked files.
This means I have no way to "audit" my repository to find files that I should have either tracked with DVC or added to .dvcignore.
This Problem is Even Worse in the VS Code Plugin
I was hoping the official DVC VS Code extension would be the solution, but it's currently unusable for this. The "DVC SCM" view is very confusing for a new user:
- It Hides the "Orphan" Data Files: Just like the CLI, the plugin seems to respect .gitignore. My new, untracked data/new_file.csv does not show up in the plugin's file list. This means it fails to show me the exact files I'm looking for.
- It Lists All My Code Files: At the same time, the plugin does list all of my normal code files (like train.py) that are already tracked by Git. It incorrectly suggests that I should add all my source code to DVC, which is wrong and creates a lot of noise.
This is the opposite of what I need. The plugin fails to show the data I missed, but does incorrectly show the code I've correctly tracked with Git.
My "Beginner" Expectation
As a new user, my expectation is that DVC tools should help me find data files I've forgotten to dvc add. I want a clear list of files that are:
- Not tracked by Git
- Not tracked by DVC
- Not ignored by .dvcignore
It seems the only way to do this is to write a complex custom Python script that walks the whole file system and compares the output of git ls-files, dvc list, and dvc check-ignore, which seems very user-unfriendly.
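For reference, here is roughly the kind of script I mean. It is only a minimal sketch, not a recommended solution. The function names (run_lines, walk_files, find_orphans, audit) are my own, and I am assuming that git ls-files, dvc list . --recursive --dvc-only, and dvc check-ignore each print one repo-relative path per line with forward slashes. I have not verified that across DVC versions or platforms, and Git may quote paths that contain unusual characters.

```python
import os
import subprocess
from pathlib import Path


def run_lines(cmd, cwd):
    """Run a command and return its non-empty stdout lines as a set."""
    result = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
    return {line.strip() for line in result.stdout.splitlines() if line.strip()}


def walk_files(root):
    """Yield every file under root as a relative path with forward slashes."""
    root = Path(root)
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip the tools' own metadata directories.
        dirnames[:] = [d for d in dirnames if d not in {".git", ".dvc"}]
        for name in filenames:
            yield (Path(dirpath) / name).relative_to(root).as_posix()


def find_orphans(root, git_tracked, dvc_tracked, dvc_ignored=frozenset()):
    """Return the files on disk that appear in none of the three sets."""
    known = set(git_tracked) | set(dvc_tracked) | set(dvc_ignored)
    return sorted(p for p in walk_files(root) if p not in known)


def audit(root="."):
    """Hypothetical end-to-end audit; the output formats are assumptions."""
    git_tracked = run_lines(["git", "ls-files"], root)
    dvc_tracked = run_lines(
        ["dvc", "list", ".", "--recursive", "--dvc-only"], root
    )
    candidates = find_orphans(root, git_tracked, dvc_tracked)
    if not candidates:
        return []
    # Assumption: dvc check-ignore echoes back the paths that .dvcignore matches.
    ignored = run_lines(["dvc", "check-ignore", *candidates], root)
    return [p for p in candidates if p not in ignored]
```

Calling audit(".") from the repository root should, under those assumptions, return data/new_file.csv in my scenario. Having to maintain something like this by hand is exactly what I would like to avoid.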
Proposed Solution
It would be incredibly helpful if there were a command or a flag to find these "orphan" files without respecting .gitignore.
And ideally, the VS Code plugin's SCM view should be changed to reflect this. It should:
- Show untracked data files (even if in .gitignore).
- Hide code files that are already tracked by Git.
This would give me confidence that my repository is clean and I haven't forgotten to dvc add any new data. Thanks!