Feature Request: Find "orphan" files untracked by DVC, especially those in .gitignore (and fix VS Code plugin confusion) #10907

@frederikvand

Hi, I'm a new DVC user, and I'm struggling to find a reliable way to make sure I haven't forgotten to track any of my data. This has been very confusing, both on the CLI and in the VS Code plugin.

The Core Problem

I follow the standard practice of adding my data/ directory to .gitignore so that Git doesn't track my large data files.

My problem is when I add new files to this data/ directory and forget to track them with DVC.

Here is my "forgotten file" scenario:

  1. I have data/ listed in my .gitignore.
  2. I have a file data/good_dataset.csv which is correctly tracked by DVC (it's in data.dvc).
  3. Later, I download a new file, data/new_file.csv.
  4. I forget to run dvc add data/new_file.csv.

Now, data/new_file.csv is an "orphan" file. It's not tracked by Git (because of .gitignore) and it's not tracked by DVC (because I forgot).

What I Tried on the CLI

I thought I could find this "orphan" file using a status command:

  • git status doesn't show it (because of .gitignore). This is expected.
  • dvc data status --untracked-files also doesn't show it. It seems to respect my .gitignore file, so it doesn't even look inside the data/ folder for untracked files.

This means I have no way to "audit" my repository to find files that I should have either tracked with DVC or added to .dvcignore.
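For anyone hitting the same wall, one partial workaround is to ask Git directly for the files it is hiding. The sketch below is only an illustration: the helper names `parse_nul_paths` and `git_ignored_files` are made up for this example, and it relies on `git ls-files --others --ignored --exclude-standard`, which lists untracked files that match an ignore rule.

```python
import subprocess


def parse_nul_paths(raw):
    """Split NUL-separated `git ls-files -z` output into a sorted list of paths."""
    text = raw.decode("utf-8", errors="replace")
    return sorted(p for p in text.split("\0") if p)


def git_ignored_files(repo="."):
    """List untracked files that Git hides because an ignore rule matches them."""
    raw = subprocess.run(
        ["git", "ls-files", "--others", "--ignored", "--exclude-standard", "-z"],
        cwd=repo,
        capture_output=True,
        check=True,
    ).stdout
    return parse_nul_paths(raw)


if __name__ == "__main__":
    for path in git_ignored_files("."):
        print(path)
```

This is not an answer on its own: the output mixes files already tracked by DVC (such as data/good_dataset.csv) with true orphans, and will likely also include DVC's internal cache files, so it still needs to be filtered against what DVC knows about.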

This Problem is Even Worse in the VS Code Plugin

I was hoping the official DVC VS Code extension would be the solution, but it's currently unusable for this. The "DVC SCM" view is very confusing for a new user:

  1. It Hides the "Orphan" Data Files: Just like the CLI, the plugin seems to respect .gitignore. My new, untracked data/new_file.csv does not show up in the plugin's file list. This means it fails to show me the exact files I'm looking for.
  2. It Lists All My Code Files: At the same time, the plugin does list all of my normal code files (like train.py) that are already tracked by Git. This suggests that I should add my source code to DVC, which is wrong and creates a lot of noise.

This is the opposite of what I need. The plugin hides the data I missed but shows the code I've already tracked correctly with Git.

My "Beginner" Expectation

As a new user, my expectation is that DVC tools should help me find data files I've forgotten to dvc add. I want a clear list of files that are:

  • Not tracked by Git
  • Not tracked by DVC
  • Not ignored by .dvcignore

It seems the only way to do this is to write a custom Python script that walks the whole repository and compares the result against git ls-files, dvc list, and dvc check-ignore, which is not user-friendly.
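To show what I mean, here is a rough sketch of such a script. It is a sketch under assumptions, not a tested tool: the function names (`walk_files`, `find_orphans`, `audit`, and so on) are invented for this example, it assumes `dvc list . --recursive --dvc-only` prints one repo-relative path per line, and it assumes `dvc check-ignore` prints the targets that match .dvcignore. Path separators may need adjusting on Windows.

```python
import os
import subprocess
from pathlib import Path


def walk_files(root):
    """Return repo-relative POSIX paths of all files, skipping .git and .dvc internals."""
    root = Path(root)
    found = set()
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune VCS internals in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in {".git", ".dvc"}]
        for name in filenames:
            found.add((Path(dirpath) / name).relative_to(root).as_posix())
    return found


def run_lines(cmd, cwd, check=True):
    """Run a command and return its non-empty stdout lines as a set."""
    proc = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True, check=check)
    return {line.strip() for line in proc.stdout.splitlines() if line.strip()}


def find_orphans(all_files, git_tracked, dvc_tracked, dvc_ignored=frozenset()):
    """Pure set logic: files not in Git, not in DVC, and not matched by .dvcignore."""
    return sorted(set(all_files) - set(git_tracked) - set(dvc_tracked) - set(dvc_ignored))


def audit(repo="."):
    """Combine the three sources; requires git and dvc on PATH."""
    git_tracked = run_lines(["git", "ls-files"], repo)
    # Assumption: one tracked path per line, relative to the repo root.
    dvc_tracked = run_lines(["dvc", "list", ".", "--recursive", "--dvc-only"], repo)
    candidates = find_orphans(walk_files(repo), git_tracked, dvc_tracked)
    if not candidates:
        return []
    # Assumption: check-ignore exits non-zero when nothing matches, hence check=False.
    # Very large repositories may need to pass the candidates in batches.
    ignored = run_lines(["dvc", "check-ignore", *candidates], repo, check=False)
    return find_orphans(candidates, set(), set(), ignored)


if __name__ == "__main__":
    for path in audit("."):
        print(path)
```

In my scenario I would expect this to report data/new_file.csv and nothing else, but needing three external commands and a file system walk for such a basic check is exactly the friction I am describing.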

Proposed Solution

It would be incredibly helpful if there were a command or flag that finds these "orphan" files without respecting .gitignore.

And ideally, the VS Code plugin's SCM view should be changed to reflect this. It should:

  • Show untracked data files (even if in .gitignore).
  • Hide code files that are already tracked by Git.

This would give me confidence that my repository is clean and I haven't forgotten to dvc add any new data. Thanks!
