"dvc status" checks for remote updates in git imports when checking for workspace changes #4983

Closed
zhiltsov-max opened this issue Nov 27, 2020 · 4 comments

Comments

@zhiltsov-max

Bug Report

Reproduce:

dvc init
dvc import <git repo>
dvc status

The status call downloads the repository into a temporary folder and compares the workspace against it, instead of comparing the workspace against the cache (as the documentation describes). This can be fixed by changing this line to use check_updates=False, but I'm not sure whether the current behavior is intended. Based on the documentation, I would expect the status command to download anything only when working in cloud or remote mode.

Please provide information about your setup

Output of dvc version:

DVC version: 1.10.2 (pip)
---------------------------------
Platform: Python 3.6.5 on Linux-4.15.0-99-generic-x86_64-with-debian-buster-sid
Supports: http, https, ssh
Cache types: hardlink, symlink
Caches: local
Remotes: local, local
Repo: dvc, git
@pmrowla
Contributor

pmrowla commented Nov 28, 2020

If you are referring to how status will do a shallow clone of the source git repo, this is expected behavior for imports and external dependencies. status has to examine (clone) the source repo in order to determine whether or not the imported file has been updated in the source repo.

In that situation status will report something like

a.txt.dvc
        changed deps:
                update available: a.txt (<src git repo URL>)

Essentially, the locally cached version of a.txt is up to date with the workspace here, but the workspace itself is behind the source repo and needs to be updated.

This is different from determining whether files need to be pushed to or pulled from any remotes configured in the current DVC repo (which is what status does in cloud/remote mode).
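To illustrate the kind of check described above, here is a rough sketch, not DVC's actual implementation. It assumes the import's .dvc file has been loaded into a dict whose deps entries carry a repo mapping with url and rev_lock keys, and the helper names (find_import_updates, git_ls_remote) are made up for this example. It is also a simplification: it only compares commit IDs via git ls-remote instead of cloning, so it may report an update even when the imported file itself did not change.

```python
import subprocess
from typing import Callable, Dict, List


def git_ls_remote(url: str, rev: str = "HEAD") -> str:
    """Ask the source repo which commit `rev` currently points to (needs network access)."""
    out = subprocess.check_output(
        ["git", "ls-remote", url, rev], universal_newlines=True
    )
    fields = out.split()
    # The first field of the first output line is the commit SHA; empty if rev was not found.
    return fields[0] if fields else ""


def find_import_updates(
    stage: Dict, resolve_rev: Callable[[str, str], str] = git_ls_remote
) -> List[str]:
    """Return paths of imported deps whose upstream commit differs from the locked one."""
    outdated = []
    for dep in stage.get("deps", []):
        repo = dep.get("repo")
        if not repo:
            continue  # a regular dependency, not an import
        # Assumed layout: repo = {"url": ..., "rev_lock": ..., optional "rev": ...}
        upstream = resolve_rev(repo["url"], repo.get("rev", "HEAD"))
        if upstream != repo.get("rev_lock"):
            outdated.append(dep["path"])
    return outdated
```

The point of the sketch is only that this check has to contact the source repo in some form; the local cache alone cannot say whether upstream has moved on.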

@pmrowla pmrowla closed this as completed Nov 28, 2020
@efiop
Contributor

efiop commented Nov 28, 2020

We do plan on adding caching for these git repos (#3496) in the future.

@zhiltsov-max
Author

zhiltsov-max commented Nov 28, 2020

> status has to examine (clone) the source repo in order to determine whether or not the imported file has been updated in the source repo.

According to the documentation, status has to compare the workspace against the cache:

> Comparisons are made between data files in the workspace and corresponding files in the cache directory (e.g. .dvc/cache)

I would expect it to check file existence and hashes, and compare them to the saved ones in .dvc files, like git status does.

@pmrowla
Contributor

pmrowla commented Nov 28, 2020

> I would expect it to check file existence and hashes, and compare them to the saved ones in .dvc files, like git status does.

This is what dvc status does for regular tracked files, whether they are added via dvc add or are repro/run pipeline outputs. But imported/external files are handled differently, since it makes sense for status to also check whether the import recorded in the .dvc file is still up to date with its source.
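For comparison, the purely local check for regular tracked files could look roughly like the sketch below. This is an illustration only, not DVC's code: it assumes the .dvc file records a plain MD5 of the file contents, the function names and status strings are made up for this example, and DVC's real hashing may differ in details (for example in how line endings are treated).

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1024 * 1024):
    """Hash the file in chunks so large data files are not read into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def local_status(path, recorded_md5):
    """Compare a workspace file against the hash recorded for it (illustrative labels)."""
    p = Path(path)
    if not p.exists():
        return "deleted"
    if file_md5(p) != recorded_md5:
        return "modified"
    return "up to date"
```

Nothing here touches the network, which is why this part of status stays fast and offline, while the import check needs access to the source repo.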
