fetch: don't re-download latest imports #4597

Closed
jorgeorpinel opened this issue Sep 22, 2020 · 2 comments
Labels
awaiting response we are waiting for your reply, please respond! :) question I have a question?

Comments

@jorgeorpinel
Contributor

Bug Report

Apparently imports are re-downloaded by fetch/pull (BTW this is undocumented, see iterative/dvc.org/issues/1792). I tested this and it seems even if you have the latest data in cache, the repo is cloned and the files are downloaded, overwriting the cached version. This seems unnecessary and could be an issue for large files/dirs. Shouldn't DVC be able to tell you already have the latest version without downloading the file? We have the commit hash, md5, etag, and checksum fields for that.

I'm not sure that this is the case, though. I tested with both Git-tracked and DVC-tracked files, and the verbose output seems to indicate that the data file is downloaded, at least for Git imports.

Git imports

λ dvc fetch -v
2020-09-22 14:10:43,975 DEBUG: Check for update is enabled.
2020-09-22 14:10:44,064 DEBUG: Trying to spawn '['daemon', '-q', 'updater']'
2020-09-22 14:10:47,941 DEBUG: Spawned '['daemon', '-q', 'updater']'
2020-09-22 14:10:47,951 DEBUG: fetched: [(3,)]
2020-09-22 14:10:48,020 DEBUG: Creating external repo ../test@35117e1c2c941edf8e50511b9f69b3f848497846
2020-09-22 14:10:48,026 DEBUG: erepo: git clone '../test' to a temporary dir
2020-09-22 14:10:49,164 DEBUG: Saving '..\..\AppData\Local\Temp\tmpv7lxhlb9dvc-clone\code' to '.dvc\cache\78\44a93ad4b97169834dade975b5beff'.
2020-09-22 14:10:49,169 DEBUG: Assuming 'C:\Users\poj12\DVC-repos\test2\.dvc\cache\78\44a93ad4b97169834dade975b5beff' is unchanged since it is read-only
2020-09-22 14:10:49,246 DEBUG: fetched: [(6,)]
Everything is up to date.

The DEBUG: erepo: git clone and DEBUG: Saving messages seem to indicate that the source repo was cloned and the Git-tracked file was overwritten in the cache.

However, the DEBUG: Assuming message only appears if the file is already in the cache, and I can't tell exactly what it means. If I remove the file from the cache and workspace first, I get DEBUG: cache '...5beff' expected 'HashInfo(name='md5', value='...beff', dir_info=None)' actual 'None' instead.
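For anyone who wants to check by hand whether a hash is already in the cache: the cache paths in the log above suggest that the first two hex characters of the md5 name a subdirectory and the remaining characters name the file. Below is a minimal Python sketch under that assumption. The helper names `cache_path` and `is_cached` are hypothetical and not part of DVC's API, and the check only tests for existence; it does not verify the file's content.

```python
from pathlib import Path


def cache_path(cache_dir, md5):
    """Map an md5 hex digest to <cache_dir>/<first 2 chars>/<remaining chars>.

    Assumed layout, inferred from the paths shown in the debug log.
    """
    return Path(cache_dir) / md5[:2] / md5[2:]


def is_cached(cache_dir, md5):
    """Return True if a file for this digest exists in the cache directory.

    Existence check only; the content is not re-hashed.
    """
    return cache_path(cache_dir, md5).is_file()
```

For example, `is_cached(".dvc/cache", "7844a93ad4b97169834dade975b5beff")` would look for the same file that the Git import log refers to.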

DVC imports

λ dvc fetch -v
2020-09-22 14:39:51,369 DEBUG: Check for update is enabled.
2020-09-22 14:39:51,443 DEBUG: Trying to spawn '['daemon', '-q', 'updater']'
2020-09-22 14:39:54,944 DEBUG: Spawned '['daemon', '-q', 'updater']'
2020-09-22 14:39:54,963 DEBUG: fetched: [(3,)]
2020-09-22 14:39:54,993 DEBUG: Creating external repo ../test2@7c029062ba91f9f8d30dbd30dd21ae3c762cc5b1
2020-09-22 14:39:55,000 DEBUG: erepo: git clone '../test2' to a temporary dir
2020-09-22 14:39:56,930 DEBUG: Saving '..\..\AppData\Local\Temp\tmp41v5wmkldvc-clone\data' to '.dvc\cache\61\37cde4893c59f76f005a8123d8e8e6'.
2020-09-22 14:39:56,938 DEBUG: Assuming 'C:\Users\poj12\DVC-repos\test\.dvc\cache\61\37cde4893c59f76f005a8123d8e8e6' is unchanged since it is read-only
2020-09-22 14:39:57,038 DEBUG: fetched: [(2,)]

Again, it seems to indicate that the file is saved from the tmp repo clone — which seems weird since it's not tracked by Git... The -v output is quite different if I remove the data from workspace and cache first, and actually mentions downloading from the remote, so I'm not sure what's happening here.

Maybe it's just a confusion on my part, not understanding the debug output. Just wanted to double-check. Sorry for extra long question!

Somewhat related: #2599

Please provide information about your setup

DVC 1.7.2 (exe) on Win

@efiop
Contributor

efiop commented Sep 24, 2020

Looks like you are simply confused by the debug message itself:

logger.debug("Saving '%s' to '%s'.", path_info, to_info)

It is shown even if the cache is already populated, which is indicated by the "Assuming" message below it.
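To illustrate the behavior described here, this is a simplified Python sketch of a save step that logs "Saving" unconditionally and only afterwards decides whether anything needs to be copied. It is not DVC's actual implementation: `save_to_cache` is a hypothetical function, and treating "the destination already exists" as the skip condition is an assumption made for illustration.

```python
import logging
import os
import shutil

logger = logging.getLogger(__name__)


def save_to_cache(src, dst):
    """Hypothetical save step: log first, then skip the copy if dst exists.

    Returns True if the file was copied, False if the cache entry was reused.
    """
    # Logged unconditionally, so it appears even when nothing is written.
    logger.debug("Saving '%s' to '%s'.", src, dst)

    # Assumption for this sketch: an existing entry is trusted as-is.
    if os.path.exists(dst):
        logger.debug("Assuming '%s' is unchanged since it is read-only", dst)
        return False

    parent = os.path.dirname(dst)
    if parent:
        os.makedirs(parent, exist_ok=True)
    shutil.copyfile(src, dst)
    return True
```

With this ordering, a second call for the same destination still emits the "Saving" line but leaves the existing file untouched, which matches the pair of messages in the logs above.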

Git repos are cloned each time, yes. It is a known issue, e.g. #3496

@efiop efiop added the awaiting response we are waiting for your reply, please respond! :) label Sep 24, 2020
@jorgeorpinel
Contributor Author

OK, just wanted to double check. Thanks
