
Equivalent of datalad install with datalad crawler #114

Open
N-HEDGER opened this issue Nov 15, 2021 · 1 comment
Labels
question Further information is requested

Comments

@N-HEDGER

Hi there,

Great package. I have a question about how to replicate a call to datalad install with the datalad crawler.

My understanding of datalad install is that only small files are downloaded and the directory structure is mirrored locally; large files can then be downloaded with subsequent calls to datalad get.

With the datalad crawler, I can do something like the below. With the 'drop_immediately' argument, it appears to download all of the files and subsequently 'drop' them. So I end up with what I want, but only after having gone through the lengthy process of downloading all of the files.

import datalad.api as dl

dl.crawl_init(template=self.template, save=True, args=[self.bucketstring, dstring, 'drop_immediately=True'])
dl.crawl()

However, is there a way of replicating a datalad install call, so that the directory structure and small files are mirrored on my system, but no large files are downloaded until I call datalad get? It seems like this should be possible, but I am unable to find documentation on it. Alternatively, there could be some technical reason, which I am not appreciating, why all the data needs to be downloaded.

I appreciate your time.

@yarikoptic
Member

  • install (or clone) clones a git/git-annex repository (a DataLad dataset) that someone has already populated and possibly pushed to some other location (an ssh server, GitHub, GIN, etc.).
  • crawl_init + crawl populates a dataset that typically starts out empty, so that it can later be pushed to some other location for sharing, from which it can then be installed/cloned. This is how many datasets on datasets.datalad.org were "populated" and later updated, e.g. those under http://datasets.datalad.org/?dir=/crcns or http://datasets.datalad.org/?dir=/indi (see the sketch below).
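For illustration, a minimal sketch of the two workflows, assuming the datalad-crawler extension is installed; the dataset paths, the bucket name, and the simple_s3 template arguments are placeholders chosen for this example, not taken from the question:

import os
import datalad.api as dl

# Workflow 1: install/clone an already-populated dataset from a published
# location; only the git history and small files kept in git arrive, and
# large annexed content is fetched later on demand with `get`.
ds = dl.install(path='crcns', source='http://datasets.datalad.org/crcns')
# ds.get('some/large/file')  # fetch a particular annexed file when needed

# Workflow 2: populate a fresh, initially empty dataset with the crawler,
# then push it somewhere so others can install/clone it.  crawl_init and
# crawl operate on the current dataset, so change into it first.
dl.create(path='mydataset')
os.chdir('mydataset')
dl.crawl_init(template='simple_s3', save=True, args=['bucket=mybucket'])
dl.crawl()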

Why download at all: to establish checksum-based git-annex keys, so there is a guarantee of data integrity, etc. crawl by default operates in the default git-annex mode, which does download the content and compute checksums; with drop_immediately it drops the content afterwards to save space locally.
The venerable Annexificator in datalad-crawler has a mode option which can take "fast" and thus use git annex addurl --fast, which does not download the content but uses the URL (plus the size) as the value for the key. But then you lose the checksum-based integrity guarantee, and I am not sure we ever exposed it conveniently in datalad-crawler... If you have a list/table of URLs you care about, you could use datalad addurls (or dl.addurls), where fast mode is exposed, I believe.
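If you do have such a table of URLs, a minimal sketch with dl.addurls could look like the following; the urls.csv file and its filename/url columns are hypothetical, and fast=True is assumed to correspond to git annex addurl --fast, i.e. keys derived from the URL and size rather than from a checksum:

import datalad.api as dl

# urls.csv is a hypothetical table with one row per file, e.g.:
#   filename,url
#   sub-01/anat.nii.gz,https://example.com/sub-01/anat.nii.gz
dl.create(path='mydataset')
dl.addurls(dataset='mydataset',
           urlfile='urls.csv',
           urlformat='{url}',
           filenameformat='{filename}',
           fast=True)   # register the URLs without downloading any content

# The directory structure now exists locally; file content can be obtained
# later, file by file, with datalad get / dl.get().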

hope this helps

@yarikoptic added the question label on Nov 15, 2021