
Equivalent of datalad install with datalad crawler #114

Open
N-HEDGER opened this issue Nov 15, 2021 · 1 comment
Labels
question Further information is requested

Comments

@N-HEDGER

Hi there,

Great package. I have a question about how to replicate a call to datalad install with the datalad crawler.

My understanding of datalad install is that only small files are downloaded and the directory structure is mirrored locally; large files can then be downloaded with subsequent calls to datalad get.

With the datalad crawler, I can do something like the below. With the 'drop_immediately' argument, it appears to download all of the files and subsequently 'drop' them. So I end up with what I want, but only after having gone through the lengthy process of downloading all of the files.

import datalad.api as dl

dl.crawl_init(template=self.template, save=True, args=[self.bucketstring, dstring, 'drop_immediately=True'])
dl.crawl()

However, is there a way of replicating a datalad install call, so that the directory structure and small files are mirrored on my system, but no large files are downloaded until I call datalad get? It seems like this should be possible, but I am unable to find documentation on it. Alternatively, there could be some technical reason, which I am not appreciating, why all the data needs to be downloaded.

I appreciate your time.

@yarikoptic
Member

  • install (or clone) clones a git/git-annex repository (a DataLad dataset) that someone has already populated and possibly pushed to some other location (an ssh server, GitHub, GIN, etc.).
  • crawl_init + crawl populates a dataset that typically starts out empty, so that it can later be pushed to some other location for sharing, from which it can then be installed/cloned. This is how many datasets on datasets.datalad.org were "populated" and later updated, e.g. those under http://datasets.datalad.org/?dir=/crcns or http://datasets.datalad.org/?dir=/indi (see the sketch below).
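For illustration, a minimal sketch of the two workflows, assuming the datalad-crawler extension is installed; the dataset paths, the bucket name, and the simple_s3 template arguments are placeholders chosen for this example, not taken from the question:

import os
import datalad.api as dl

# Workflow 1: install/clone an already-populated dataset from a published
# location; only the git history and small files kept in git arrive, and
# large annexed content is fetched later on demand with `get`.
ds = dl.install(path='crcns', source='http://datasets.datalad.org/crcns')
# ds.get('some/large/file')  # fetch a particular annexed file when needed

# Workflow 2: populate a fresh, initially empty dataset with the crawler,
# then push it somewhere so others can install/clone it.  crawl_init and
# crawl operate on the current dataset, so change into it first.
dl.create(path='mydataset')
os.chdir('mydataset')
dl.crawl_init(template='simple_s3', save=True, args=['bucket=mybucket'])
dl.crawl()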

Why download at all: to establish checksum-based git-annex keys, so there is a guarantee of data integrity, etc. crawl by default operates in the default git-annex mode, which does download the content and compute checksums; with drop_immediately it drops the content afterwards to save space locally.
The venerable Annexificator in datalad-crawler has a mode option which can take "fast" and thus use git annex addurl --fast, which does not download the content but uses the URL (plus the size) as the value for the key. But then you lose the checksum-based integrity guarantee, and I am not sure we ever exposed it conveniently in datalad-crawler... If you have a list/table of URLs you care about, you could use datalad addurls (or dl.addurls), where fast mode is exposed, I believe.
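If you do have such a table of URLs, a minimal sketch with dl.addurls could look like the following; the urls.csv file and its filename/url columns are hypothetical, and fast=True is assumed to correspond to git annex addurl --fast, i.e. keys derived from the URL and size rather than from a checksum:

import datalad.api as dl

# urls.csv is a hypothetical table with one row per file, e.g.:
#   filename,url
#   sub-01/anat.nii.gz,https://example.com/sub-01/anat.nii.gz
dl.create(path='mydataset')
dl.addurls(dataset='mydataset',
           urlfile='urls.csv',
           urlformat='{url}',
           filenameformat='{filename}',
           fast=True)   # register the URLs without downloading any content

# The directory structure now exists locally; file content can be obtained
# later, file by file, with datalad get / dl.get().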

hope this helps

@yarikoptic added the question label on Nov 15, 2021