Hi there,

Great package. I have a question about how to replicate a call to 'datalad install' with the datalad crawler.
My understanding of 'datalad install' is that only the small files are downloaded and the directory structure is mirrored locally, while large files can be fetched later with subsequent calls to 'datalad get'.
With the datalad crawler, I can do something like the call sketched below. With the 'drop_immediately' argument, it appears to download all of the files and then 'drop' them. So I end up with what I want, but only after going through the lengthy process of downloading every file.
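Roughly, the kind of call I mean looks like the sketch below (the template name and its arguments are placeholders for illustration, not the exact pipeline I use):

```sh
# Illustrative sketch only; <some-template> and its arguments stand in
# for whatever crawling pipeline is actually configured.
datalad create my-crawled-ds
cd my-crawled-ds
datalad crawl-init --save --template <some-template> drop_immediately=True
# This downloads every file, checksums it, and then drops the content.
datalad crawl
```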
However, is there a way to replicate a 'datalad install' call, so that the directory structure and small files are mirrored on my system, but no large files are downloaded until I call 'datalad get'? It seems like this should be possible, but I am unable to find documentation on it. Alternatively, there may be some technical reason why all the data needs to be downloaded that I am not appreciating.
I appreciate your time.
install (or clone) clones a git/git-annex repository (a datalad dataset) that has already been populated by someone and pushed to some other location (an ssh server, github, gin, etc.).
crawl_init + crawl populates a dataset that typically starts out empty, so that it can later be pushed to some other location for sharing, from which it can in turn be installed/cloned. This is how many datasets on datasets.datalad.org were "populated" and later updated, e.g. those under http://datasets.datalad.org/?dir=/crcns or http://datasets.datalad.org/?dir=/indi
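For example, a typical crawler-based workflow (dataset, sibling, and path names below are made up) would be something like:

```sh
# Producer side: populate a fresh dataset with the crawler, then publish it.
datalad create ds && cd ds
datalad crawl-init --save --template <some-template>
datalad crawl
datalad push --to <some-sibling>       # e.g. an ssh server, github, gin

# Consumer side: install/clone only fetches git history and key metadata,
# not the annexed file content.
datalad install <url-of-published-dataset>
cd <dataset>
datalad get some/large/file            # fetch content on demand
```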
As for why it downloads: downloading is needed to establish checksum-based git-annex keys, so there is a guarantee of data integrity, etc. By default, crawl operates in the default git-annex mode, which does download each file and compute its checksum; with drop_immediately it then drops the content so as to save space locally.
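To make the difference concrete, here is roughly what the two kinds of keys look like (the file name, size, and hash are made up):

```sh
# Checksum-backed key (content must be downloaded once to compute the md5):
#   MD5E-s1048576--<md5-of-content>.nii.gz
# URL-backed key from 'git annex addurl --fast' (no download; the key is
# derived from the url, and the size when known, so there is no checksum
# to later verify the content against):
#   URL--http&c%%example.com%data%file.nii.gz
git annex addurl --fast http://example.com/data/file.nii.gz
```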
The elderly Annexificator in datalad-crawler has a mode option which can take "fast" and thus use git annex addurl --fast, which does not download the content but uses the URL (together with the size) as the basis for the key. But then you lose the data integrity guarantee, and I am not sure we ever exposed that mode conveniently in datalad-crawler... If you have a list/table of the URLs you care about, you could use datalad addurls (or dl.addurls), where fast mode is exposed, I believe.
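A minimal sketch of that addurls route, assuming a small CSV listing the URLs (the column names and paths below are just an example):

```sh
# urls.csv contents:
#   filename,url
#   sub-01/anat.nii.gz,https://example.com/data/sub-01/anat.nii.gz
datalad create addurls-ds && cd addurls-ds
# --fast registers each url under its key without downloading the content;
# 'datalad get' can fetch it later on demand.
datalad addurls --fast urls.csv '{url}' '{filename}'
```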