replace URLs with versioned urls where possible since some are 'disappearing' already #15

Open
1 task done
yarikoptic opened this issue Mar 23, 2018 · 7 comments

@yarikoptic

What would you like to do:

  • Report an issue

While preparing a datalad dataset we ran into a bunch of URLs 404ing since they were deleted in the bucket. But it seems the bucket became versioned after they were added and before they were removed, so possibly those versions (or some other versions) are still available if a null version id is provided, e.g.

$> wget -S 'http://fcp-indi.s3.amazonaws.com/data/Projects/CORR/Outputs/IBA_TRT/freesurfer/0027256-session_2/mri/T1.mgz?versionId=null' 
--2018-03-23 09:06:04--  http://fcp-indi.s3.amazonaws.com/data/Projects/CORR/Outputs/IBA_TRT/freesurfer/0027256-session_2/mri/T1.mgz?versionId=null
Resolving fcp-indi.s3.amazonaws.com (fcp-indi.s3.amazonaws.com)... 52.216.133.139
Connecting to fcp-indi.s3.amazonaws.com (fcp-indi.s3.amazonaws.com)|52.216.133.139|:80... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 200 OK
  x-amz-id-2: jv1iiXrsK4IGUiRUAESIivfdxWabFalvSyDeW5SeHN0fpqfqY21l50xXf81cqvEsso8sBd8UOVA=
  x-amz-request-id: FA464ADD438B76F3
  Date: Fri, 23 Mar 2018 13:06:05 GMT
  Last-Modified: Mon, 17 Oct 2016 19:49:07 GMT
  ETag: "f71962c9688a8cc17e4e6ddff40c1946"
  x-amz-version-id: null
  Accept-Ranges: bytes
  Content-Type: application/octet-stream
  Content-Length: 3777778
  Server: AmazonS3
Length: 3777778 (3,6M) [application/octet-stream]
Saving to: ‘T1.mgz?versionId=null’

T1.mgz?versionId=null                                    100%[================================================================================================================================>]   3,60M  1,21MB/s    in 3,0s    

2018-03-23 09:06:07 (1,21 MB/s) - ‘T1.mgz?versionId=null’ saved [3777778/3777778]

$> wget -S 'http://fcp-indi.s3.amazonaws.com/data/Projects/CORR/Outputs/IBA_TRT/freesurfer/0027256-session_2/mri/T1.mgz'               
--2018-03-23 09:13:40--  http://fcp-indi.s3.amazonaws.com/data/Projects/CORR/Outputs/IBA_TRT/freesurfer/0027256-session_2/mri/T1.mgz
Resolving fcp-indi.s3.amazonaws.com (fcp-indi.s3.amazonaws.com)... 54.231.33.131
Connecting to fcp-indi.s3.amazonaws.com (fcp-indi.s3.amazonaws.com)|54.231.33.131|:80... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 404 Not Found
  x-amz-request-id: D524E765315BF904
  x-amz-id-2: A49WIVJJZJB+N92BpqNIiSt75osl29SojPLHKzvgX1XPZRumO+43YGBjwwfPSEYWrTBCBwmxqX4=
  x-amz-delete-marker: true
  x-amz-version-id: ZT77s.ror9NN7Yt7bjGtH5h36leBw8Yp
  Content-Type: application/xml
  Transfer-Encoding: chunked
  Date: Fri, 23 Mar 2018 13:13:39 GMT
  Server: AmazonS3
2018-03-23 09:13:40 ERROR 404: Not Found.
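The two responses above differ in one telling header: the plain URL returns a 404 with `x-amz-delete-marker: true`, while the `versionId=null` request returns 200. A minimal sketch of telling these cases apart from a status code and a header mapping is below. The function name and the three labels are made up for illustration, not part of any existing tool.

```python
def classify_s3_response(status, headers):
    """Classify an S3 response from its status code and headers.

    Hypothetical helper: the labels it returns are illustrative only.
    """
    # HTTP header names are case-insensitive, so normalize them first.
    h = {k.lower(): v for k, v in headers.items()}
    if status == 200:
        return "available"
    if status == 404 and h.get("x-amz-delete-marker", "").lower() == "true":
        # The key was deleted in a versioned bucket; an older version
        # (possibly the "null" one) may still be retrievable.
        return "deleted-but-versioned"
    return "missing"
```

A URL classified as `deleted-but-versioned` would be a candidate for retrying with an explicit `versionId`.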

Since many URLs come from the versioned fcp-indi bucket, I wondered if it would be good to remove ambiguity and make access more robust (unless the bucket gets removed/recreated, which would invalidate the versionIds) by replacing URLs with versioned URLs, like
http://fcp-indi.s3.amazonaws.com/data/Projects/BGSP/orig_bids/sub-1435/ses-01/anat/sub-1435_ses-01_T1w.nii.gz?versionId=ZzwCQ1fzDpWfUZzNvVGqwAONQ_QL.eI9
instead of
http://fcp-indi.s3.amazonaws.com/data/Projects/BGSP/orig_bids/sub-1435/ses-01/anat/sub-1435_ses-01_T1w.nii.gz . datalad ls could be of help here:

$> datalad ls -aL s3://fcp-indi/data/Projects/BGSP/orig_bids/sub-1435/ses-01/anat/sub-1435_ses-01_T1w.nii.gz                                                                          
Connecting to bucket: fcp-indi
[INFO   ] S3 session: Connecting to the bucket fcp-indi 
Bucket info:
  Versioning: S3ResponseError: 403 Forbidden
     Website: S3ResponseError: 403 Forbidden
         ACL: S3ResponseError: 403 Forbidden
data/Projects/BGSP/orig_bids/sub-1435/ses-01/anat/sub-1435_ses-01_T1w.nii.gz 2016-12-04T13:20:43.000Z 4853715 ver:ZzwCQ1fzDpWfUZzNvVGqwAONQ_QL.eI9  acl:AccessDenied  http://fcp-indi.s3.amazonaws.com/data/Projects/BGSP/orig_bids/sub-1435/ses-01/anat/sub-1435_ses-01_T1w.nii.gz?versionId=ZzwCQ1fzDpWfUZzNvVGqwAONQ_QL.eI9 [OK]
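Once a version id is known (for example from the `ver:` field in the listing above), attaching it to a plain URL is a small string operation. The sketch below uses only the Python standard library; `with_version_id` is a hypothetical helper name, and it assumes the version id is already known rather than looking it up.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_version_id(url, version_id):
    """Return url with its versionId query parameter set to version_id."""
    parts = urlsplit(url)
    # Drop any existing versionId so the result carries exactly one.
    query = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k != "versionId"
    ]
    query.append(("versionId", version_id))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Replacing rather than appending the parameter means a URL that already carries `?versionId=null` can be re-pinned to a concrete version without ending up with two conflicting values.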
@yarikoptic
Author

a few more besides CORR and BGSP

addurls(error): /mnt/btrfs/datasets/datalad/crawl/labs/openneurolab/metasearch/abide_initiative/sub-50993/ses-1/T1_rep-0.mgz (file) [AnnexBatchCommandError: command 'addurl'
Error, annex reported failure for addurl (url='https://s3.amazonaws.com/fcp-indi/data/Projects/ABIDE_Initiative/Outputs/freesurfer/5.1/NYU_0050993/mri/T1.mgz'): {'command': 'addurl', 'file': None, 'success': False} [annexrepo.py:add_url_to_file:2086]]
addurls(error): /mnt/btrfs/datasets/datalad/crawl/labs/openneurolab/metasearch/gsp/sub-Sub0001_Ses1/ses-1/sub-0001_ses-01_T1w_rep-0.nii.gz (file) [AnnexBatchCommandError: command 'addurl'
Error, annex reported failure for addurl (url='https://s3.amazonaws.com/fcp-indi/data/Projects/BrainGenomicsSuperstructProject/orig_bids/sub-0001/ses-01/anat/sub-0001_ses-01_T1w.nii.gz'): {'command': 'addurl', 'file': None, 'success': False} [annexrepo.py:add_url_to_file:2086]]
addurls(error): /mnt/btrfs/datasets/datalad/crawl/labs/openneurolab/metasearch/ixi/sub-71/ses-1/IXI071-Guys-0770-T1_rep-0.nii.gz (file) [AnnexBatchCommandError: command 'addurl'
Error, annex reported failure for addurl (url='https://files.osf.io/v1/resources/5h7sv/providers/osfstorage/5839bc346c613b0210294263'): {'command': 'addurl', 'file': None, 'success': False} [annexrepo.py:add_url_to_file:2086]]

@satra
Member

satra commented Apr 4, 2018

@yarikoptic - i think that would be a good idea (to use versioned URLs).

it would also be great if we knew when objects matched to each other across git-annex repos. so if i already have abide from datalad, it would be nice that openneurolab/metasearch would not duplicate files locally.

how do we make crawlers common? datalad has crawlers, metasearch has crawlers, and it seems we should be able to use datalad crawlers to generate metasearch csv.

@yarikoptic
Author

versioned urls: I guess we could help with datalad.support.s3.get_versioned_url

matched objects: what would you expect to be done then, e.g. a symlink to be created into another local (e.g. abide) dataset? or the key file cp --reflink=auto'ed across? hardlinked?

common crawlers: I guess it would indeed be nice if there was some "standard" or at least "common" collection of crawlers providing data about availability/versions/etc. so different tools (metasearch, datalad, ...) could use them. Someone should look into bioCADDIE and the others I guess.
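For the versioned-URLs point above, a batch rewrite of an existing URL list could look roughly like the sketch below. It assumes datalad is installed and that `datalad.support.s3.get_versioned_url` can be called with just a URL; that call signature is an assumption and should be checked against the installed datalad version. `datalad_resolver` and `version_urls` are hypothetical names, and the resolver is injectable so the loop can be exercised without network access.

```python
def datalad_resolver(url):
    """Resolve url to a versioned URL via datalad.

    Assumption: get_versioned_url accepts a plain URL as its only
    required argument; verify against your datalad version.
    """
    # Imported lazily so the rest of the sketch works without datalad.
    from datalad.support.s3 import get_versioned_url

    return get_versioned_url(url)


def version_urls(urls, resolve=datalad_resolver):
    """Map each URL to its versioned form, or None if resolution failed."""
    result = {}
    for url in urls:
        try:
            result[url] = resolve(url)
        except Exception:
            # Keep going so one dead URL does not abort the whole batch.
            result[url] = None
    return result
```

Entries that come back as `None` would be the ones worth inspecting by hand, e.g. keys that now only have a delete marker.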

@satra
Member

satra commented Apr 4, 2018

@yarikoptic - i don't know how it would work, so here are some thoughts:

let's say there is a global filesystem on my computer (could be at annex level or datalad level).

datalad config --global-store /path/to/store or
git-annex config --global-store /path/to/store

each git repo has its own local store (git annex), as normal. but git annex would also point to a special local remote (the global store). any file that's already in the global store will not be copied

when i do a get, and this remote is local, it will:

  1. try to fetch from global store
  2. fetch from other location and push to global store
  3. create a link locally

if i modify the file:

  1. the modification is staged locally (if hard links are allowed, this is simple).
  2. it is moved to the global store on commit.
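The three "get" steps above can be sketched in plain Python, independent of git-annex. Everything here is hypothetical: `get_via_global_store`, the flat one-file-per-key layout, and the `fetch(key, dest)` callback are illustration only and do not reflect how git-annex stores content. The modify/commit steps are not covered.

```python
import os
import shutil


def get_via_global_store(key, local_dir, global_dir, fetch):
    """Make key available in local_dir, going through a shared store.

    fetch(key, dest) is a caller-supplied download function (assumed).
    """
    os.makedirs(local_dir, exist_ok=True)
    os.makedirs(global_dir, exist_ok=True)
    cached = os.path.join(global_dir, key)
    local = os.path.join(local_dir, key)

    # Steps 1 and 2: use the global store if it has the key, otherwise
    # fetch from elsewhere and place the result into the global store.
    if not os.path.exists(cached):
        tmp = cached + ".tmp"
        fetch(key, tmp)
        os.replace(tmp, cached)  # atomic rename within the store

    # Step 3: create a link locally instead of a second copy.
    if not os.path.exists(local):
        try:
            os.link(cached, local)
        except OSError:
            # Hard links fail across filesystems; fall back to a copy.
            shutil.copy2(cached, local)
    return local
```

With this layout, a second repository asking for the same key never calls `fetch` again; it only gets a new link into the shared store.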

@yarikoptic
Author

@joeyh what do you think about the above? It seems to go along with our discussion while in Montreal. Such a generic global-store could be a "web-like" special remote providing access to keys, and otherwise not being trusted etc. It could be provided by some normal local git-annex remote which could also be registered as any other git remote, so content could be "copied to" it to populate it.

@joeyh

joeyh commented Apr 5, 2018 via email

@yarikoptic
Author

FTR: regarding "global-store" -- an understanding was reached and it was implemented at the git-annex level, see https://git-annex.branchable.com/tips/local_caching_of_annexed_files/
