
NF: Create updated dataset-level extractor for BIDS datasets #104

Merged
merged 27 commits into master on Sep 1, 2022

Conversation

@jsheunis (Member) commented Feb 16, 2022

This PR is in response to #94. It adds a dataset-level extractor for BIDS datasets, called bids_dataset that:

  • uses gen4 metadata handling with datalad-metalad (and introduces this dependency, datalad-metalad>=0.3.1)
  • ensures the required files dataset_description.json and participants.tsv are available locally before proceeding with the extraction process
  • does not require locally available file content other than the files mentioned above and any README text files, whose content it includes in the extraction output (automatically running get where applicable)
  • is compatible with pybids>=0.15.1 and BIDS v1.6.0
  • does not change the existing bids extractor in any way
  • extracts extra information about the BIDS dataset (compared to the existing bids extractor), including information about subjects, sessions, runs, tasks, entities, and variables.
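The dataset-level focus described above can be illustrated with a rough, stdlib-only sketch: parse dataset_description.json and participants.tsv into a small metadata record. The function name and the output shape here are made up for illustration; the actual extractor's gen4 output schema differs.

```python
import csv
import json
from pathlib import Path


def sketch_bids_dataset_metadata(ds_root):
    """Illustrative only: gather a few dataset-level BIDS fields
    from the two files the extractor requires to be present locally."""
    root = Path(ds_root)
    with open(root / "dataset_description.json") as f:
        description = json.load(f)
    with open(root / "participants.tsv") as f:
        participants = list(csv.DictReader(f, delimiter="\t"))
    return {
        "name": description.get("Name"),
        "bids_version": description.get("BIDSVersion"),
        "subjects": [row["participant_id"] for row in participants],
    }
```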

Old PR comment:

This WIP PR, intended to address #94, introduces a new extractor bids_dataset that:

  • builds on the new/next generation datalad-metalad functionality
  • only extracts metadata from content that does not require getting annexed data
  • replaces dataset-level metadata extraction functionality of the existing bids.py extractor

Main changes:

  • adds bids_dataset.py extractor
  • adds test_bids_dataset.py test
  • adds new extractor entrypoint to setup.cfg
  • adds metalad requirement to setup.cfg

@yarikoptic (Member)

  • only extracts metadata from content that does not require getting annexed data

you are killing my dream of using glorious datalad-fuse! ;-) and how do you know that those files would not be annexed?

@codecov (bot) commented Feb 16, 2022

Codecov Report

Merging #104 (9cd3a3b) into master (caec500) will increase coverage by 0.69%.
The diff coverage is 90.44%.

❗ Current head 9cd3a3b differs from pull request most recent head 73cc22f. Consider uploading reports for the commit 73cc22f to get more accurate results.

@@            Coverage Diff             @@
##           master     #104      +/-   ##
==========================================
+ Coverage   85.12%   85.82%   +0.69%     
==========================================
  Files          21       23       +2     
  Lines        1049     1206     +157     
==========================================
+ Hits          893     1035     +142     
- Misses        156      171      +15     
Impacted Files Coverage Δ
datalad_neuroimaging/extractors/bids_dataset.py 88.00% <88.00%> (ø)
...neuroimaging/extractors/tests/test_bids_dataset.py 100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update caec500...73cc22f.

@jsheunis (Member, Author)

you are killing my dream of using glorious datalad-fuse! ;-) and how do you know that those files would not be annexed?

Good point. I don't know that they won't be annexed.

Something I think is important when using metalad is to make a distinction between dataset-level and file-level extractors. With this update focusing on the dataset level, does it perhaps make sense to require a specific set of BIDS-compatible files necessary for extracting dataset-level metadata (e.g. participants.tsv, any READMEs, dataset_description.json) and to ignore the rest?

I realise that there could be any number of files in the BIDS dataset from which to extract file-level metadata and which, if combined/aggregated/derived, could yield metadata that is interpretable at the dataset level. But the number and combination of such files is arbitrary and not easily definable at the dataset level. Even without that information, though, the currently proposed dataset-level extractor can still extract useful information.

@jsheunis jsheunis closed this Feb 16, 2022
@jsheunis jsheunis reopened this Feb 16, 2022
from datalad.metadata.definitions import vocabulary_id
from datalad.utils import assure_unicode
from typing import Dict, List, Union
import json
Inline review comment (Member) on the import block above:

Eventually this should see the impact of python -m isort -m3 -fgw 2 -tc datalad_neuroimaging/extractors/bids_dataset.py

Two review threads on datalad_neuroimaging/extractors/bids_dataset.py (outdated, resolved).
@jsheunis (Member, Author) commented Apr 1, 2022

TODO for @jsheunis:

  • ensure the dataset_description.json file is available locally; if not, get it.
  • decide if README extraction is handled as is, or by tbd helper
  • UUID
  • proper logging, result yielding, error handling, and user messaging

@jsheunis (Member, Author)


The challenge of whether required files are annexed or not is addressed by correctly specifying the get_required_content functionality, as per f5655da

@jsheunis (Member, Author)

Ok, this PR has been open for a long time and I want to merge it. I'm going to finish whatever remains and is easy to complete, and will then merge unless there is strong disagreement. @datalad/developers

@jsheunis (Member, Author)

  • only extracts metadata from content that does not require getting annexed data

you are killing my dream of using glorious datalad-fuse! ;-) and how do you know that those files would not be annexed?

This initial statement is no longer correct. Any annexed content necessary for the extraction process will be fetched before extraction starts via metalad's get_required_content functionality, or where applicable on a case-by-case basis (e.g. README content).
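The fetch-before-extract pattern mentioned here can be sketched in plain Python. To be clear, this is a stand-in, not metalad's actual extractor base class: the class name, the fetch callable, and the method wiring are assumptions for illustration; only the method name get_required_content is taken from the discussion above.

```python
from pathlib import Path


class BidsDatasetExtractorSketch:
    """Stand-in illustrating fetch-before-extract;
    not metalad's real DatasetMetadataExtractor API."""

    REQUIRED = ["dataset_description.json", "participants.tsv"]

    def __init__(self, dataset_root, fetch):
        self.root = Path(dataset_root)
        # `fetch` models whatever retrieves annexed content, e.g. datalad get
        self.fetch = fetch

    def get_required_content(self):
        # Yield the paths that must be present locally before extraction runs.
        for name in self.REQUIRED:
            yield self.root / name

    def ensure_required_content(self):
        # The framework would call this (and hence `datalad get`)
        # before invoking the actual extraction step.
        for path in self.get_required_content():
            if not path.exists():
                self.fetch(path)
```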

@jsheunis (Member, Author)

The remaining issue is the failing test; will sort that now.

@jsheunis (Member, Author) commented Aug 31, 2022

The remaining test failure seems to be related to some dataset not containing gen4 metadata (probably outdated test data, or possibly related to the metalad version bump?), while test_bids_dataset.py now succeeds. Will create a separate issue for the remaining failure.

@jsheunis jsheunis changed the title [WIP] Update BIDS extractor Create updated dataset-level extractor for BIDS datasets Sep 1, 2022
@jsheunis jsheunis changed the title Create updated dataset-level extractor for BIDS datasets NF: Create updated dataset-level extractor for BIDS datasets Sep 1, 2022
@jsheunis jsheunis merged commit 186c294 into master Sep 1, 2022