
NF: Create updated dataset-level extractor for BIDS datasets #104

Merged
merged 27 commits into master on Sep 1, 2022

Conversation

@jsheunis (Member) commented Feb 16, 2022

This PR is in response to #94. It adds a dataset-level extractor for BIDS datasets, called bids_dataset that:

  • uses gen4 metadata handling with datalad-metalad (and introduces this dependency, datalad-metalad>=0.3.1)
  • ensures the required files dataset_description.json and participants.tsv are available locally before proceeding with the extraction process
  • does not require locally available file content other than the files mentioned above and any README text files, whose content it includes in the extraction output (automatically running get where applicable)
  • is compatible with pybids>=0.15.1 and BIDS v1.6.0
  • does not change the existing bids extractor in any way
  • extracts extra information about the BIDS dataset (compared to the existing bids extractor), including information about subjects, sessions, runs, tasks, entities, and variables.
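The dataset-level focus described above can be illustrated with a rough, stdlib-only sketch: parse dataset_description.json and participants.tsv into a small metadata record. The function name and the output shape here are made up for illustration; the actual extractor's gen4 output schema differs.

```python
import csv
import json
from pathlib import Path


def sketch_bids_dataset_metadata(ds_root):
    """Illustrative only: gather a few dataset-level BIDS fields
    from the two files the extractor requires to be present locally."""
    root = Path(ds_root)
    with open(root / "dataset_description.json") as f:
        description = json.load(f)
    with open(root / "participants.tsv") as f:
        participants = list(csv.DictReader(f, delimiter="\t"))
    return {
        "name": description.get("Name"),
        "bids_version": description.get("BIDSVersion"),
        "subjects": [row["participant_id"] for row in participants],
    }
```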

Old PR comment:

This WIP PR, intended to address #94, introduces a new extractor bids_dataset that:

  • builds on the new/next generation datalad-metalad functionality
  • only extracts metadata from content that does not require getting annexed data
  • replaces dataset-level metadata extraction functionality of the existing bids.py extractor

Main changes:

  • adds bids_dataset.py extractor
  • adds test_bids_dataset.py test
  • adds new extractor entrypoint to setup.cfg
  • adds metalad requirement to setup.cfg

@yarikoptic (Member)

  • only extracts metadata from content that does not require getting annexed data

you are killing my dream of using glorious datalad-fuse! ;-) and how do you know that those files would not be annexed?

@codecov (bot) commented Feb 16, 2022

Codecov Report

Merging #104 (9cd3a3b) into master (caec500) will increase coverage by 0.69%.
The diff coverage is 90.44%.

❗ Current head 9cd3a3b differs from pull request most recent head 73cc22f. Consider uploading reports for the commit 73cc22f to get more accurate results.

@@            Coverage Diff             @@
##           master     #104      +/-   ##
==========================================
+ Coverage   85.12%   85.82%   +0.69%     
==========================================
  Files          21       23       +2     
  Lines        1049     1206     +157     
==========================================
+ Hits          893     1035     +142     
- Misses        156      171      +15     
Impacted Files Coverage Δ
datalad_neuroimaging/extractors/bids_dataset.py 88.00% <88.00%> (ø)
...neuroimaging/extractors/tests/test_bids_dataset.py 100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update caec500...73cc22f.

@jsheunis (Member, Author)

you are killing my dream of using glorious datalad-fuse! ;-) and how do you know that those files would not be annexed?

Good point. I don't know that they won't be annexed.

Something I think is important when using metalad is to make a distinction between dataset-level and file-level extractors. With this update focusing on the dataset level, does it perhaps make sense to require a specific set of BIDS-compatible files necessary for extracting dataset-level metadata (e.g. participants.tsv, any READMEs, dataset_description.json) and to ignore the rest?

I realise that there could be any number of files in the BIDS dataset from which to extract file-level metadata and which, if combined/aggregated/derived, could yield metadata that is interpretable at the dataset level. But the number and combination of such files is arbitrary and not easily definable at the dataset level. Even without that information, though, the currently proposed dataset-level extractor can still extract useful information.

@jsheunis jsheunis closed this Feb 16, 2022
@jsheunis jsheunis reopened this Feb 16, 2022
from datalad.metadata.definitions import vocabulary_id
from datalad.utils import assure_unicode
from typing import Dict, List, Union
import json
Inline review comment (Member) on the import block above:

Eventually this should see the impact of python -m isort -m3 -fgw 2 -tc datalad_neuroimaging/extractors/bids_dataset.py

Two review threads on datalad_neuroimaging/extractors/bids_dataset.py (outdated, resolved).
@jsheunis (Member, Author) commented Apr 1, 2022

TODO for @jsheunis:

  • ensure the dataset_description.json file is available locally; if not, get it.
  • decide if README extraction is handled as is, or by tbd helper
  • UUID
  • proper logging, result yielding, error handling, and user messaging

@jsheunis (Member, Author)


The challenge of whether required files are annexed or not is addressed by correctly specifying the get_required_content functionality, as per f5655da

@jsheunis (Member, Author)

Ok, this PR has been open for a long time and I want to merge it. I'm going to finish whatever remains and is easy to complete, and will then merge unless there is strong disagreement. @datalad/developers

@jsheunis (Member, Author)

  • only extracts metadata from content that does not require getting annexed data

you are killing my dream of using glorious datalad-fuse! ;-) and how do you know that those files would not be annexed?

This initial statement is no longer correct. Any annexed content necessary for the extraction process will be fetched before extraction starts via metalad's get_required_content functionality, or where applicable on a case-by-case basis (e.g. README content).
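The fetch-before-extract pattern mentioned here can be sketched in plain Python. To be clear, this is a stand-in, not metalad's actual extractor base class: the class name, the fetch callable, and the method wiring are assumptions for illustration; only the method name get_required_content is taken from the discussion above.

```python
from pathlib import Path


class BidsDatasetExtractorSketch:
    """Stand-in illustrating fetch-before-extract;
    not metalad's real DatasetMetadataExtractor API."""

    REQUIRED = ["dataset_description.json", "participants.tsv"]

    def __init__(self, dataset_root, fetch):
        self.root = Path(dataset_root)
        # `fetch` models whatever retrieves annexed content, e.g. datalad get
        self.fetch = fetch

    def get_required_content(self):
        # Yield the paths that must be present locally before extraction runs.
        for name in self.REQUIRED:
            yield self.root / name

    def ensure_required_content(self):
        # The framework would call this (and hence `datalad get`)
        # before invoking the actual extraction step.
        for path in self.get_required_content():
            if not path.exists():
                self.fetch(path)
```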

@jsheunis (Member, Author)

The remaining issue is the failing test; will sort that now.

@jsheunis (Member, Author) commented Aug 31, 2022

The remaining test failure seems to be related to some dataset not containing gen4 metadata (probably outdated test data, or possibly related to the metalad version bump?), while test_bids_dataset.py now succeeds. Will create a separate issue for the remaining failure.

@jsheunis jsheunis changed the title [WIP] Update BIDS extractor Create updated dataset-level extractor for BIDS datasets Sep 1, 2022
@jsheunis jsheunis changed the title Create updated dataset-level extractor for BIDS datasets NF: Create updated dataset-level extractor for BIDS datasets Sep 1, 2022
@jsheunis jsheunis merged commit 186c294 into master Sep 1, 2022