Add DatasetType="project" and rework existing "layout" example into a proper BIDS dataset #1861

yarikoptic · 2024-06-20T08:59:52Z

Rationale 1 (major): The BIDS standard already provides a reasonable structure for formalizing the organization of the various components of a neuroscientific data project: where to place code, original (source) data, derivative data, README, CHANGES. Many projects (e.g. nipoppy, YODA, etc.) propose similar, and often possibly even "inspired", templates. If we explicitly allow the BIDS standard to prescribe study-level organization, IMHO it would help many people and projects decide how to organize their studies/projects.
Rationale 2: IMHO the BIDS standard should describe only what the standard prescribes and not recommend potential "non-standardized" layouts. That is why I "reworked" that example into a legitimate BIDS dataset merely by adding dataset_description.json.
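
As a minimal sketch of what "merely adding dataset_description.json" could look like, the snippet below writes such a file with the proposed `DatasetType` value. The function name, the `Name` value, and the `BIDSVersion` value are placeholders chosen for illustration, and `"project"` is the value proposed in this PR, not one accepted by a released version of the standard.

```python
import json
from pathlib import Path


def write_project_description(root: Path, name: str) -> Path:
    """Write a minimal dataset_description.json marking `root` as a "project"."""
    description = {
        "Name": name,              # placeholder study name
        "BIDSVersion": "1.9.0",    # placeholder; "project" is only proposed here
        "DatasetType": "project",  # value proposed in this PR
    }
    root.mkdir(parents=True, exist_ok=True)
    path = root / "dataset_description.json"
    path.write_text(json.dumps(description, indent=2) + "\n")
    return path
```

Calling `write_project_description(Path("my_study"), "My study")` would create the directory if needed and place the file at its top level.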

TODOs:

- provide more relation to existing approaches (attn @snastase)
- [started] craft example(s) for bids-examples: Add examples with DatasetType = "project" bids-examples#451
- ensure bids-validator with modified schema passes its validation

Some references

edit 1: fresh related quote from medium.com post (emphasis added):

DrivenData Labs, the team behind the popular Cookiecutter Data Science template with over 7.8k stars on GitHub, knows these frustrations well.

After working on over 100 data science projects with a range of organizations, from new startups to large foundations and Fortune 50 companies. They’ve identified a major issue: the lack of standardization in project organization, collaboration, and reproducibility.

edit 2: paper "A practical guide to data management and sharing for biomedical laboratory researchers" (paywall) https://doi.org/10.1016/j.expneurol.2024.114815 -- organization of data at the lab level but also at the project level. To check for more ideas and possibly comparison/transformation. Also worth a look for metadata to add to BIDS.

This reverts commit a3c12f8 where I have tried to introduce it in bids-standard#1741 but it required a little more of further detailing.

src/common-principles.md

snastase · 2024-06-22T02:16:05Z

Here are a few examples of "project"- (or "study"-) level directory structures emerging from different initiatives. The fact that several smaller initiatives are arriving at similar but different solutions is strong motivation to provide a unifying solution. BIDS is the most widely accepted of these proposed solutions, and is therefore well-positioned to provide the "project-level" standard. The recursive structure of BIDS datasets (e.g. where a BIDS derivatives dataset is stored within a BIDS dataset) is already well-suited for this purpose.

Example 1: Nipoppy

Nipoppy provides one solution to this problem, but introduces some structures that diverge from BIDS. Converting this project-level Nipoppy directory to BIDS format requires only minor changes: (1) move proc into BIDS code; (2) move tabular data directory into BIDS sourcedata; (3) nest the BIDS derivatives inside the bids directory; (4) include additional project-level metadata files.

The three trees below are, in order: Nipoppy (original), BIDS (minimal), and BIDS (optimal).

<dataset-root>
├── proc/
│   └── global_config.json
├── tabular
│   ├── manifest.csv
│   ├── demographics/
│   ├── assessments/
│   └── bagel.csv
├── sourcedata/
├── bids/
│   └── sub-001/
│       └── ses-A/
└── derivatives/
    ├── fmriprep/
    │   ├── 20.2.7/
    │   └── 23.1.3/
    ├── mriqc/
    │   └── 23.1.0/
    └── bagel.csv

project-nipoppy/
├── code/
│   └── global_config.json
├── sourcedata/
│   ├── tabular/
│   │   ├── manifest.csv
│   │   ├── demographics/
│   │   ├── assessments/
│   │   └── bagel.csv
│   └── raw/
│       └── sub-001/
│           └── ses-A/
├── derivatives/
│   ├── fmriprep/
│   │   ├── 20.2.7/
│   │   └── 23.1.3/
│   ├── mriqc/
│   │   └── 23.1.0/
│   └── bagel.csv
├── README 
├── dataset_description.json 
└── CHANGES

project-nipoppy/
├── code/
│   └── global_config.json
├── sourcedata/
│   ├── tabular/
│   │   ├── manifest.csv
│   │   ├── demographics/
│   │   ├── assessments/
│   │   └── bagel.csv
│   └── raw/
│       └── sub-001/
│           └── ses-A/
├── derivatives/
│   ├── fmriprep-20.2.7/
│   ├── fmriprep-23.1.3/
│   ├── mriqc-23.1.0/
│   └── neurobagel-0.0.1/
│       └── bagel.csv
├── README 
├── dataset_description.json 
└── CHANGES
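
The rearrangement from the Nipoppy (original) tree to the BIDS (minimal) tree above can be sketched as a handful of directory moves. This is only an illustration that follows the trees as drawn: the function name is hypothetical, the `BIDSVersion` value is a placeholder, the README and CHANGES files are created empty, and it assumes the destination directories do not exist yet.

```python
import json
import shutil
from pathlib import Path


def nipoppy_to_bids_minimal(root: Path, name: str = "project-nipoppy") -> None:
    """Rearrange a Nipoppy-style tree in place, following the BIDS (minimal) tree."""
    moves = [
        ("proc", "code"),                   # proc/ becomes code/
        ("tabular", "sourcedata/tabular"),  # tabular data moves under sourcedata/
        ("bids", "sourcedata/raw"),         # the bids/ directory becomes sourcedata/raw/
    ]
    (root / "sourcedata").mkdir(exist_ok=True)
    for src, dst in moves:
        # Skip a move if the source is absent or the destination already exists.
        if (root / src).is_dir() and not (root / dst).exists():
            shutil.move(str(root / src), str(root / dst))
    # derivatives/ already sits at the top level, so it is left untouched.
    for filename in ("README", "CHANGES"):
        (root / filename).touch()           # empty placeholders, to be filled in by hand
    description = {"Name": name, "BIDSVersion": "1.9.0", "DatasetType": "project"}
    (root / "dataset_description.json").write_text(json.dumps(description, indent=2) + "\n")
```

Going further to the BIDS (optimal) tree would additionally mean flattening `derivatives/<pipeline>/<version>/` into `derivatives/<pipeline>-<version>/`, which is not attempted here.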

Example 2: The Princeton Handbook for Reproducible Neuroimaging

In the Princeton Handbook for Reproducible Neuroimaging we pre-populate a project-level directory structure for code, data, etc—which will typically contain one or more BIDS datasets within it. This directory structure would be converted to a BIDS-compliant version by repositioning dicom and other data directories inside a sourcedata directory and adding the accompanying top-level metadata files.

The three trees below are, in order: Princeton (original), BIDS (minimal), and BIDS (optimal).

new_study_template/
├── code/
│   ├── analysis/
│   ├── preprocessing/
│   └── task/
└── data/
    ├── behavioral/
    ├── bids/
    │   ├── sub-001/
    │   ├── sub-002/
    │   ├── sub-003/
    │   └── derivatives/
    │       ├── deface/
    │       ├── fmriprep/
    │       ├── freesurfer/
    │       └── mriqc/
    ├── dicom/
    └── work/

project-princeton/
├── code/
│   ├── analysis/
│   ├── preprocessing/
│   └── task/
├── sourcedata/
│   ├── behavior/
│   ├── raw/
│   │   ├── sub-001/
│   │   ├── sub-002/
│   │   ├── sub-003/
│   │   └── derivatives/
│   │       ├── deface/
│   │       ├── fmriprep/
│   │       ├── freesurfer/
│   │       └── mriqc/
│   ├── dicom/
│   └── work/
├── README 
├── dataset_description.json 
└── CHANGES

project-princeton/
├── code/
│   ├── analysis/
│   ├── preprocessing/
│   └── task/
├── sourcedata/
│   ├── behavior/
│   ├── raw/
│   │   ├── sub-001/
│   │   ├── sub-002/
│   │   └── sub-003/
│   └── dicom/
├── derivatives/
│   ├── deface/
│   ├── fmriprep/
│   ├── freesurfer/
│   ├── mriqc/
│   └── work/
├── README 
├── dataset_description.json 
└── CHANGES

Example 3: YODA

YODA introduces a set of principles for best practices for data analysis. Here, we nest several of the top-level example directories (ci, docs, and envs) into the code directory. None of these changes interfere with the YODA principles. A critical principle of YODA is that source data are referenced from within a derivative dataset. This recursive structure is now the default for BIDS Apps like fMRIPrep (as of version 20.2.1).

The two trees below are, in order: YODA (original) and BIDS (minimal).

├── ci/
│   └── .travis.yml
├── code/
│   ├── tests/ 
│   │   └── test_myscript.py
│   └── myscript.py
├── docs/
│   ├── build/
│   └── source/
├── envs/
│   └── Singularity
├── inputs/ 
│   └── data/
│       ├── dataset1/
│       │   └── datafile_a
│       └── dataset2/
│           └── datafile_a
├── important_results/
│   └── figures/
├── CHANGELOG.md
├── HOWTO.md
└── README.md

project-yoda/
├── code/
│   ├── ci/
│   │   └── .travis.yml
│   ├── tests/ 
│   │   └── test_myscript.py
│   ├── envs/
│   │   └── Singularity
│   ├── docs/
│   │   ├── build/
│   │   └── source/
│   ├── myscript.py
│   └── HOWTO.md
├── sourcedata/ 
│   └── data/
│       ├── dataset-1/
│       │   └── datafile-a
│       └── dataset-2/
│           └── datafile-a
├── derivatives/
│   └── results-important/
│       └── figures/
├── CHANGES
├── dataset_description.json 
└── README

Example 4: BIDS-MEGA

The proposed top-level directory structure for BIDS-MEGA BEP035 is already nearly BIDS-compliant. The only substantive change is to nest the study-* directories within sourcedata.

The two trees below are, in order: BIDS-MEGA (original) and BIDS (minimal).

my_megaanalysis/
├── dataset_description.json
├── studies.json
├── studies.tsv
├── derivatives/
│   ├── nimare-0.0.10
│   :
│
├── study-doe2012/
│   ├── dataset_description.json
│   ├── participants.json
│   ├── participants.tsv
│   ├── derivatives/
│   ├── sub-001/
│   ├── sub-002/
│   :
│
├── study-mustermann2017/
│   ├── dataset_description.json
│   ├── participants.json
│   ├── participants.tsv
│   ├── derivatives/
│   ├── sub-001/
│   ├── sub-002/
│   :
│
├── study-smith2015/
:   ├── dataset_description.json
    ├── participants.json
    ├── participants.tsv
    ├── derivatives/
    ├── sub-001/
    ├── sub-002/
    :

project-megaanalysis/
├── code/
│   ├── studies.json
│   └── studies.tsv
├── sourcedata/
│   ├── study-doe2012/
│   │   ├── dataset_description.json
│   │   ├── participants.json
│   │   ├── participants.tsv
│   │   ├── derivatives/
│   │   ├── sub-001/
│   │   └── sub-002/
│   ├── study-mustermann2017/
│   │   ├── dataset_description.json
│   │   ├── participants.json
│   │   ├── participants.tsv
│   │   ├── derivatives/
│   │   ├── sub-001/
│   │   └── sub-002/
│   └── study-smith2015/
│       ├── dataset_description.json
│       ├── participants.json
│       ├── participants.tsv
│       ├── derivatives/
│       ├── sub-001/
│       └── sub-002/
├── derivatives/
│   └── nimare-0.0.10/
├── CHANGES
├── dataset_description.json 
└── README

nikhil153 · 2024-07-02T20:13:47Z

Hi @yarikoptic, @snastase, @Remi-Gau, @jbpoline, @michellewang,

Here is a revised nipoppy layout that conforms to the BIDS (minimal) proposal, based on our discussions. The key motivations of nipoppy are as follows:

- We consider nipoppy primarily as a protocol for study coordinators / data managers dealing with iterative data capture, curation, processing, and tracking tasks. This widens our scope beyond typical BIDSification to include several additional files and processing support (e.g. Boutiques).
- The protocol starts with the creation of two project/study-level files, i.e. nipoppy_manifest and nipoppy_config, during the data capture stage, irrespective of data types and modalities. These files provide a starting point (i.e. ground truth) for all subsequent nipoppy protocol stages, and are therefore intuitively placed at the root of the dataset.
- We expect phenotypic (i.e. tabular) and imaging data to be collected and organized via independent workflows and people. Thus we prefer the aggregated phenotypic mode and place this subdirectory on the same level as the imaging subdirectory within sourcedata. This simplifies access control and tracking (i.e. bagels) of data availability.

Hope this makes sense!
Let me know your thoughts on the sample layout below. We would like to finalize it soon, as we are training several collaborators and deploying it for their studies in the coming weeks.

Thanks!

<DATASET_ROOT>/
├── nipoppy_config.json (goes into bidsignore)
├── nipoppy_manifest.json (goes into bidsignore)
├── scratch/ (goes into bidsignore)
├── sourcedata/
│   ├── downloads/ 
│   ├── raw_imaging/
│   │   ├── unorg/ (possibly this can be moved into downloads...) 
│   │   └── org/
│   ├── imaging/
│   │   ├── participants.tsv
│   │   ├── sub-01/
│   │   │   ├── anat
│   │   │   └── func
│   │   └── sub-02/
│   │       ├── anat
│   │       └── dwi
│   └── tabular/
│       ├── demographics/
│       ├── assessments/
│       └── bagel.csv
├── code/
│   ├── proc/
│   │   ├── invocations/
│   │   ├── descriptors/
│   │   ├── tracker_configs/
│   │   └── pybids/
│   │       ├── bids_db/
│   │       └── ignore_patterns/
│   ├── utils/
│   │   ├── generate_manifest.py
│   │   └── download_dicoms.py
│   └── analysis/
│       ├── run_func_connectivity.py
│       └── run_my_fancy_ML_model.py
├── derivatives/
│   ├── fmriprep/
│   │   ├── 20.2.7/
│   │   └── 23.1.3/
│   ├── mriqc/
│   │   └── 23.1.0/
│   └── bagel.csv
├── README 
├── dataset_description.json 
├── CHANGES 
└── .bidsignore
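
The "(goes into bidsignore)" annotations in the layout above could be realized with a small `.bidsignore` file, which takes one `.gitignore`-style pattern per line. The snippet below is a sketch only; the helper name is hypothetical and the three patterns are simply the annotated entries from the tree.

```python
from pathlib import Path

# Entries annotated "(goes into bidsignore)" in the layout above.
IGNORED = ["nipoppy_config.json", "nipoppy_manifest.json", "scratch/"]


def write_bidsignore(root: Path, patterns=IGNORED) -> Path:
    """Write (or extend) .bidsignore with one pattern per line, skipping duplicates."""
    path = root / ".bidsignore"
    existing = path.read_text().splitlines() if path.exists() else []
    merged = existing + [p for p in patterns if p not in existing]
    path.write_text("\n".join(merged) + "\n")
    return path
```

Running it twice leaves the file unchanged, so it can be called from a setup script without accumulating duplicate lines.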

yarikoptic · 2024-07-03T20:54:33Z

Thank you @nikhil153! That does make sense. Let me just run one more idea by you. Many tools (e.g. git, datalad, heudiconv) place their configuration, "state", etc. files into a corresponding .dotdir. That allows for better modular composition without affecting the overall "well-being" of a project folder that is otherwise unrelated to them. (Imagine if all .git/* files were at the top level!) Would you consider, instead of

├── nipoppy_config.json (goes into bidsignore)
├── nipoppy_manifest.json (goes into bidsignore)

having a .nipoppy/ (does not need to go into bidsignore - .dotdirs are ignored IIRC) with config.json and manifest.json? That would also make it easy to extend (should you decide to add more files, etc.). FWIW, even scratch/ could go there (like .git/tmp), but I am also OK with it at the top level - it is a generic pattern and we could provision for it explicitly even at the level of the standard (that there is a scratch/ to be ignored, etc.).
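
To make the two options concrete, here is a hypothetical lookup helper that prefers the `.nipoppy/` location and falls back to the top-level file. It is a sketch of how a tool could support both layouts during a transition, not a description of what nipoppy actually does; the function name and the search order are assumptions.

```python
from pathlib import Path
from typing import Optional

# Candidate locations, dot-directory first; both names are taken from this thread.
CANDIDATES = (".nipoppy/config.json", "nipoppy_config.json")


def find_nipoppy_config(root: Path) -> Optional[Path]:
    """Return the first existing config file under `root`, or None if there is none."""
    for relative in CANDIDATES:
        candidate = root / relative
        if candidate.is_file():
            return candidate
    return None
```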

nikhil153 · 2024-07-05T13:40:58Z

@yarikoptic - thanks for a quick review!

We did consider having a nipoppy dir to clearly separate BIDS specification from nipoppy-protocol related files and intermediate output. Although it could work, we had concerns about visibility of key files i.e. manifest.csv and config.json which we expect the users to interact with regularly. If these files are hidden by default, it is likely to be confusing for the study coordinators / data managers, who are responsible for owning and maintaining them. Apart from these two files, I believe everything else could go into .dotdir if that's preferred.

@jbpoline, @michellewang - Happy to hear thoughts / alternatives!

yarikoptic · 2024-07-05T19:10:09Z

Indeed, I agree that if the files are often worked on, a .dotdir might be too hidden. But IMHO a nipoppy/ folder is just as visible, and might be even more so when there is also a good number of sub- folders, which would otherwise push the file listing to the very bottom. Another benefit -- nipoppy users, after changing to the nipoppy/ folder, might feel "at home", as all nipoppy files would be conveniently grouped for them in that folder.

yarikoptic · 2024-07-18T21:18:36Z

@nikhil153 re above, I wonder if you intend to have sourcedata/imaging/ to

- be BIDS dataset
- only imaging BIDS dataset?

or alternatively -- where would you place "raw BIDS dataset" which might incorporate behavioral, phenotypic(, physiological, ...) data?

nikhil153 · 2024-07-28T19:42:29Z

Hi @yarikoptic -

re: I wonder if you intend to have sourcedata/imaging/ to
- be BIDS dataset
- only imaging BIDS dataset?

In that proposal, it was meant to be only imaging BIDS dataset.

However, there have been new developments based on the feedback we received in our training workshops with several collaborators over the last few weeks, including at ENIGMA-PD, where we are trying to define and deploy a common end-to-end study SOP / protocol for many contributing sites.

The tricky thing for us has been the alignment of the BIDS layout with the stages of the nipoppy protocol (see Fig-1), i.e. capture --> curate --> process --> track --> extract. Most data managers currently associate BIDS mainly with the curation of imaging data and leave curation of phenotypic data and derivatives to the users. Thus for the nipoppy protocol, BIDSification remains one of its many stages. This makes it less intuitive to see nipoppy as one of the submodules of the BIDS specification - especially if we try to fit within the current BIDS1.0 layout.

Specifically, dumping all non-derived nipoppy data directories inside sourcedata gets confusing. Especially because that diverges from the current definitions of directory levels of sourcedata and rawdata (see this and this).

So based on our discussions with collaborators, we have revised the proposed layout (see below). The major change from the last proposal is keeping the sourcedata, imaging, phenotypic, and derivative data directories at the top level. As a result:

- Only the imaging directory remains BIDS compliant under the current specification.
- It allows a more granular specification of imaging data wrangling and preparation for BIDSification within sourcedata, and avoids confusion with rawdata.
- It simplifies (parallel) curation and access control of phenotypic and derived data.

These decisions are mainly motivated by our need to keep the study protocol as intuitive as possible - especially for ongoing and prospective studies. I am happy to discuss how this can be refined and possibly inform BIDS2.0 study specification.

Maybe we can set up a call?

my_study/
├── manifest.tsv (participant details: populate during recruitment / first step of the protocol) 
├── global_config.json (processing details) 
├── sourcedata (imaging + pheno + other stuff that is captured during the study) /
│   ├── imaging (dicoms / nifties/ parrec) /
│   │   ├── messy (data dumps)
│   │   └── ordered (e.g. subject/session/<files>) 
│   └── tabular (unorganized Excel sheets / RedCAP reports) 
├── imaging (the only BIDS compliant subdir now) /
│   ├── dataset_description.json 
│   ├── sub-01/
│   │   └── ses-A/
│   │       ├── anat
│   │       └── dwi
│   └── sub-02/
│       └── ses-A/
│           ├── anat
│           └── dwi
├── tabular (we could call this "pheno" but we have a slight preference for "tabular" because we don't store imaging-phenotypes here)/
│   ├── demographics.tsv (file with pre-specified column names for universal variables such as age, sex, group) 
│   ├── assessments (optional study-specific behavioural, cognitive, physiological variables)/
│   │   ├── instrument_A.tsv
│   │   └── instrument_B.tsv 
│   └── status.tsv (tracked availability of tabular data)
├── derivatives /
│   ├── pipeline_A/
│   │   └── 1.0/
│   │       ├── output (pipeline output)
│   │       └── idp (extracted analysis-ready aggregated measures) 
│   └── status.tsv (tracked availability of imaging data)
├── workflow_configs /
│   ├── containers/
│   │   └── pipeline_A-1.0.sif (possibly a symlink)
│   ├── descriptors/
│   │   └── pipeline_A-1.0.json
│   ├── invocations/
│   │   └── pipeline_A-1.0.json
│   ├── tracker/
│   │   └── pipeline_A-1.0.json
│   └── pybids_ignore_patterns/
│       └── pipeline_A-1.0.json
├── code/
│   ├── utils /
│   │   └── generate_manifest.py 
│   └── analysis/
│       └── run_my_fancy_ML_model.py 
├── scratch (should be deletable) /
│   ├── pybids_db
│   ├── downloads (.tar.gz) 
│   └── work (tmp working dir for pipelines) 
└── logs

nikhil153 · 2024-08-08T17:51:07Z

Hi @yarikoptic, @jbpoline - just pinging to check if you had any comments?

jbpoline · 2024-08-08T19:16:48Z

Hi, this sounds reasonable to me, but it would be great to discuss with @yarikoptic and others (@arokem, @dorahermes, @cmaumet, @CPernet ...). I think it also relates to the larger issue of the BIDS scope and whether or not it is beneficial for BIDS to have all types of data as subcomponents. There are pros and cons to that, but proposing something that eases adoption by communities other than the neuroimaging one would be important to me.

CPernet · 2024-08-09T10:44:25Z

I am not sure if the goal is to validate the entire structure while explicitly ignoring some parts, or to add a config file specifying which folders need validation.

Assuming you use the BIDS validator with some config file, the validation would work on:

- imaging
- tabular/phenotype (seems easy enough to have a subfolder)
- derivatives --> you can choose to make it a derivatives dataset (i.e. add a json) with source pointing to the imaging folder (OK, for now the validator does not do much on derivatives, but in principle it can)
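
The "add a json with source pointing to the imaging folder" step might look like the sketch below, which writes a `dataset_description.json` for a derivatives dataset using the existing `DatasetType`, `GeneratedBy`, and `SourceDatasets` fields. The function name and the `BIDSVersion` value are placeholders, and using a relative path as the `URL` of the source dataset is an assumption made for illustration.

```python
import json
from pathlib import Path


def write_derivative_description(deriv_root: Path, pipeline: str, version: str,
                                 source: str) -> Path:
    """Write a dataset_description.json marking `deriv_root` as a derivative dataset."""
    description = {
        "Name": f"{pipeline} outputs",
        "BIDSVersion": "1.9.0",                # placeholder version
        "DatasetType": "derivative",
        "GeneratedBy": [{"Name": pipeline, "Version": version}],
        "SourceDatasets": [{"URL": source}],   # assumed: relative path to the imaging folder
    }
    deriv_root.mkdir(parents=True, exist_ok=True)
    path = deriv_root / "dataset_description.json"
    path.write_text(json.dumps(description, indent=2) + "\n")
    return path
```

For the layout discussed here, a call such as `write_derivative_description(Path("my_study/derivatives/pipeline_A/1.0"), "pipeline_A", "1.0", "../../../imaging")` would point the derivatives back at the top-level imaging directory.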

yarikoptic · 2024-08-09T17:49:05Z

@nikhil153 sorry for the delays, I am still coming out of workation mode into proper work mode - I need more time to digest/reply.

@CPernet I think we somewhat hijacked this PR into something more important and probably more useful - an attempt to harmonize existing schemes (e.g. of nipoppy) into a prototypical "BIDS project". For the purpose of this PR I was not looking into extending BIDS beyond what it already has really in terms of "official folders" -- all those could be added later.

The main point behind this PR is that even current BIDS, with its existing specification for CHANGES, dataset_description.json, README.md, code/, sourcedata/, and derivatives/, already covers the majority of the needs of existing schemes, without any changes to the BIDS standard at all. How we could extend it later, and how to bolt on various validation helpers, is IMHO beyond the scope of this PR. For this PR the validator would indeed just need to verify the DatasetType value (once that is fixed in the validator) and otherwise inspect those other, already standard, files and locations.

I will take it out of draft since I think it is ready for review (and it has been reviewed ;) ), although indeed, before it is merged/released, we had better make the validator ready for it.
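
As a rough illustration of how small that check is, the sketch below reads `DatasetType` and reports which of the already standard files and directories are absent. It is not the bids-validator and does not reflect its API; the function name and the message strings are made up for this example, and `"project"` is the value proposed in this PR.

```python
import json
from pathlib import Path

ALLOWED_TYPES = {"raw", "derivative", "project"}  # "project" is the proposed addition
# Glob patterns for the already standard top-level entries (README may carry an extension).
STANDARD_ENTRIES = ("README*", "CHANGES", "code", "sourcedata", "derivatives")


def check_project(root: Path) -> list:
    """Return a list of human-readable findings; an empty list means nothing to report."""
    description_file = root / "dataset_description.json"
    if not description_file.is_file():
        return ["dataset_description.json is missing"]
    findings = []
    description = json.loads(description_file.read_text())
    dataset_type = description.get("DatasetType", "raw")  # BIDS defaults to "raw"
    if dataset_type not in ALLOWED_TYPES:
        findings.append(f"unexpected DatasetType: {dataset_type!r}")
    for pattern in STANDARD_ENTRIES:
        if not any(root.glob(pattern)):
            findings.append(f"entry not present: {pattern}")
    return findings
```

Since code/, sourcedata/, and derivatives/ are optional, their absence would be a note rather than an error in a real validator; the sketch lumps both kinds of finding together for brevity.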

yarikoptic · 2024-08-28T13:42:46Z

Dear all participants of this PR and @bids-maintenance folks -- I would like to move the discussions we had on "how to transform X into BIDS project" into some document so that

- they do not divert our attention from the actual PR content
- we can discuss them more conveniently (PR comments on text are much better than a free flow of comments in the main post)
- they do not get buried in the comments whenever this PR is merged -- I think they are great and worth their own life.

I do not think that bids-specification itself is a good place. But maybe somewhere on the new website or the "old" bids-starter-kit? (attn @Remi-Gau)

Remi-Gau · 2024-08-28T13:45:57Z

"How to" should go on the new bids website.

yarikoptic added 2 commits June 20, 2024 00:16

Add the notion that example layout can in fact be a valid BIDS dataset

37c32f5

This reverts commit a3c12f8 where I have tried to introduce it in bids-standard#1741 but it required a little more of further detailing.

Move and extend description and definition of DatasetType "project"

2d9bfdf

tsalo reviewed Jun 21, 2024

yarikoptic changed the title ~~Add DataType="project" and rework existing "layout" example into a proper BIDS dataset~~ Add DatasetType="project" and rework existing "layout" example into a proper BIDS dataset Jun 22, 2024

This was referenced Jun 22, 2024

Validate dataset_description.json contents bids-standard/bids-validator#2007

Closed

Add examples with DatasetType = "project" bids-standard/bids-examples#451

Draft

yarikoptic marked this pull request as ready for review August 9, 2024 17:49

yarikoptic requested review from erdalkaraca and DimitriPapadopoulos as code owners August 9, 2024 17:49

yarikoptic mentioned this pull request Aug 29, 2024

rawdata & root (top-level) BIDS dataset #1882

Open

This was referenced Sep 20, 2024

Allow composition of a BIDS dataset (dataset level) from smaller (subj or subj/ses) level bids-standard/bids-2-devel#59

Open

Use-case(s): BIDS-inspired/like standards bids-standard/bids-2-devel#62

Open
