Add DatasetType="project" and rework existing "layout" example into a proper BIDS dataset #1861

Open
wants to merge 2 commits into
base: master

yarikoptic
Collaborator

@yarikoptic yarikoptic commented Jun 20, 2024

  • Rationale 1 (major): The BIDS standard already provides a reasonable structure to formalize the organization of the various components of a neuroscientific data project: where to place code, original (source) data, derivative data, README, CHANGES. Many projects (e.g. nipoppy, YODA, etc.) propose similar, and often possibly even "inspired", templates. If we explicitly allow the BIDS standard to prescribe study-level organization, IMHO it would help many people and projects decide how to organize their studies/projects.
  • Rationale 2: IMHO the BIDS standard should describe only what the standard prescribes and not recommend potential "non-standardized" layouts. That is why I "reworked" that example into a legitimate BIDS dataset merely by adding dataset_description.json.
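As a rough illustration of Rationale 2, the sketch below writes a minimal dataset_description.json that marks a directory as a project-level dataset. The function name, the example dataset name, and the BIDSVersion string are assumptions made for illustration; DatasetType="project" is the value proposed in this PR and is not accepted by released versions of the standard.

```python
import json
import tempfile
from pathlib import Path


def write_project_description(root: Path, name: str) -> Path:
    """Write a minimal dataset_description.json marking `root` as a project."""
    description = {
        "Name": name,
        # Assumption: placeholder version string; use the version you target.
        "BIDSVersion": "1.9.0",
        # Value proposed in this PR; not part of released BIDS versions.
        "DatasetType": "project",
    }
    root.mkdir(parents=True, exist_ok=True)
    path = root / "dataset_description.json"
    path.write_text(json.dumps(description, indent=2) + "\n")
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        written = write_project_description(Path(tmp) / "my_project", "My project")
        print(written.read_text())
```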

TODOs:

Some references

  • edit 1: fresh related quote from medium.com post (emphasis added):

DrivenData Labs, the team behind the popular Cookiecutter Data Science template with over 7.8k stars on GitHub, knows these frustrations well.

After working on over 100 data science projects with a range of organizations, from new startups to large foundations and Fortune 50 companies. They’ve identified a major issue: the lack of standardization in project organization, collaboration, and reproducibility.

  • edit 2: paper "A practical guide to data management and sharing for biomedical laboratory researchers" (paywall) https://doi.org/10.1016/j.expneurol.2024.114815 -- covers organization of data at the lab level but also at the project level. Worth checking for more ideas and a possible comparison/transformation. Also looking for metadata to add to BIDS.

@snastase
Contributor

snastase commented Jun 22, 2024

Here are a few examples of "project"- (or "study"-) level directory structures emerging from different initiatives. The fact that several smaller initiatives are arriving at similar but different solutions is strong motivation to provide a unifying solution. BIDS is the most widely accepted of these proposed solutions, and is therefore well-positioned to provide the "project-level" standard. The recursive structure of BIDS datasets (e.g. where a BIDS derivatives dataset is stored within a BIDS dataset) is already well-suited for this purpose.

Example 1: Nipoppy

Nipoppy provides one solution to this problem, but introduces some structures that diverge from BIDS. Converting this project-level Nipoppy directory to BIDS format requires only minor changes: (1) move proc into BIDS code; (2) move tabular data directory into BIDS sourcedata; (3) nest the BIDS derivatives inside the bids directory; (4) include additional project-level metadata files.

The three layouts below are, in order: Nipoppy (original), BIDS (minimal), and BIDS (optimal).
<dataset-root>
├── proc/
│   └── global_config.json
├── tabular
│   ├── manifest.csv
│   ├── demographics/
│   ├── assessments/
│   └── bagel.csv
├── sourcedata/
├── bids/
│   └── sub-001/
│       └── ses-A/
└── derivatives/
    ├── fmriprep/
    │   ├── 20.2.7/
    │   └── 23.1.3/
    ├── mriqc/
    │   └── 23.1.0/
    └── bagel.csv
BIDS (minimal):
project-nipoppy/
├── code/
│   └── global_config.json
├── sourcedata/
│   ├── tabular/
│   │   ├── manifest.csv
│   │   ├── demographics/
│   │   ├── assessments/
│   │   └── bagel.csv
│   └── raw/
│       └── sub-001/
│           └── ses-A/
├── derivatives/
│   ├── fmriprep/
│   │   ├── 20.2.7/
│   │   └── 23.1.3/
│   ├── mriqc/
│   │   └── 23.1.0/
│   └── bagel.csv
├── README 
├── dataset_description.json 
└── CHANGES 
BIDS (optimal):
project-nipoppy/
├── code/
│   └── global_config.json
├── sourcedata/
│   ├── tabular/
│   │   ├── manifest.csv
│   │   ├── demographics/
│   │   ├── assessments/
│   │   └── bagel.csv
│   └── raw/
│       └── sub-001/
│           └── ses-A/
├── derivatives/
│   ├── fmriprep-20.2.7/
│   ├── fmriprep-23.1.3/
│   ├── mriqc-23.1.0/
│   └── neurobagel-0.0.1/
│       └── bagel.csv
├── README 
├── dataset_description.json 
└── CHANGES 
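
The rearrangement from the original Nipoppy layout to the "BIDS (minimal)" tree could be scripted roughly as follows. This is only a sketch: it follows the tree shown above (where bids/ ends up as sourcedata/raw/ and derivatives/ stays at the top level), the function name is hypothetical, the metadata contents are placeholders, and DatasetType="project" is the value proposed in this PR.

```python
import json
import shutil
from pathlib import Path


def nipoppy_to_bids_minimal(root: Path) -> None:
    """Rearrange an original Nipoppy layout in place (no overwrite handling)."""
    # proc/ becomes code/ (assumes code/ does not exist yet)
    if (root / "proc").is_dir():
        shutil.move(str(root / "proc"), str(root / "code"))

    sourcedata = root / "sourcedata"
    sourcedata.mkdir(exist_ok=True)

    # tabular/ goes under sourcedata/
    if (root / "tabular").is_dir():
        shutil.move(str(root / "tabular"), str(sourcedata / "tabular"))

    # bids/ is shown as sourcedata/raw/ in the minimal tree
    if (root / "bids").is_dir():
        shutil.move(str(root / "bids"), str(sourcedata / "raw"))

    # Project-level metadata files with placeholder contents
    for name in ("README", "CHANGES"):
        (root / name).touch()
    description = {
        "Name": root.name,
        "BIDSVersion": "1.9.0",  # assumption: placeholder version
        "DatasetType": "project",  # proposed in this PR
    }
    (root / "dataset_description.json").write_text(
        json.dumps(description, indent=2) + "\n"
    )
```

The derivatives/ directory is deliberately left untouched, since the minimal tree keeps the versioned pipeline subdirectories as they are.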

Example 2: The Princeton Handbook for Reproducible Neuroimaging

In the Princeton Handbook for Reproducible Neuroimaging we pre-populate a project-level directory structure for code, data, etc—which will typically contain one or more BIDS datasets within it. This directory structure would be converted to a BIDS-compliant version by repositioning dicom and other data directories inside a sourcedata directory and adding the accompanying top-level metadata files.

The three layouts below are, in order: Princeton (original), BIDS (minimal), and BIDS (optimal).
new_study_template/
├── code/
│   ├── analysis/
│   ├── preprocessing/
│   └── task/
└── data/
    ├── behavioral/
    ├── bids/
    │   ├── sub-001/
    │   ├── sub-002/
    │   ├── sub-003/
    │   └── derivatives/
    │       ├── deface/
    │       ├── fmriprep/
    │       ├── freesurfer/
    │       └── mriqc/
    ├── dicom/
    └── work/
BIDS (minimal):
project-princeton/
├── code/
│   ├── analysis/
│   ├── preprocessing/
│   └── task/
├── sourcedata/
│   ├── behavior/
│   ├── raw/
│   │   ├── sub-001/
│   │   ├── sub-002/
│   │   ├── sub-003/
│   │   └── derivatives/
│   │       ├── deface/
│   │       ├── fmriprep/
│   │       ├── freesurfer/
│   │       └── mriqc/
│   ├── dicom/
│   └── work/
├── README 
├── dataset_description.json 
└── CHANGES 
BIDS (optimal):
project-princeton/
├── code/
│   ├── analysis/
│   ├── preprocessing/
│   └── task/
├── sourcedata/
│   ├── behavior/
│   ├── raw/
│   │   ├── sub-001/
│   │   ├── sub-002/
│   │   └── sub-003/
│   └── dicom/
├── derivatives/
│   ├── deface/
│   ├── fmriprep/
│   ├── freesurfer/
│   ├── mriqc/
│   └── work/
├── README 
├── dataset_description.json 
└── CHANGES 

Example 3: YODA

YODA introduces a set of principles for best practices for data analysis. Here, we nest several of the top-level example directories (ci, docs, and envs) into the code directory. None of these changes interfere with the YODA principles. A critical principle of YODA is that source data are referenced from within a derivative dataset. This recursive structure is now the default for BIDS Apps like fMRIPrep (as of version 20.2.1).

The two layouts below are, in order: YODA (original) and BIDS (minimal).
├── ci/
│   └── .travis.yml
├── code/
│   ├── tests/ 
│   │   └── test_myscript.py
│   └── myscript.py
├── docs/
│   ├── build/
│   └── source/
├── envs/
│   └── Singularity
├── inputs/ 
│   └── data/
│       ├── dataset1/
│       │   └── datafile_a
│       └── dataset2/
│           └── datafile_a
├── important_results/
│   └── figures/
├── CHANGELOG.md
├── HOWTO.md
└── README.md
BIDS (minimal):
project-yoda/
├── code/
│   ├── ci/
│   │   └── .travis.yml
│   ├── tests/ 
│   │   └── test_myscript.py
│   ├── envs/
│   │   └── Singularity
│   ├── docs/
│   │   ├── build/
│   │   └── source/
│   ├── myscript.py
│   └── HOWTO.md
├── sourcedata/ 
│   └── data/
│       ├── dataset-1/
│       │   └── datafile-a
│       └── dataset-2/
│           └── datafile-a
├── derivatives/
│   └── results-important/
│       └── figures/
├── CHANGES
├── dataset_description.json 
└── README

Example 4: BIDS-MEGA

The proposed top-level directory structure for BIDS-MEGA BEP035 is already nearly BIDS-compliant. The only substantive change is to nest the study-* directories within sourcedata.

The two layouts below are, in order: BIDS-MEGA (original) and BIDS (minimal).
my_megaanalysis/
├── dataset_description.json
├── studies.json
├── studies.tsv
├── derivatives/
│   ├── nimare-0.0.10
│   :
│
├── study-doe2012/
│   ├── dataset_description.json
│   ├── participants.json
│   ├── participants.tsv
│   ├── derivatives/
│   ├── sub-001/
│   ├── sub-002/
│   :
│
├── study-mustermann2017/
│   ├── dataset_description.json
│   ├── participants.json
│   ├── participants.tsv
│   ├── derivatives/
│   ├── sub-001/
│   ├── sub-002/
│   :
│
├── study-smith2015/
:   ├── dataset_description.json
    ├── participants.json
    ├── participants.tsv
    ├── derivatives/
    ├── sub-001/
    ├── sub-002/
    :
BIDS (minimal):
project-megaanalysis/
├── code/
│   ├── studies.json
│   └── studies.tsv
├── sourcedata/
│   ├── study-doe2012/
│   │   ├── dataset_description.json
│   │   ├── participants.json
│   │   ├── participants.tsv
│   │   ├── derivatives/
│   │   ├── sub-001/
│   │   └── sub-002/
│   ├── study-mustermann2017/
│   │   ├── dataset_description.json
│   │   ├── participants.json
│   │   ├── participants.tsv
│   │   ├── derivatives/
│   │   ├── sub-001/
│   │   └── sub-002/
│   └── study-smith2015/
│       ├── dataset_description.json
│       ├── participants.json
│       ├── participants.tsv
│       ├── derivatives/
│       ├── sub-001/
│       └── sub-002/
├── derivatives/
│   └── nimare-0.0.10/
├── CHANGES
├── dataset_description.json 
└── README
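
The one substantive change described for BIDS-MEGA, nesting the study-* directories within sourcedata, might be sketched like this. The function name is hypothetical, and the sketch does not handle studies.json / studies.tsv, which the minimal tree above places under code/.

```python
import shutil
from pathlib import Path


def nest_studies(root: Path) -> list[str]:
    """Move every top-level study-* directory into sourcedata/."""
    sourcedata = root / "sourcedata"
    sourcedata.mkdir(exist_ok=True)
    moved = []
    # sorted() materializes the matches before anything is moved
    for study in sorted(root.glob("study-*")):
        if study.is_dir():
            shutil.move(str(study), str(sourcedata / study.name))
            moved.append(study.name)
    return moved
```

The function returns the names of the directories it moved, which makes it easy to log or review the change before committing it.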

@yarikoptic yarikoptic changed the title Add DataType="project" and rework existing "layout" example into a proper BIDS dataset Add DatasetType="project" and rework existing "layout" example into a proper BIDS dataset Jun 22, 2024
@nikhil153

Hi @yarikoptic, @snastase, @Remi-Gau , @jbpoline, @michellewang,

Here is a revised nipoppy layout that conforms to the BIDS (minimal) proposal based on our discussions. The key motivations of nipoppy are as follows:

  1. We consider nipoppy primarily as a protocol for study coordinators / data managers dealing with iterative data capture, curation, processing, and tracking tasks. This widens our scope beyond typical BIDSification to include several additional files and processing support (e.g. Boutiques).
  2. The protocol starts with the creation of two project/study-level files, i.e. nipoppy_manifest and nipoppy_config, during the data capture stage, irrespective of data types and modalities. These files provide a starting point (i.e. ground truth) for all subsequent nipoppy protocol stages, and are therefore intuitively placed at the root of the dataset.
  3. We expect phenotypic (i.e. tabular) and imaging data to be collected and organized via independent workflows and people. Thus we prefer the aggregated phenotypic mode and place this subdirectory on the same level as the imaging subdirectory within sourcedata. This simplifies access control and tracking (i.e. bagels) of data availability.

Hope this makes sense!
Let me know your thoughts on the sample layout below. We would like to finalize it soon, as we are training several collaborators and deploying it for their studies in the coming weeks.

Thanks!

<DATASET_ROOT>/
├── nipoppy_config.json (goes into bidsignore)
├── nipoppy_manifest.json (goes into bidsignore)
├── scratch/ (goes into bidsignore)
├── sourcedata/
│   ├── downloads/ 
│   ├── raw_imaging/
│   │   ├── unorg/ (possibly this can be moved into downloads...) 
│   │   └── org/
│   ├── imaging/
│   │   ├── participants.tsv
│   │   ├── sub-01/
│   │   │   ├── anat
│   │   │   └── func
│   │   └── sub-02/
│   │       ├── anat
│   │       └── dwi
│   └── tabular/
│       ├── demographics/
│       ├── assessments/
│       └── bagel.csv
├── code/
│   ├── proc/
│   │   ├── invocations/
│   │   ├── descriptors/
│   │   ├── tracker_configs/
│   │   └── pybids/
│   │       ├── bids_db/
│   │       └── ignore_patterns/
│   ├── utils/
│   │   ├── generate_manifest.py
│   │   └── download_dicoms.py
│   └── analysis/
│       ├── run_func_connectivity.py
│       └── run_my_fancy_ML_model.py
├── derivatives/
│   ├── fmriprep/
│   │   ├── 20.2.7/
│   │   └── 23.1.3/
│   ├── mriqc/
│   │   └── 23.1.0/
│   └── bagel.csv
├── README 
├── dataset_description.json 
├── CHANGES 
└── .bidsignore
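
For illustration, the entries annotated "(goes into bidsignore)" in the layout above could be written to .bidsignore with a small helper like the one below. The helper name is hypothetical, and the exact patterns are an assumption: .bidsignore follows .gitignore-style syntax, so other spellings would work as well.

```python
from pathlib import Path

# Entries annotated "(goes into bidsignore)" in the layout above.
IGNORED = ["nipoppy_config.json", "nipoppy_manifest.json", "scratch/"]


def write_bidsignore(root: Path, patterns=IGNORED) -> Path:
    """Write one ignore pattern per line to <root>/.bidsignore."""
    path = root / ".bidsignore"
    path.write_text("\n".join(patterns) + "\n")
    return path
```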

@yarikoptic
Collaborator Author

yarikoptic commented Jul 3, 2024

Thank you @nikhil153! It does make sense. Let me just run one more idea by you. Many tools (e.g. git, datalad, heudiconv) place their configuration, "state", etc. files into a corresponding .dotdir. That allows for better modular composition without affecting the overall "well-being" of the otherwise unrelated project folder. (Imagine if all .git/* files were at the top level?) Would you consider, instead of

├── nipoppy_config.json (goes into bidsignore)
├── nipoppy_manifest.json (goes into bidsignore)

having a .nipoppy/ (does not need to go into bidsignore - .dotdirs are ignored IIRC) with config.json and manifest.json? That would also make it easy to extend (should you decide to add more files). FWIW, even scratch/ could go there (like .git/tmp), but I am also OK with it at the top level - it is a generic pattern and we could provision for it explicitly even at the level of the standard (that there is a scratch/ to be ignored, etc.).

@nikhil153

@yarikoptic - thanks for a quick review!

We did consider having a nipoppy dir to clearly separate BIDS specification from nipoppy-protocol related files and intermediate output. Although it could work, we had concerns about visibility of key files i.e. manifest.csv and config.json which we expect the users to interact with regularly. If these files are hidden by default, it is likely to be confusing for the study coordinators / data managers, who are responsible for owning and maintaining them. Apart from these two files, I believe everything else could go into .dotdir if that's preferred.

@jbpoline, @michellewang - Happy to hear thoughts / alternatives!

@yarikoptic
Collaborator Author

Indeed, I agree that if files are often worked on, a .dotdir might be too hidden. But IMHO a nipoppy/ folder is as visible, and might be even more so when there is also a good number of sub- folders below, which would push the file listing to the very bottom. Another benefit -- nipoppy users, after changing to the nipoppy/ folder, might feel at "home", as all nipoppy files would be conveniently grouped for them in that folder.

@yarikoptic
Collaborator Author

yarikoptic commented Jul 18, 2024

@nikhil153 re above, I wonder if you intend to have sourcedata/imaging/ to

  • be BIDS dataset
  • only imaging BIDS dataset?

or alternatively -- where would you place "raw BIDS dataset" which might incorporate behavioral, phenotypic(, physiological, ...) data?

@nikhil153

nikhil153 commented Jul 28, 2024

Hi @yarikoptic -

re: I wonder if you intend to have sourcedata/imaging/ to
- be BIDS dataset
- only imaging BIDS dataset?

In that proposal, it was meant to be only imaging BIDS dataset.

However, there have been new developments based on the feedback we received in our training workshops with several collaborators over the last few weeks, including at ENIGMA-PD, where we are trying to define and deploy a common end-to-end study SOP / protocol for many contributing sites.

The tricky thing for us has been the alignment of the BIDS layout with the stages of the nipoppy protocol (see Fig-1), i.e. capture --> curate --> process --> track --> extract. Most data managers currently associate BIDS mainly with the curation of imaging data and leave curation of phenotypic data and derivatives to the users. Thus for the nipoppy protocol, BIDSification remains one of its many stages. This makes it less intuitive to see nipoppy as one of the submodules of the BIDS specification - especially if we try to fit within the current BIDS1.0 layout.

Specifically, dumping all non-derived nipoppy data directories inside sourcedata gets confusing. Especially because that diverges from the current definitions of directory levels of sourcedata and rawdata (see this and this).

So based on our discussions with collaborators, we have revised the proposed layout (see below). The major change from the last proposal comprises keeping sourcedata, imaging, phenotypic, derivative data directories to be at the top level. This results in:

  1. Only imaging directory remains BIDS compliant in the current specification.
  2. Allows more granular specification of imaging data wrangling and preparation of BIDSification within sourcedata and avoids confusion with rawdata
  3. Simplifies (parallel) curation and access control of phenotypic and derived data.

These decisions are mainly motivated by our need to keep the study protocol as intuitive as possible - especially for ongoing and prospective studies. I am happy to discuss how this can be refined and possibly inform BIDS2.0 study specification.

Maybe we can set up a call?

my_study/
├── manifest.tsv (participant details: populate during recruitment / first step of the protocol) 
├── global_config.json (processing details) 
├── sourcedata (imaging + pheno + other stuff that is captured during the study) /
│   ├── imaging (dicoms / nifties/ parrec) /
│   │   ├── messy (data dumps)
│   │   └── ordered (e.g. subject/session/<files>) 
│   └── tabular (unorganized Excel sheets / RedCAP reports) 
├── imaging (the only BIDS compliant subdir now) /
│   ├── dataset_description.json 
│   ├── sub-01/
│   │   └── ses-A/
│   │       ├── anat
│   │       └── dwi
│   └── sub-02/
│       └── ses-A/
│           ├── anat
│           └── dwi
├── tabular (we could call this "pheno" but we have a slight preference for "tabular" because we don't store imaging-phenotypes here)/
│   ├── demographics.tsv (file with pre-specified column names for universal variables such as age, sex, group) 
│   ├── assessments (optional study-specific behavioural, cognitive, physiological variables)/
│   │   ├── instrument_A.tsv
│   │   └── instrument_B.tsv 
│   └── status.tsv (tracked availability of tabular data)
├── derivatives /
│   ├── pipeline_A/
│   │   └── 1.0/
│   │       ├── output (pipeline output)
│   │       └── idp (extracted analysis-ready aggregated measures) 
│   └── status.tsv (tracked availability of imaging data)
├── workflow_configs /
│   ├── containers/
│   │   └── pipeline_A-1.0.sif (possibly a symlink)
│   ├── descriptors/
│   │   └── pipeline_A-1.0.json
│   ├── invocations/
│   │   └── pipeline_A-1.0.json
│   ├── tracker/
│   │   └── pipeline_A-1.0.json
│   └── pybids_ignore_patterns/
│       └── pipeline_A-1.0.json
├── code/
│   ├── utils /
│   │   └── generate_manifest.py 
│   └── analysis/
│       └── run_my_fancy_ML_model.py 
├── scratch (should be deletable) /
│   ├── pybids_db
│   ├── downloads (.tar.gz) 
│   └── work (tmp working dir for pipelines) 
└── logs

@nikhil153

Hi @yarikoptic, @jbpoline - just pinging to check if you had any comments?

@jbpoline
Contributor

jbpoline commented Aug 8, 2024

Hi, this sounds reasonable to me, but it would be great to discuss with @yarikoptic and others (@arokem, @dorahermes, @cmaumet, @CPernet ...). I think it also relates to the larger issue of the BIDS scope and whether or not it is beneficial for BIDS to have all types of data as subcomponents. There are pros and cons to that, but proposing something that enables adoption by communities other than the neuroimaging one would be important to me.

@CPernet
Collaborator

CPernet commented Aug 9, 2024

I am not sure if the goal is to have a validation of the entire structure, ignoring explicitly some parts, or add a config file to specify which folder needs validation.

assuming you use the BIDS validator with some config file, the validation would work on
imaging
tabular/phenotype (seem easy enough to have a subfolder)
derivatives --> you can choose to make a derivatives dataset (i.e. add a json) with source pointing to the imaging folder (OK, for now the validator does not do much on derivatives, but in principle it can)
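
A sketch of what "add a json" could look like for a derivatives dataset: a dataset_description.json with DatasetType "derivative", a GeneratedBy entry, and a SourceDatasets entry pointing at the imaging folder. The function name, the dataset name, the BIDSVersion string, and the use of a relative path as the URL are assumptions made for illustration.

```python
import json
from pathlib import Path


def write_derivative_description(
    deriv_dir: Path, pipeline: str, version: str, source_url: str
) -> Path:
    """Write a dataset_description.json for a derivatives dataset."""
    description = {
        "Name": f"{pipeline} outputs",  # assumption: free-form name
        "BIDSVersion": "1.9.0",  # assumption: placeholder version
        "DatasetType": "derivative",
        "GeneratedBy": [{"Name": pipeline, "Version": version}],
        # Assumption: a relative path used as the URL of the source dataset.
        "SourceDatasets": [{"URL": source_url}],
    }
    deriv_dir.mkdir(parents=True, exist_ok=True)
    path = deriv_dir / "dataset_description.json"
    path.write_text(json.dumps(description, indent=2) + "\n")
    return path
```

For the layout discussed above, a call might look like `write_derivative_description(Path("my_study/derivatives/pipeline_A/1.0"), "pipeline_A", "1.0", "../../../imaging")`.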

@yarikoptic
Collaborator Author

@nikhil153 sorry for the delays, I am still coming out of workation mode into proper work mode - need more time to digest/reply.

@CPernet I think we somewhat hijacked this PR into something more important and probably more useful - an attempt to harmonize existing schemes (e.g. of nipoppy) into a prototypical "BIDS project". For the purpose of this PR I was not looking into extending BIDS beyond what it already has really in terms of "official folders" -- all those could be added later.

The main point behind this PR is that even current BIDS, with its existing specification for having CHANGES, dataset_description.json, README.md, code/, sourcedata/, and derivatives/, already covers the majority of the needs of existing schemes, without any changes to the BIDS standard at all. How we could extend it later, and how to bolt on various validation helpers, is IMHO beyond the scope of this PR. For this PR the validator would indeed just need to verify the DatasetType value (once fixed in the validator) and otherwise inspect those other already standard files and locations.
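
To make the check described above concrete, here is a rough sketch of reading the DatasetType value and reporting which of the already standard files and locations are present. This is not the BIDS validator; the function name and the returned dictionary are hypothetical, and "project" is the value proposed in this PR.

```python
import json
from pathlib import Path

# Files and directories named in the comment above.
STANDARD_ENTRIES = [
    "CHANGES", "README", "dataset_description.json",
    "code", "sourcedata", "derivatives",
]
# "raw" and "derivative" are existing values; "project" is proposed in this PR.
ALLOWED_TYPES = {"raw", "derivative", "project"}


def _exists(root: Path, name: str) -> bool:
    # README may carry an extension such as .md
    if name == "README":
        return any(root.glob("README*"))
    return (root / name).exists()


def inspect_project(root: Path) -> dict:
    """Report the DatasetType and which standard entries are present."""
    desc_path = root / "dataset_description.json"
    dataset_type = None
    if desc_path.is_file():
        # A missing DatasetType is treated as "raw"
        dataset_type = json.loads(desc_path.read_text()).get("DatasetType", "raw")
    return {
        "dataset_type": dataset_type,
        "type_ok": dataset_type in ALLOWED_TYPES,
        "present": [e for e in STANDARD_ENTRIES if _exists(root, e)],
        "missing": [e for e in STANDARD_ENTRIES if not _exists(root, e)],
    }
```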

I will take it out of draft since I think it is ready for the review (and it has been reviewed ;) ) although indeed to be merged/released , we better make validator ready for it.

@yarikoptic yarikoptic marked this pull request as ready for review August 9, 2024 17:49
@yarikoptic
Collaborator Author

Dear all participants of this PR and @bids-maintenance folks -- I would like to move the discussions we had on "how to transform X into BIDS project" into some document so

  • they do not divert our attention from actual PR content
  • we have better convenience in discussing them (PR comments on text are much better than free flow of comments in the main post)
  • they do not get buried in the comments whenever this PR is merged -- I think they are great and worth their own life.

I do not think that bids-specification itself is a good place. But maybe somewhere on the new website or the "old" bids-starter-kit? (attn @Remi-Gau )

@Remi-Gau
Collaborator

"How to" should go on the new bids website.

7 participants