
New tool: Required publication references #236

Open
ewels opened this issue Jan 2, 2019 · 23 comments
Labels
command line tools: Anything to do with the CLI interfaces

Comments

@ewels

ewels commented Jan 2, 2019

It would be nice to make it easier for people to know what should be referenced if they use a pipeline in a manuscript. For example, nf-core references <pipeline-name> could return a list of the references that you need to add into your paper. (alt names: nf-core refs, nf-core bib..?)

Different flags could give different output formats, but perhaps the default could be prose text. For example:

Data was processed using nf-core/rnaseq [pipeline DOI, nf-core paper]. This pipeline is built using nextflow [nextflow paper] and uses the following tools: FastQC (Quality control of raw data) [ref], TrimGalore! (Trimming of adapter sequence contamination) [ref], STAR (Alignment of RNA-seq reads to the reference genome) [ref] …etc

Need to think about where and how to capture this information in the pipeline files. For example, a simple YAML file could work nicely:

tools:
  fastqc:
    name: FastQC
    description: Quality control of raw data
    ref: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
  trimgalore:
    name: Trim Galore!
    description: Trimming of adapter sequence contamination
    ref:
      - https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/
      - 10.14806/ej.17.1.200
  star:
    name: STAR
    description: Alignment of RNA-seq reads to the reference genome
    ref: 10.1093/bioinformatics/bts635

Requirements:

  • Should handle either DOI or URL (DOI preferable where available)
  • Should be able to handle multiple references per tool
    • Alternatively, force one per tool and instead list multiple tools? eg. have Cutadapt in its own entry above.
  • Name and reference should be mandatory
  • Additional text per tool should be as short as possible

Output options could be:

  • List of references alone
  • List of tool names and references
  • Full prose text
  • Prose text without additional tool descriptions
  • Option to give references in different formats, with a DOI lookup
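
As a concrete sketch of the default prose output, consuming metadata of the shape proposed above. The `prose_references` function and the inline dict are illustrative only, not nf-core code:

```python
# Sketch: turn per-tool metadata (as in the YAML example above) into a
# prose citation sentence. Function and variable names are hypothetical.
tools = {
    "fastqc": {
        "name": "FastQC",
        "description": "Quality control of raw data",
        "ref": "https://www.bioinformatics.babraham.ac.uk/projects/fastqc/",
    },
    "star": {
        "name": "STAR",
        "description": "Alignment of RNA-seq reads to the reference genome",
        "ref": "10.1093/bioinformatics/bts635",
    },
}

def prose_references(pipeline, tools):
    """Build the 'Data was processed using ...' sentence."""
    parts = [
        f"{t['name']} ({t['description']}) [{t['ref']}]" for t in tools.values()
    ]
    return (
        f"Data was processed using {pipeline}. This pipeline is built using "
        f"nextflow and uses the following tools: {', '.join(parts)}."
    )

print(prose_references("nf-core/rnaseq", tools))
```

Dropping the `({t['description']})` piece would give the "prose text without additional tool descriptions" variant.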

The nextflow and nf-core references can be hardcoded. The workflow DOI can be lifted from README.md I guess. Or could potentially be added as a new workflow.metadata variable?

Thoughts / feedback?

Phil

ewels added the command line tools label Jan 2, 2019
@maxulysse

I would also add the version of each tool.

@drpatelh

drpatelh commented Jan 2, 2019

It might be good to host a central database (e.g. yaml) of tools and their associated information. This can then be used to parse the conda yaml to create a tool specific publication description that would be linked by release to the pipeline. It would be much neater to just reference the pipeline in papers (if morally possible) - with a sentence pointing to the pipeline for all the tool-specific citations. I've often been asked to trim down text and a decision may need to be made as to which tools you cite... I generally provide a short description of the tool, version, reference and pubmed id. Maybe we can provide this as a file that gets bundled with the pipeline that can be linked on the pipeline home page?

@sven1103

sven1103 commented Jan 2, 2019

Hm, I was just thinking that we get this information for free over the Anaconda API, right?

For example:
https://api.anaconda.org/package/bioconda/samtools

Although package maintainers do not always fill in all the fields (which is bad!).

So instead of having another YAML file, we could use the environment.yml. If a package does not provide a description, it might be good practice to contact the package maintainer to add one?
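
As a sketch of the environment.yml route: the dependency pins already carry tool names and versions, so no extra file is needed for those. The parsing below is hand-rolled for illustration (assuming `channel::name=version` pins, as in typical nf-core environment files) rather than using a YAML library:

```python
import re

# Sketch: pull (channel, name, version) out of an environment.yml
# dependencies list like "bioconda::fastqc=0.11.8". Hand-rolled
# parsing to keep the example dependency-free; the file content
# below is an illustrative mock.
ENV_YML = """\
name: nf-core-rnaseq-1.2
channels:
  - bioconda
dependencies:
  - bioconda::fastqc=0.11.8
  - bioconda::trim-galore=0.5.0
  - bioconda::star=2.6.1d
"""

def parse_dependencies(text):
    deps = []
    for line in text.splitlines():
        m = re.match(r"\s*-\s*(?:(\S+)::)?([A-Za-z0-9._-]+)=(\S+)", line)
        if m:
            deps.append(
                {"channel": m.group(1), "name": m.group(2), "version": m.group(3)}
            )
    return deps

for dep in parse_dependencies(ENV_YML):
    print(dep["name"], dep["version"])
```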

@maxulysse

We might want to add extra information, like an actual publication or DOI for the pipeline.

@sven1103

sven1103 commented Jan 2, 2019

Hm, I see. There is no such thing as a tool registry with DOIs and publication URIs, right? Maybe we need this...

@ewels

ewels commented Jan 2, 2019

It might be good to host a central database (e.g. yaml) of tools and their associated information.

I see where you're going with this, however I quite like that all pipelines are totally self-sufficient currently. Especially if this will be used within tool execution, as many users run offline.

It would be much neater to just reference the pipeline in papers (if morally possible)

I don't think that it is morally good to do this. If people decide that they need to do this then that can be on their shoulders, but I don't think that we should help them.

I generally provide a short description of the tool, version, reference and pubmed id.

Yes - this is basically the information that I was thinking of listing (though DOI instead of pubmed). A table with this information would be a nice output option too though..

Maybe we can provide this as a file that gets bundled with the pipeline that can be linked on the pipeline home page?

Yes, that could be very nice actually. We have an ACKNOWLEDGMENTS.txt file that we deliver with all data from our centre to try to help people to mention us in their paper. The pipelines could do the same here, so that it's obviously alongside the results files when the pipeline runs.

@ewels

ewels commented Jan 2, 2019

Hm, I was just thinking that we get this information for free over the Anaconda API, right?

Not really - we're already using this for the nf-core licences command, but it doesn't have any info about publications that I'm aware of. It's specifically the DOI / publication reference that I'm thinking of here.

Tying the names in with environment.yml and potentially using the descriptions would be a nice idea though 👍 The summary field where available should contain this. It will not describe how it's used in the pipeline though, so not as good as a specific string.

@sven1103

sven1103 commented Jan 2, 2019

Maybe should activate this discussion again: nextflow-io/nextflow#866

Tools and parameters that are used in Nextflow should be described in a structured way, so humans and machines can work with them.

I also see the tool metadata such as URI, URL, description and parameters combined there... Just brainstorming here.

@drpatelh

drpatelh commented Jan 2, 2019

How about tool-specific parameters? e.g. if you aren't using the defaults. I generally provide these as a double-quoted string for full traceability and reproducibility. Would it be enough to have these defined within main.nf, bearing in mind that these may also change between releases?

@ewels

ewels commented Jan 2, 2019

Yes, I wondered about putting this kind of information alongside the parameter schema described in that issue. However, parameters and tool metadata are distinct, so it may not make sense. For example, it could break parsing by the general form-building tools discussed on that thread. A section of nextflow.config dedicated to describing tools could work though, especially alongside the feature request for parsing tool version numbers at run time. Any thoughts @pditommaso?

How about tool-specific parameters

This is getting a bit off-topic now 😅 But yes, I think having them defined in main.nf is enough - this file is tagged with each release, so it's easy to find again. They're also in the trace and reports that are saved with the results. Personally, I think it improves code readability if they're in main.nf alongside the command template, instead of being held separately in a different location.

@pditommaso

IMO maintaining a separate annotation file does not work, because it very easily gets out of sync with the actual tools used in the pipeline script.

Ideally these info should be inferred during process execution nextflow-io/nextflow#879. Alternatively we could add an annotation in the module/process definition nextflow-io/nextflow#984.

Otherwise the best approximation could be the Conda environment file, though, if I'm understanding correctly, the problem is that it does not include the citation/paper DOI, right? Not sure, but I think that using the tool name and version it should be possible to infer the related metadata from biotools.

Pinging @bgruening and @ypriverol who should know about the state of the art of bioconda/containers /biotools interoperability.

@bgruening

@ewels @pditommaso we actually do include identifiers in Conda, see here: https://github.com/bioconda/bioconda-recipes/blob/master/recipes/multiqc/meta.yaml#L137

This means you can infer this from the conda package or bio.tools. A DOI can, and should, be added to the conda package as well.

Does this answer your question?
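
For illustration, assuming the identifiers live under `extra: identifiers:` as in the linked multiqc recipe, a minimal sketch of pulling them out of a meta.yaml with hand-rolled string matching (no YAML parser; the snippet mimics that recipe):

```python
import re

# Sketch: extract "extra: identifiers:" entries (biotools:..., doi:...)
# from a Bioconda meta.yaml without a YAML parser. The snippet below
# is a mock in the style of the multiqc recipe linked above.
META_YAML = """\
extra:
  identifiers:
    - biotools:multiqc
    - doi:10.1093/bioinformatics/btw354
"""

def parse_identifiers(text):
    """Return identifiers grouped by scheme, e.g. {'doi': [...]}."""
    ids = {}
    for m in re.finditer(r"-\s*(biotools|doi):(\S+)", text):
        ids.setdefault(m.group(1), []).append(m.group(2))
    return ids

print(parse_identifiers(META_YAML))
```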

@sven1103

sven1103 commented Jan 3, 2019

Uh, this is actually very nice.

Just checked the API request for fasttree:
https://bio.tools/api/tool/fasttree

Seems that we get the information we need from it, so no need to have an additional file.
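
A sketch of extracting publication DOIs from a response of that shape; the `publication`/`doi` field names are assumed from the bio.tools JSON, and the payload is mocked here so the example runs offline:

```python
# Sketch: pull publication DOIs out of a bio.tools API response.
# The "publication"/"doi" field names are assumptions based on the
# JSON that https://bio.tools/api/tool/<id> returns; the payload is
# a mock so no network access is needed.
def publication_dois(tool_json):
    return [
        pub["doi"]
        for pub in tool_json.get("publication", [])
        if pub.get("doi")
    ]

mock_response = {
    "name": "FastTree",
    "publication": [
        {"doi": "10.1371/journal.pone.0009490", "pmid": "20224823"},
    ],
}
print(publication_dois(mock_response))
```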

@ewels

ewels commented Jan 3, 2019

Fantastic - this is great news! Many thanks @bgruening - I didn't know that this lookup existed.
However, it looks like the identifier isn't given in the Anaconda API 😞 https://api.anaconda.org/package/bioconda/multiqc

Any ideas on how we can best fetch this information? If we can it would be great to use this method. If we want, we could even get the linter to warn if the biotools identifier is missing.

A DOI can, and should, be added to the conda package as well.

Also under the identifiers section, as done here I guess? Cool! I'll add this to the MultiQC recipe.

@bgruening

Any ideas on how we can best fetch this information? If we can it would be great to use this method. If we want, we could even get the linter to warn if the biotools identifier is missing.

Short answer is that it's part of the tarball and, with that, part of the installation, AFAIK.
Long answer is that we are working on a central service (bio.tools) to make this all way easier and also independent of conda ... so a unified interface to pkgs and containers.

Also under the identifiers section, as done here I guess? Cool! I'll add this to the MultiQC recipe.

Yes :)

@ewels

ewels commented Jan 3, 2019

ok cool, thanks!

Then I wonder if the best bet is to just try pinging the biotools API with the conda package name if it's in the bioconda channel. I guess that the two will essentially always be the same.. This won't match up versions and could in some weird edge cases give the wrong information, so not ideal. But I don't really fancy downloading and extracting all software just for this fast little utility command.

@ewels

ewels commented Jan 3, 2019

..could also just grab the raw Bioconda meta.yaml directly from GitHub and parse the identifiers from that. But again, it will be tricky to match up versions, so not a whole lot better, I guess.
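
A sketch of the URL that approach would hit, assuming the bioconda-recipes layout of `recipes/<name>/meta.yaml` on the master branch (and note it always points at the latest recipe, hence the version-matching caveat):

```python
# Sketch: raw-GitHub URL for a recipe's meta.yaml, following the
# bioconda-recipes repository layout (recipes/<name>/meta.yaml on
# the master branch). This fetches the latest recipe, so versions
# may not match the pipeline's pinned ones.
def bioconda_meta_url(package):
    return (
        "https://github.com/bioconda/bioconda-recipes/"
        f"master/recipes/{package}/meta.yaml"
    )

print(bioconda_meta_url("multiqc"))
```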

@ewels ewels closed this as completed Jan 3, 2019
@ewels ewels reopened this Jan 3, 2019
@bgruening

This depends if you always have internet access during the workflow run. I guess querying the API is ok. I suppose digging the information out of conda is also easy - which should be already available locally.

@ewels

ewels commented Jan 3, 2019

Ah true, there are two different use cases here. I was thinking primarily about a new nf-core references cli tool which would run totally separately from the workflow.

For using the data within a workflow run (eg. saving it to an ACKNOWLEDGMENTS.txt file), I think we need all of the data locally because so many people run without internet access. How would we go about finding the information from a local conda install? I've had a quick dig around but haven't found the meta.yml file yet.

@ewels

ewels commented Jan 3, 2019

..but we'd still need an internet connection for bio.tools. I think that this needs to be a separate cli tool. If we want the output as a results file with the pipeline then this should probably be a static file which is saved separately I think. If we want automation, the lint tool could check that it exists and is up to date (maybe on --release only for the latter).

@bgruening

Have a look at miniconda3/pkgs/samtools-1.8-3/info/recipe/meta.yaml
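
A sketch of locating those files under a Conda installation root; the demo builds a throwaway directory mimicking that `pkgs/<name>-<version>-<build>/info/recipe/` layout so the example runs anywhere:

```python
import tempfile
from pathlib import Path

# Sketch: find the recipe meta.yaml files that Conda keeps locally,
# following the miniconda3/pkgs/<name>-<version>-<build>/info/recipe/
# layout mentioned in the comment above.
def local_recipe_metas(conda_root):
    return sorted(Path(conda_root).glob("pkgs/*/info/recipe/meta.yaml"))

# Demo against a throwaway directory that mimics the Conda layout,
# so the example runs without a real Conda install:
demo_root = Path(tempfile.mkdtemp())
recipe_dir = demo_root / "pkgs" / "samtools-1.8-3" / "info" / "recipe"
recipe_dir.mkdir(parents=True)
(recipe_dir / "meta.yaml").write_text("package:\n  name: samtools\n")
print(local_recipe_metas(demo_root))
```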

@ewels

ewels commented Mar 18, 2021

This issue is getting much more manageable with DSL2 modules, where we have a meta file for each tool that includes DOI 🎉 (typically taken from Bioconda).

This could potentially be used both for a command line tool but also within pipelines as the meta file should be bundled within each pipeline.

@jfy133

jfy133 commented Jun 19, 2023

Following on from: #2326 (which starts providing a framework to insert this into a MultiQC report):

@maxulysse and @mashehu have both said we should automate this even more and should be possible via the DOIs in the meta.yml.

From @maxulysse a conceptual plan:

  • Adding bibtex to meta.yml.
  • Ditch the get software versions module and refactor the versions channel into a map like tools:versions+modules.
    • @mashehu suggests doing a similar channel/map as with versions here
  • Auto-generate a nice versions HTML in pure groovy based on the versions from the map (versions field)
  • Auto-generate citations based on the map (modules field) and parse modules to get citations if available

Initial problems I see:

  • Do we really want to load meta.yml into memory/process for every module we execute?
    • What about including an ext.arg in the module that holds that information and exporting it in a similar way to versions.yml?
  • How to get BibTeX information for all modules
    • Should be quite straightforward as a one-time thing using the Crossref API or similar
    • Would need to add functionality to tools to somehow pull this in when a DOI is added to meta.yml
  • How to format citations from BibTeX in Nextflow?
    • There are a few Java libraries at least for this: jbibtex and something from JabRef
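
Those libraries are Java-side; for the Python tooling, a minimal hand-rolled formatter over an already-parsed BibTeX entry could look like this sketch (the entry dict is written out by hand for illustration, standing in for whatever the bibtex field in meta.yml would parse to):

```python
# Sketch: format a citation string from a parsed BibTeX entry. The
# entry dict is hand-written for illustration; a real implementation
# would parse it from the bibtex field proposed for meta.yml.
def format_citation(entry):
    authors = entry["author"].split(" and ")
    who = authors[0] if len(authors) == 1 else f"{authors[0]} et al."
    return f"{who} ({entry['year']}). {entry['title']}. doi:{entry['doi']}"

entry = {
    "author": "Ewels, Philip and Magnusson, Mans and Lundin, Sverker and Kaller, Max",
    "title": "MultiQC: summarize analysis results for multiple tools and samples in a single report",
    "year": "2016",
    "doi": "10.1093/bioinformatics/btw354",
}
print(format_citation(entry))
```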
