MRG: store taxids in lineageDB #2466

bluegenes · 2023-02-08T16:51:32Z

Summary: If NCBI taxpath is available in lineages csv file, parse + store taxids in LineageDB for later use.

Details:

- Remove final usage of `lca_utils.LineagePair` in `tax_utils.py` (from WIP: fix a few more uses of lca_utils.LineagePair #2465).
- If a `taxpath` column is present, read and store it using `tax_utils.LineagePair` in `LineageDB`.
- Add a test NCBI lineages file, `tests/test-data/tax/test.ncbi-taxonomy.csv`.
  - Provides NCBI taxonomy (including `|`-separated taxpath) for the test accessions in `test1.gather.csv`; generated via https://github.com/sourmash-bio/build-ncbi-lineages.
- Test that taxid is being utilized properly.
  - kreport is the only existing function that outputs taxid (previously empty). kreport test added.
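
A minimal sketch of the taxpath handling described above, not the sourmash implementation. It assumes a lineages CSV with an `ident` column, named rank columns, and an optional `taxpath` column holding `|`-separated taxids (one per rank). The `LineagePair` namedtuple, the `read_lineages` helper, and the example rows are hypothetical stand-ins, not the `tax_utils` classes or real NCBI data.

```python
import csv
import io
from collections import namedtuple

# Hypothetical stand-in for tax_utils.LineagePair; field names are assumptions.
LineagePair = namedtuple("LineagePair", ["rank", "name", "taxid"])

RANKS = ("superkingdom", "phylum", "class", "order", "family", "genus", "species")


def read_lineages(fp):
    """Map each identifier to a tuple of LineagePair, with taxids if available."""
    lineages = {}
    for row in csv.DictReader(fp):
        names = [row.get(rank, "") for rank in RANKS]
        taxpath = row.get("taxpath") or ""
        taxids = taxpath.split("|") if taxpath else []
        # Pad so rows without a (complete) taxpath get taxid=None.
        taxids += [None] * (len(names) - len(taxids))
        lineages[row["ident"]] = tuple(
            LineagePair(rank, name, taxid or None)
            for rank, name, taxid in zip(RANKS, names, taxids)
        )
    return lineages


# Made-up rows: single-letter names and invented taxids, for illustration only.
example_csv = io.StringIO(
    "ident,taxpath,superkingdom,phylum,class,order,family,genus,species\n"
    "acc1,11|22|33|44|55|66|77,A,B,C,D,E,F,G\n"
    "acc2,,A,B,,,,,\n"
)
db = read_lineages(example_csv)
print(db["acc1"][0])
```

Rows without a taxpath still load; their taxids are simply `None`, which matches the "if available" behavior in the summary.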

A future PR will add a CAMI output function, which will include taxid.

codecov · 2023-02-08T16:59:20Z

Codecov Report

Merging #2466 (7995ad9) into latest (ac400fa) will decrease coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           latest    #2466      +/-   ##
==========================================
- Coverage   84.57%   84.57%   -0.01%     
==========================================
  Files         132      132              
  Lines       15484    15492       +8     
  Branches     2507     2510       +3     
==========================================
+ Hits        13096    13102       +6     
- Misses       2084     2085       +1     
- Partials      304      305       +1

| Flag | Coverage Δ | |
|---|---|---|
| python | `92.53% <100.00%> (-0.02%)` | ⬇️ |
| rust | `57.79% <ø> (ø)` | |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/sourmash/tax/__main__.py | `93.77% <100.00%> (ø)` | |
| src/sourmash/tax/tax_utils.py | `97.91% <100.00%> (-0.17%)` | ⬇️ |


bluegenes · 2023-02-09T03:58:26Z

@sourmash-bio/devs ready for review

ctb

Well that's nice and neat!

# sourmash release 4.7.0

Major new features:

* provide an initial plugin architecture for sourmash that supports new signature saving & loading mechanisms (#2428)
* add plugin support for new command-line subcommands (#2438)
* debias all containment values (#2243)

Minor new features:

* Use RankLineageInfo to simplify reading lineages (#2467)
* store taxids in lineageDB (#2466)
* Use new tax classes for taxonomic summarization (#2443)
* add tax summarization dataclasses for safety and flexibility (#2439)
* add `--scaled` to sourmash compare (#2414)
* replace `lca_utils.LineagePair` with `tax_utils.LineagePair` (#2441)
* Add new classes for lineage manipulation (#2437)

Cleanup and documentation updates:

* ReadTheDocs updates (#2445)
* update `sourmash compare` command-line docs (#2400)

Developer updates:

* fix python tests by bumping tox and pip cache versions (#2424)
* Update sphinx requirement from <6,>=4.4.0 to >=4.4.0,<7 (#2430)
* Build: replace milksnake with maturin (#2393)
* importlib_metadata is a dependency on old Python versions (#2484)
* Release docs: use two separate sed commands (#2483)
* minor fixes to release behavior (#2479)
* Use screed and maturin from nixpkgs in `flake.nix` (#2481)
* update release procedure after v4.6.0 and v4.6.1 (#2386)
* Update makefile and docs (#2432)

Dependabot updates:

* Bump once_cell from 1.17.0 to 1.17.1 (#2488)
* Bump ouroboros from 0.15.5 to 0.15.6 (#2487)
* Bump memmap2 from 0.5.8 to 0.5.9 (#2486)
* Bump supercharge/redis-github-action from 1.4.0 to 1.5.0 (#2485)
* Bump proptest from 1.0.0 to 1.1.0 (#2460)
* Bump web-sys from 0.3.60 to 0.3.61 (#2461)
* Bump serde_json from 1.0.91 to 1.0.93 (#2471)
* Bump wasm-bindgen-test from 0.3.33 to 0.3.34 (#2463)
* Bump cachix/install-nix-action from 18 to 19 (#2459)
* Bump wasm-bindgen from 0.2.83 to 0.2.84 (#2464)
* Bump typed-builder from 0.11.0 to 0.12.0 (#2451)
* Bump bumpalo from 3.9.1 to 3.12.0 (#2450)
* Bump pypa/cibuildwheel from 2.11.4 to 2.12.0 (#2447)
* Bump bzip2 from 0.4.3 to 0.4.4 (#2444)
* Bump once_cell from 1.14.0 to 1.17.0 (#2429)
* Bump serde from 1.0.151 to 1.0.152 (#2423)
* Bump pypa/cibuildwheel from 2.11.3 to 2.11.4 (#2422)
* Bump serde_json from 1.0.89 to 1.0.91 (#2418)
* Bump serde from 1.0.150 to 1.0.151 (#2419)
* Bump thiserror from 1.0.37 to 1.0.38 (#2417)
* Bump finch from 0.4.3 to 0.5.0 (#2416)
* Bump rayon from 1.6.0 to 1.6.1 (#2404)
* Bump serde from 1.0.149 to 1.0.150 (#2403)
* Bump pypa/cibuildwheel from 2.11.2 to 2.11.3 (#2402)
* Bump serde from 1.0.148 to 1.0.149 (#2397)
* Bump capnp from 0.14.5 to 0.14.11 (#2396)

## Add taxonomic utilities for LINs; enable and test `tax metagenome`

With taxonomy refactoring (#2437, #2439, #2443, #2446, #2466, #2467), we are (mostly) no longer tied to named ranks. Here, I add a class for LIN taxonomies and use it within `tax metagenome` to allow summarization up LINs and reporting at specified `lingroups`.

With this PR, users can now use the flag `--lins` to read and use `lin` taxonomies from the provided tax (`-t`, `--taxonomy`) file. If used, `sourmash tax` will look for a `lin` column in the taxonomy file instead of looking for `superkingdom`...`strain` columns. The `lin` column should contain `;`-separated LINs, preferably with a standard number of positions (e.g. all 20 positions in length or all 10 positions in length).

For `tax metagenome`: By default, `tax metagenome` will summarize up _all_ available ranks/LIN positions. If a `lingroup` file is provided, we will also report a subset of this summary: just the LIN prefixes that match groups in the `lingroup` file. The `lingroup` file requires two columns in any order: `name`, the name of the group, and `lin`, the lin prefix of the group. The prefix will be used to select results from the full summary for reporting. The `lingroup` format will build a file with the following name: `{base}.lingroup.tsv`, where `{base}` is the name provided via the `-o`, `--output-base` option.

## Demo / Tutorial

A draft tutorial is available [here](https://sourmash--2469.org.readthedocs.build/en/2469/tutorial-lin-taxonomy.html). Note that it does not contain the installation info for this branch (see below).

You can run the interactive version via binder [here](https://mybinder.org/v2/gh/bluegenes/2023-demo-sourmash-LIN/HEAD?labpath=sourmash-lin-demo.ipynb).

## Testing

### Option A: Use the Demo Binder

You can test via the [binder](https://mybinder.org/v2/gh/bluegenes/2023-demo-sourmash-LIN/HEAD?labpath=sourmash-lin-demo.ipynb). You can add new cells or modify any existing cells, and even download additional files for testing. The downside is that you'll have to make sure to download and save your results, since the binder won't save them for you.

### Option B: Alternatively, install on your own computer/cluster:

Here is one way to test this code before it gets fully integrated into sourmash:

- If you don't have conda, I'd recommend installing `mamba` instead ([instructions here](https://mamba.readthedocs.io/en/latest/installation.html)).
- If you do have `mamba`, replace the word `conda` with `mamba` in the following commands.

Download an environment file that points to this branch:

```
curl -JLO https://github.com/bluegenes/2023-demo-sourmash-LIN/main/sourmashLIN.yml
```

Create a virtual environment using this file:

```
conda env create -f sourmashLIN.yml
```

Activate that environment:

```
conda activate smashLIN
```

Make sure `--lins` is in the `--help` for `sourmash tax metagenome`:

```
sourmash tax metagenome --help
```

## Command to run

The command to run is this one:

```
sourmash tax metagenome -g $gather_csv -t $taxonomy_csv \
    --lins --lingroup $lingroups_csv
```

## Types of files you'll need

1. sketches of query metagenome
2. sketches of reference genomes (database)
3. taxonomy file with LIN information (two columns required: `ident`, `lin`)
4. lingroup information file (two columns required: `name`, `lin`)

To exit the environment when you're done testing, use `conda deactivate`.

> Reminder, if you have `mamba`, you can use it in place of `conda` in the commands above.

Example `lingroup` output format. Note that the `1;0`.. paths are always grouped together, but may come before or after the `0;0` and `2;0` groups.

```
name	lin	percent_containment	num_bp_contained
lg3	2;0;0	1.56	192000
lg1	0;0;0	5.82	714000
lg2	1;0;0	5.05	620000
lg3	1;0;1	0.65	80000
lg4	1;0;1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0	0.65	80000
```

```
name	lin	percent_containment	num_bp_contained
lg2	1;0;0	5.05	620000
lg3	1;0;1	0.65	80000
lg4	1;0;1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0	0.65	80000
lg1	0;0;0	5.82	714000
lg3	2;0;0	1.56	192000
```

## A few implementation details:

- In `tax_utils.py`, I add a `LINLineageInfo` class for using and manipulating LIN taxonomies. It implements new methods specifically for reading `LIN` taxonomies into the class, but otherwise uses the taxonomic utilities available in `BaseLineageInfo`, e.g. taxonomic summarization up ranks, assessing whether two taxonomies are a match at a given rank.
- In `tax_utils.py`, I add functionality for reading `lingroup` information and reporting taxonomic summarization specifically at these ranks.

Changes and Additions:

- [x] Add `LINLineageInfo` for working with `LIN` taxonomies
- [x] Add method for reading `LIN`s into `LineageDB`
- [x] Add methods for reading `LINgroups` and summarizing to these
- [x] Add LineageTree that can use `LineageInfo` to perform `build_tree`, `find_lca` functions (originally in `lca_utils.py`) and produce an ordered list of lineage paths
- [x] Add code + tests to use `LIN`s taxonomy in:
  - [x] tax metagenome
  - [x] tax annotate
  - [x] tax summarize

The following require additional changes and will be punted to an issue/separate PR (see #2499):

- tax genome
- tax prepare
- tax grep

---------

Co-authored-by: C. Titus Brown <titus@idyll.org>

fix LineagePair usage?

7cc6e3f

bluegenes added 2 commits February 8, 2023 11:38

read in taxids if avail and use for kreport

95bcf8e

fix comment

3418594

bluegenes mentioned this pull request Feb 9, 2023

WIP: fix a few more uses of lca_utils.LineagePair #2465

Closed

bluegenes marked this pull request as ready for review February 9, 2023 03:16

bluegenes changed the title ~~WIP: store taxids in lineageDB~~ MRG: store taxids in lineageDB Feb 9, 2023

bluegenes added 2 commits February 8, 2023 19:48

update doc to reflect taxid in kreport

81545a4

undelete line

7995ad9

ctb approved these changes Feb 9, 2023


bluegenes merged commit fa3ead6 into latest Feb 9, 2023

bluegenes deleted the use-taxids branch February 9, 2023 15:42

bluegenes mentioned this pull request Feb 17, 2023

MRG: Add taxonomic utilities for LINs and enable tax metagenome #2469

Merged

8 tasks

ctb mentioned this pull request Mar 3, 2023

4.7.0 release #2497

Merged
