Add new classes for lineage manipulation #2437

bluegenes · 2023-01-06T01:02:24Z

makes new LineagePair of typetyping.NamedTuple in tax_utils.py. This adds taxid as optional parameter, ultimately needed for CAMI output. Note that we could make this a frozen dataclass to support additional features (e.g. type checking), but it's not necessary, and NamedTuple might have better speed/size trade-offs?
add BaseLineageInfo and RankLineageInfo dataclasses, and recreate all lineage manipulation tools as methods of these classes. These will enable lineage matching and summarization with automatic rank checking for any hierarchical rank system (e.g. LINS via LINSLineageInfo).
Add some columns to test1.gather.csv that will be needed for forthcoming tax changes, which will require >=4.4 to avoid the hacky query_bp calculation I initially implemented and allow query compatibility checks before summarizing any gather results (next PR).
Since test1.gather.csv now contains what is needed for ANI estimation, take query_nhashes out of test1.gather_ani.csv so we can use it as a pre-v4.4 test. Rename to test1.gather_old.csv.

Note: Some of our lca_db functions and tests rely on LineagePair being a tuple of 2 items. I will make a corresponding issue w/ transition strategy if this PR is accepted. For now, I'm using lca_utils.LineagePair where necessary, but any of these in tax_utils.py will be cleaned up as part of my next PR in this tax update series. The exception is LineageDB reading, since this interacts with the LCA database code.

codecov · 2023-01-06T01:08:52Z

Codecov Report

Merging #2437 (48af504) into latest (14d79c9) will increase coverage by 0.13%.
The diff coverage is 96.73%.

@@            Coverage Diff             @@
##           latest    #2437      +/-   ##
==========================================
+ Coverage   84.15%   84.28%   +0.13%     
==========================================
  Files         132      132              
  Lines       15105    15285     +180     
  Branches     2416     2476      +60     
==========================================
+ Hits        12711    12883     +172     
- Misses       2097     2098       +1     
- Partials      297      304       +7

Flag	Coverage Δ
python	`92.31% <96.73%> (+0.05%)`	⬆️
rust	`57.72% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
src/sourmash/lca/lca_utils.py	`88.70% <ø> (ø)`
src/sourmash/tax/tax_utils.py	`97.59% <96.73%> (-0.53%)`	⬇️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

…ting tests

…w usage

bluegenes · 2023-01-06T17:33:50Z

@ctb @sourmash-bio/devs ready for review!

ctb · 2023-01-06T17:54:52Z

This looks good to me on a quick skim! I'd like to play with it a little bit before merging it, but if you set up a PR into this PR then when this PR is merged all will be well ;)

ctb

Still working my way through the code - want to dig into the tests next ;). Sorry for being slow!

src/sourmash/tax/tax_utils.py

…into upd-lineage-utils

tests/test_tax_utils.py

src/sourmash/tax/tax_utils.py

ctb · 2023-01-10T18:39:06Z

(sorry, may have just dropped a few new comments - I hadn't "finished" my review.)

bluegenes · 2023-01-10T19:02:23Z

no worries, take your time. Just figured I'd get started on the easy stuff!

Sorry to you (and me!) that this + these next PRs are a frustratingly large set of changes. I do think it'll be worth it in the end

ctb · 2023-01-10T19:42:57Z

Looks good! Suggest merging #2441 next ;).

) Note: PR into #2437 Wanted to try out the new code for myself! Co-authored-by: N. Tessa Pierce-Ward <ntpierce@gmail.com> Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>

## Taxonomy Refactor Overview In an attempt to allow usage of NCBI taxid (motivation: CAMI benchmarking) and alternate hierarchical taxonomic ranks (motivation: LINS), I ended up refactoring the taxonomy code in a four-PR series. Taxonomic summarization results should not change. Minor caveat: I was previously obtaining `query_bp` in a hacky manner to allow gather <4.4 results. The class methods are more robust, and I'd like to stop supporting gather <4.4 results. To allow this, I had to add the `query_bp`, `ksize`, and `scaled` columns into some testing results to keep tests functioning. 1. #2437 modifies `LineagePair` from a two-item `collections.namedtuple` to a three-item `typing.NamedTuple` containing an additional field, `taxid`, for storing NCBI taxid information. It also introduces classes (`BaseLineageInfo`, `RankLineageInfo`), which move lineage manipulation (from `lca_utils.py`) to class methods in order to support robust summarization across compatible lineages (lineages of same hierarchical ranks). To ensure these can be used as dictionary keys, these classes are frozen. 2. #2439 introduces classes that facilitate reading, summarization, and writing of gather results. First, it updates three prior `collections.namedtuple`s to `dataclasses` used for storing information about the gather query (`QueryInfo`), summarized gather information for metagenome queries (`SummarizedGatherResult`) and classification information for genome queries (`ClassificationResult`). It introduces three new classes for reading and manipulating gather results. `GatherRow`, is used for reading a each row from a gather file and automatically checking for required columns. `TaxResult` is used for storing a single row from gather file, optionally (and ideally) with taxonomic information, stored as `LineageInfo` class from PR 1. `QueryTaxResult` is used for storing all `TaxResult`s associated with a single query. `QueryTaxResult` add methods to replicate the summarization previously done within `summarize_gather_at` in `tax_utils.py` and the classification thresholding in `genome` within `__main__.py`. 3. #2443 replaces the actual taxonomic summarization code in `tax/__main__.py` with code that uses the new classes. Modifies gather loading code to read using `GatherRow`, `TaxResult`, and `QueryTaxResult`. 4. #2446 removes old, unused functions that are rendered redundant by the new classes. Also removes associated tests. ## Additional details for this PR (#2439) 1. Renamed existing `namedtuple`s. This renaming allows me to introduce modified `dataclasses` (below) with the same names without breaking functioning code. This is temporary, as these are removed in #2443. - `QueryInfo` --> `QueryInf` - `SummarizedGatherResult` -> `SumGathInf` - `ClassificationResult` --> `ClassInf` 2. Update these `namedtuple`s to dataclasses to allow additional functionality - `QueryInfo` - `SummarizedGatherResult` - `ClassificationResult` These contain several advantages over the prior namedtuples. In particular, dataclass post-initialization methods allows type checking and casting string--> float or int as appropriate. I also added methods to these dataclasses to estimate ANI and produce formatted dictionaries for each output format. This pulls formatting into one place, rather than independent output-writing functions. 3. Add dataclasses for reading and manipulating gather results. - `GatherRow`, for reading a each row from a gather file - innately checks that all required gather colnames are present when loading each gather csv - `TaxResult` for storing a single row from gather file - get and store `LineageInfo` taxonomic information directly with each row - `get_match_lineage` and `get_ident` are now methods - tracks whether lineage matching was attempted, including if the ident was `missed` or intentionally `skipped`. We don't actually allow skipping from the cli (yet?), but it's enabled in all methods to preserve existing functionality. I think I was using this to skip identical/exact database matches. - `QueryTaxResult` for storing all `TaxResult`s associated with a single query - only allows summarization over gather results from same query - more robust/simplified tracking of results with missed & skipped taxonomic identifiers + perfect matches - use `LineageInfo` classes (within `TaxResult`) for lineage tracking + summarization, to get all benefits (optional `taxid`, any hierarchical ranks, etc) - contains methods to replicate the summarization previously done within `summarize_gather_at` in `tax_utils.py` and the classification thresholding in `genome` within `__main__.py`. Co-authored-by: C. Titus Brown <titus@idyll.org>

## Taxonomy Refactor Overview In an attempt to allow usage of NCBI taxid (motivation: CAMI benchmarking) and alternate hierarchical taxonomic ranks (motivation: LINS), I ended up refactoring the taxonomy code in a four-PR series. Taxonomic summarization results should not change. Minor caveat: I was previously obtaining `query_bp` in a hacky manner to allow gather <4.4 results. The class methods are more robust, and I'd like to stop supporting gather <4.4 results. To allow this, I had to add the `query_bp`, `ksize`, and `scaled` columns into some testing results to keep tests functioning. 1. #2437 modifies `LineagePair` from a two-item `collections.namedtuple` to a three-item `typing.NamedTuple` containing an additional field, `taxid`, for storing NCBI taxid information. It also introduces classes (`BaseLineageInfo`, `RankLineageInfo`), which move lineage manipulation (from `lca_utils.py`) to class methods in order to support robust summarization across compatible lineages (lineages of same hierarchical ranks). To ensure these can be used as dictionary keys, these classes are frozen. 2. #2439 introduces classes that facilitate reading, summarization, and writing of gather results. First, it updates three prior `collections.namedtuple`s to `dataclasses` used for storing information about the gather query (`QueryInfo`), summarized gather information for metagenome queries (`SummarizedGatherResult`) and classification information for genome queries (`ClassificationResult`). It introduces three new classes for reading and manipulating gather results. `GatherRow`, is used for reading a each row from a gather file and automatically checking for required columns. `TaxResult` is used for storing a single row from gather file, optionally (and ideally) with taxonomic information, stored as `LineageInfo` class from PR 1. `QueryTaxResult` is used for storing all `TaxResult`s associated with a single query. `QueryTaxResult` add methods to replicate the summarization previously done within `summarize_gather_at` in `tax_utils.py` and the classification thresholding in `genome` within `__main__.py`. 3. #2443 replaces the actual taxonomic summarization code in `tax/__main__.py` with code that uses the new classes. Modifies gather loading code to read using `GatherRow`, `TaxResult`, and `QueryTaxResult`. 4. #2446 removes old, unused functions that are rendered redundant by the new classes. Also removes associated tests. ## Additional details for this PR (#2446) - Delete old functions that aren't used outside of taxonomic summarization + associated tests - Including old `namedtuple`s: `QueryInf`, `SumGathInf`, `ClassInf` - Make sure any old comments/documentation make it into new code - Don't use unnecessary empty `()` for dataclasses

# sourmash release 4.7.0 Major new features: * provide an initial plugin architecture for sourmash that supports new signature saving & loading mechanisms (#2428) * add plugin support for new command-line subcommands (#2438) * debias all containment values (#2243) Minor new features: * Use RankLineageInfo to simplify reading lineages (#2467) * store taxids in lineageDB (#2466) * Use new tax classes for taxonomic summarization (#2443) * add tax summarization dataclasses for safety and flexibility (#2439) * add `--scaled` to sourmash compare (#2414) * replace `lca_utils.LineagePair` with `tax_utils.LineagePair` (#2441) * Add new classes for lineage manipulation (#2437) Cleanup and documentation updates: * ReadTheDocs updates (#2445) * update `sourmash compare` command-line docs (#2400) Developer updates: * fix python tests by bumping tox and pip cache versions (#2424) * Update sphinx requirement from <6,>=4.4.0 to >=4.4.0,<7 (#2430) * Build: replace milksnake with maturin (#2393) * importlib_metadata is a dependency on old Python versions (#2484) * Release docs: use two separate sed commands (#2483) * minor fixes to release behavior (#2479) * Use screed and maturin from nixpkgs in `flake.nix` (#2481) * update release procedure after v4.6.0 and v4.6.1 (#2386) * Update makefile and docs (#2432) Dependabot updates: * Bump once_cell from 1.17.0 to 1.17.1 (#2488) * Bump ouroboros from 0.15.5 to 0.15.6 (#2487) * Bump memmap2 from 0.5.8 to 0.5.9 (#2486) * Bump supercharge/redis-github-action from 1.4.0 to 1.5.0 (#2485) * Bump proptest from 1.0.0 to 1.1.0 (#2460) * Bump web-sys from 0.3.60 to 0.3.61 (#2461) * Bump serde_json from 1.0.91 to 1.0.93 (#2471) * Bump wasm-bindgen-test from 0.3.33 to 0.3.34 (#2463) * Bump cachix/install-nix-action from 18 to 19 (#2459) * Bump wasm-bindgen from 0.2.83 to 0.2.84 (#2464) * Bump typed-builder from 0.11.0 to 0.12.0 (#2451) * Bump bumpalo from 3.9.1 to 3.12.0 (#2450) * Bump pypa/cibuildwheel from 2.11.4 to 2.12.0 (#2447) * Bump bzip2 from 0.4.3 to 0.4.4 (#2444) * Bump once_cell from 1.14.0 to 1.17.0 (#2429) * Bump serde from 1.0.151 to 1.0.152 (#2423) * Bump pypa/cibuildwheel from 2.11.3 to 2.11.4 (#2422) * Bump serde_json from 1.0.89 to 1.0.91 (#2418) * Bump serde from 1.0.150 to 1.0.151 (#2419) * Bump thiserror from 1.0.37 to 1.0.38 (#2417) * Bump finch from 0.4.3 to 0.5.0 (#2416) * Bump rayon from 1.6.0 to 1.6.1 (#2404) * Bump serde from 1.0.149 to 1.0.150 (#2403) * Bump pypa/cibuildwheel from 2.11.2 to 2.11.3 (#2402) * Bump serde from 1.0.148 to 1.0.149 (#2397) * Bump capnp from 0.14.5 to 0.14.11 (#2396)

) ## Add taxonomic utilities for LINs; enable and test `tax metagenome` With taxonomy refactoring (#2437, #2439, #2443, #2446, #2466, #2467), we are (mostly) no longer tied to named ranks. Here, I add a class for LIN taxonomies and use it within `tax metagenome` to allow summarization up LINs and reporting at specified `lingroups`. With this PR, users can now use the flag `--lins` to read and use `lin` taxonomies from the provided tax (`-t`, `--taxonomy`) file. If used, `sourmash tax` will look for a `lin` column in the taxonomy file instead of looking for `superkingdom`...`strain` columns. The `lin` column should contain `;`-separated LINs, preferably with a standard number of positions (e.g. all 20 positions in length or all 10 positions in length). For `tax metagenome`: By default, `tax metagenome` will summarize up _all_ available ranks/LIN positions. If a `lingroup` file is provided, we will also report a subset of this summary: just the LIN prefixes that match groups in the `lingroup` file. The `lingroup` file requires two columns in any order: `name`, the name of the group, and `lin`, the lin prefix of the group. The prefix will be used to select results from the full summary for reporting. The `lingroup` format will build a file with the following name: `{base}.lingroup.tsv`, where `{base}` is the name provided via the `-o`,` --output-base` option. ## Demo / Tutorial A draft tutorial is available [here](https://sourmash--2469.org.readthedocs.build/en/2469/tutorial-lin-taxonomy.html). Note that it does not contain the installation info for this branch (see below). You can run the interactive version via binder [here](https://mybinder.org/v2/gh/bluegenes/2023-demo-sourmash-LIN/HEAD?labpath=sourmash-lin-demo.ipynb) ## Testing ### Option A: Use the Demo Binder You can test via the [binder](https://mybinder.org/v2/gh/bluegenes/2023-demo-sourmash-LIN/HEAD?labpath=sourmash-lin-demo.ipynb). You can add new cells or modify any existing cells, and even download additional files for testing. The downside is that you'll have to make sure to download and save your results, since the binder won't save them for you. ### Option B: Alternatively, install on your own computer/cluster: Here is one way to test this code before it gets fully integrated into sourmash: - If you don't have conda, I'd recommend installing `mamba`, [instructions here](https://mamba.readthedocs.io/en/latest/installation.html) instead. - if you do have `mamba`, replace the word `conda` with `mamba` in the following commands. Download an environment file that points to this branch: ``` curl -JLO https://github.com/bluegenes/2023-demo-sourmash-LIN/main/sourmashLIN.yml ``` Create a virtual environment using this file: ``` conda env create -f sourmashLIN.yml ``` Activate that environment: ``` conda activate smashLIN ``` make sure `--lins` is in the `--help` for `sourmash tax metagenome`: ``` sourmash tax metagenome --help ``` ## Command to run The command to run is this one: ``` sourmash tax metagenome -g $gather_csv -t $taxonomy_csv \ --lins --lingroup $lingroups_csv ``` ## Types of files you'll need 1. sketches of query metagenome 2. sketches of reference genomes (database) 3. taxonomy file with LIN information (two columns required: `ident`, `lin`) 4. lingroup information file (two columns required: `name`, `lin`) To exit the environment when you're done testing, use `conda deactivate` > Reminder, if you have `mamba`, you can use it in place of `conda` in the commands above. example `lingroup` output format. Note that the `1;0`.. paths are always grouped together, but may come before or after the `0;0` and `2;0` groups. ``` name lin percent_containment num_bp_contained lg3 2;0;0 1.56 192000 lg1 0;0;0 5.82 714000 lg2 1;0;0 5.05 620000 lg3 1;0;1 0.65 80000 lg4 1;0;1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0 0.65 80000 ``` ``` name lin percent_containment num_bp_contained lg2 1;0;0 5.05 620000 lg3 1;0;1 0.65 80000 lg4 1;0;1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0 0.65 80000 lg1 0;0;0 5.82 714000 lg3 2;0;0 1.56 192000 ``` ## A few implementation details: - In `tax_utils.py`, I add a `LINLineageInfo` class for using and manipulated LIN taxonomies. It implements new methods to enable specifically reading in `LIN` taxonomies into the class, but otherwise uses the taxonomic utilities available in `BaseLineageInfo`, e.g. taxonomic summarization up ranks, assessing whether two taxonomies are a match at a given rank. - In `tax_utils.py`, I add functionality for reading `lingroup` information and reporting taxonomic summarization specifically at these ranks. Changes and Additions: - [x] Add `LINLineageInfo` for working with `LIN` taxonomies - [x] Add method for reading `LIN`s into `LineageDB` - [x] Add methods for reading `LINgroups` and summarizing to these - [x] Add LineageTree that can use `LineageInfo` to perform `build_tree`, `find_lca` functions (originally in `lca_utils.py`) and produce an ordered list of lineage paths - [x] Add code + tests to use `LIN`s taxonomy in: - [x] tax metagenome - [x] tax annotate - [x] tax summarize The following require additional changes and will be punted to an issue/separate PR (see #2499): - tax genome - tax prepare - tax grep --------- Co-authored-by: C. Titus Brown <titus@idyll.org>

bluegenes and others added 5 commits January 5, 2023 15:43

init new LineagePair and LineageInfo classes

b1ed900

test new LineagePair,BaseLineageInfo,RankLineageInfo

5aa5714

fix

ea7a163

add v4.4 columns that will be required

23e2480

Merge branch 'latest' into upd-lineage-utils

331336b

bluegenes added 4 commits January 5, 2023 17:54

make the newly added query_bp info in test1.gather.csv work with exis…

f458dc1

…ting tests

rename test1.gather_ani.csv to test1.gather_old.csv to reflect its ne…

592993d

…w usage

moar lineage tests

06fb848

test remaining codecov misses

ed055b6

bluegenes and others added 2 commits January 6, 2023 09:58

whoops, one last codecov miss

c15ff78

Merge branch 'latest' into upd-lineage-utils

960c728

ctb reviewed Jan 10, 2023

View reviewed changes

ctb mentioned this pull request Jan 10, 2023

MRG: replace lca_utils.LineagePair with tax_utils.LineagePair #2441

Merged

bluegenes added 4 commits January 10, 2023 10:11

fix spacing and quotes per review

74fc3cc

cleaner ranks spec for RankLineageInfo

354b4bb

Merge branch 'upd-lineage-utils' of github.com:sourmash-bio/sourmash …

c11e46f

…into upd-lineage-utils

use None for name, taxid when lineage unspecifed

bca598d

ctb reviewed Jan 10, 2023

View reviewed changes

tests/test_tax_utils.py Show resolved Hide resolved

src/sourmash/tax/tax_utils.py Outdated Show resolved Hide resolved

src/sourmash/tax/tax_utils.py Show resolved Hide resolved

add better descriptions

48af504

bluegenes mentioned this pull request Jan 10, 2023

add tax summarization dataclasses for safety and flexibility #2439

Merged

ctb approved these changes Jan 10, 2023

View reviewed changes

bluegenes merged commit 474e4b3 into latest Jan 10, 2023

bluegenes deleted the upd-lineage-utils branch January 10, 2023 20:27

This was referenced Jan 16, 2023

Use new tax classes for taxonomic summarization #2443

Merged

cleanup taxonomy code after refactor #2446

Merged

bluegenes mentioned this pull request Feb 17, 2023

MRG: Add taxonomic utilities for LINs and enable tax metagenome #2469

Merged

8 tasks

ctb mentioned this pull request Mar 3, 2023

4.7.0 release #2497

Merged

bluegenes mentioned this pull request May 4, 2023

EXP: work with RankLineageInfo and sourmash tax #2603

Open

4 tasks

ctb mentioned this pull request Oct 29, 2023

MRG: update sourmash and snakemake dependencies dib-lab/charcoal#239

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add new classes for lineage manipulation #2437

Add new classes for lineage manipulation #2437

bluegenes commented Jan 6, 2023 •

edited

Loading

codecov bot commented Jan 6, 2023 •

edited

Loading

bluegenes commented Jan 6, 2023

ctb commented Jan 6, 2023

ctb left a comment

ctb commented Jan 10, 2023

bluegenes commented Jan 10, 2023

ctb commented Jan 10, 2023

Add new classes for lineage manipulation #2437

Add new classes for lineage manipulation #2437

Conversation

bluegenes commented Jan 6, 2023 • edited Loading

codecov bot commented Jan 6, 2023 • edited Loading

Codecov Report

bluegenes commented Jan 6, 2023

ctb commented Jan 6, 2023

ctb left a comment

Choose a reason for hiding this comment

ctb commented Jan 10, 2023

bluegenes commented Jan 10, 2023

ctb commented Jan 10, 2023

bluegenes commented Jan 6, 2023 •

edited

Loading

codecov bot commented Jan 6, 2023 •

edited

Loading