wfmash step to speed up #205

Open
OZTaekOppa opened this issue Aug 1, 2024 · 6 comments
Labels
enhancement Improvement for existing functionality

Comments

@OZTaekOppa

OZTaekOppa commented Aug 1, 2024

Description of feature

Dear nf-core & pangenome team,

I have a few questions about your great program.

Based on the link (https://github.com/nf-core/pangenome/blob/1.0.0/modules/nf-core/wfmash/main.nf), it appears that wfmash performs all-vs-all alignment on a single node.

wfmash \\
    ${fasta_gz} \\
    $query \\
    $query_list \\
    --threads $task.cpus \\
    $paf_mappings \\
    $args > ${prefix}.paf

From my trials, this is indeed the case.

I am trying to speed up the wfmash step by running parallel jobs across multiple nodes (PBSpro). My idea is to run one one-vs-all alignment per node, each taken from the full input genome dataset (120 human pangenomes), and then merge the results into a single PAF file for further analysis.

  1. Do you have any recommendations for tweaking the wfmash code to achieve this?
  2. If I run one-vs-all alignments on each node, will the merged paf file be equivalent to an all-vs-all alignment? Theoretically, I assume the final outcome should be the same.

Looking forward to your insights.

Kind regards,

Taek

@OZTaekOppa OZTaekOppa added the enhancement Improvement for existing functionality label Aug 1, 2024
@subwaystation
Collaborator

Dear @OZTaekOppa,

By default, wfmash indeed uses only one node. However, the --wfmash_chunks parameter (https://nf-co.re/pangenome/1.1.2/parameters/#wfmash_chunks) allows nf-core/pangenome to scale the all-vs-all base-pair-level alignments across the nodes of a cluster. This was also extensively evaluated in https://www.biorxiv.org/content/10.1101/2024.05.13.593871v1.

To be clear about what happens when --wfmash_chunks > 1:

  1. wfmash is run in approximate mapping mode, which finds sequence homologies as determined by the given wfmash parameters.
    WFMASH_MAP(ch_wfmash_map,
  2. The resulting PAF is split into chunks of equal alignment problem size; the number of chunks is given by --wfmash_chunks.
    SPLIT_APPROX_MAPPINGS_IN_CHUNKS(WFMASH_MAP.out.paf)
  3. For each chunked PAF, wfmash can then be run in base-pair-level alignment mode, in parallel across the nodes of a cluster.
    WFMASH_ALIGN(ch_wfmash_align,
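
The chunking in step 2 can be illustrated with a minimal Python sketch. This is not the script the pipeline uses: the cost model (query interval length per mapping) and the greedy balancing strategy are assumptions made for illustration, and the names `mapping_cost` and `split_into_chunks` are hypothetical.

```python
import heapq
from typing import List


def mapping_cost(paf_line: str) -> int:
    """Assumed cost of one mapping: the length of its query interval."""
    fields = paf_line.rstrip("\n").split("\t")
    # PAF columns 3 and 4 (0-based indices 2 and 3) hold query start and end.
    return int(fields[3]) - int(fields[2])


def split_into_chunks(paf_lines: List[str], n_chunks: int) -> List[List[str]]:
    """Greedy balancing: place the largest mappings first, each into the
    chunk that currently has the smallest total cost."""
    chunks: List[List[str]] = [[] for _ in range(n_chunks)]
    heap = [(0, i) for i in range(n_chunks)]  # (total cost, chunk index)
    heapq.heapify(heap)
    for line in sorted(paf_lines, key=mapping_cost, reverse=True):
        total, idx = heapq.heappop(heap)
        chunks[idx].append(line)
        heapq.heappush(heap, (total + mapping_cost(line), idx))
    return chunks
```

Each resulting chunk would then be written to its own PAF file and handed to a separate alignment job, so that no single node receives a disproportionate share of the work.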

I hope this answers your question!

@subwaystation
Collaborator

I didn't test it for one vs. all, but it should work out the same way.

@subwaystation
Collaborator

This question is also discussed at pangenome/pggb#403.

@OZTaekOppa
Author

Hi @subwaystation,

Thank you for your prompt reply.
I will get back to you after testing your suggestion.

Cheers,

Taek

@OZTaekOppa
Author

Hi @subwaystation,

The current single-node approach requires significant RAM, CPUs, and extended walltime. The HPC team is exploring alternative solutions to run parallel jobs across multiple nodes.

From testing a small dataset, both the all-vs-all and one-vs-all approaches produced the same outcome. Currently, I am working with the team to optimize the partition and PGGB steps for Nextflow.
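
A comparison of this kind could be scripted roughly as follows. This is a sketch under stated assumptions, not the procedure used in the test above: it treats two PAF outputs as equivalent when their records match on the 12 mandatory PAF columns, ignoring record order and optional tags, and it merges the one-vs-all outputs by plain concatenation. The helper names `paf_key`, `merge_paf`, and `same_records` are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def paf_key(line: str) -> Tuple[str, ...]:
    """Key a record by its 12 mandatory PAF columns (optional tags ignored)."""
    return tuple(line.rstrip("\n").split("\t")[:12])


def merge_paf(parts: Iterable[Iterable[str]]) -> List[str]:
    """Merge one-vs-all outputs by concatenating their non-empty lines."""
    return [line for part in parts for line in part if line.strip()]


def same_records(a: Iterable[str], b: Iterable[str]) -> bool:
    """True if both inputs hold the same multiset of records, in any order."""
    return Counter(map(paf_key, a)) == Counter(map(paf_key, b))
```

In practice, each element of `parts` could be an open file handle for one per-genome PAF, and `same_records(merge_paf(parts), all_vs_all_lines)` would report whether the merged result matches the single-node run under the assumptions above.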

Cheers,

Taek

@subwaystation
Collaborator

I am a little bit confused. There is an option to directly run wfmash across several nodes, as stated above.
Did you try this one?

Otherwise, I am curious how your plans will turn out :)
