Parallel execution of MSA tools #399
Conversation
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). For more information, open the CLA check for this pull request.
How is the number of cores for the hhblits step determined? The example shows the three parallel steps using 44 cores in total on a 28-core machine: 8 cores for each of the two jackhmmer steps and 28 cores for the hhblits step. Or are there 44+ cores available on the machine in the example?
The number of available physical cores on this machine is 28, and the number of logical cores (the number of cores visible to the OS) is 56 due to Hyper-Threading. Since the total of 44 cores is below the 56 logical cores, there is no significant effect on execution or performance. (Even oversubscribing, for example running hhblits with 128 cores, will rarely make any difference in performance.)

What matters in determining the optimal number of hhblits cores is the I/O performance of the storage where the BFD is stored. If the storage is an HDD, the bottleneck will probably be I/O performance, and increasing the number of cores will not improve the execution time. If the storage is an SSD, it may be worthwhile to increase the number of cores up to, but not beyond, the number of physical cores. In this example, the reason for increasing the number of cores is that the storage uses the Lustre file system, whose I/O performance improves as the number of parallel cores increases; with this file system it makes sense to increase the parallelization even if the physical storage is HDD. The number of cores used here is 28, derived from the number of physical cores, but there was no difference in performance from about 16 upward.

If this PR improves the execution time beyond what simply changing the number of cores would give, the following factors can be considered. If you are using HDDs and the execution time of hhblits is extremely slow, you may want to move the file containing
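The tuning rule described above (on SSD-backed storage, raise the hhblits core count only up to the number of physical cores) can be sketched as a small helper. This is an illustration of the heuristic, not code from this PR; the function name and arguments are assumptions.

```python
def choose_hhblits_cpus(physical_cores: int, requested: int) -> int:
    # Heuristic from the discussion above: going beyond the physical
    # core count rarely helps, so cap the request there.
    return min(requested, physical_cores)

# With 28 physical cores, a request for 44 cores is capped at 28.
print(choose_hhblits_cpus(28, 44))  # → 28
```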
When I set n_parallel_msa=3 and run multimer, I found it starts Jackhmmer (uniref90), Jackhmmer (mgy_clusters_2018_12), and HHblits at the same time. After all of those are done, it starts Jackhmmer (uniprot).
@xlminfei I am sorry, but the current PR cannot run Jackhmmer (uniprot) simultaneously with the others.
Hi, When your mods are run on T1050, the size of the uniref90_hits.sto file increases from 75 MB to 1.16 GB. The size of the mgnify_hits.sto increases from 3.6 MB to 1.9 GB. The other two MSA files are very similar in size to the original AF implementation (whatever the latest version). The final models are similar in both cases. What is going on? Why does concurrent execution of the MSA tools result in such a massive increase in the size of the MSA files? Thanks, Petr
Hi, After some additional testing, this parallel implementation works very well. On short to medium-sized sequences the speedup of the complete run is up to 25%. Very impressive! Thank you @fuji8 for this implementation! I hope this will be incorporated into the main branch soon. Petr
Hi, I am working on this MSA parallel issue and found your PR. You may consider my fork for this implementation; this modification has successfully passed both monomer and heterodimer modeling.
However, I'm sure the parallel execution requires the extremely high read speeds provided by an SSD.
Hi @fuji8, Would you have time to modify the latest release of AF (2.3) to make the tools run concurrently? Your v2.2 implementation was much faster than the official distro. It saved all of us a lot of time... Disappointingly, your contribution wasn't implemented in the new release. I hope it won't take too much of your time and effort to convert the v2.2 implementation to v2.3. Thanks so much! Petr
```diff
@@ -124,7 +126,8 @@ def __init__(self,
                use_small_bfd: bool,
                mgnify_max_hits: int = 501,
                uniref_max_hits: int = 10000,
-               use_precomputed_msas: bool = False):
+               use_precomputed_msas: bool = False,
+               n_parallel_msa: int = 1):
```
It would be helpful to add a comment here about the number of logical threads involved with a non-default value of n_parallel_msa. Alternatively, if the number of available cores could be supplied, a function could be devised to adjust both this and the n_cpu variables in the tools folder to optimize the run.
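One simple form of the function suggested here would split the available cores evenly among the concurrently running MSA tools. This is only a sketch of the reviewer's idea; the function name is hypothetical and it ignores the per-tool I/O considerations discussed earlier in the thread.

```python
def split_cpus(total_cpus: int, n_tools: int) -> list:
    # Even split of cores across tools, with the remainder
    # spread over the first few tools so the counts sum exactly.
    base, rem = divmod(total_cpus, n_tools)
    return [base + 1 if i < rem else base for i in range(n_tools)]

# The machine in this thread has 28 physical cores and runs 3 tools.
print(split_cpus(28, 3))  # → [10, 9, 9]
```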
Sorry, I made a mistake and temporarily closed the PR. @Phage-structure-geek
As for this one, I think it's because I forgot to include the change below. It has already been fixed, so the file size difference should disappear.
Hi @fuji8
Hi, I'm sorry for the general question, but how can I install this PR on my PC? I have a workstation with 32 cores and 4 P100 GPUs, and I would like to use all resources to model 5000 proteins. Is it better to use this PR or the standard AlphaFold?
Replace the three files containing the changes (see the "Files changed" tab at the top) in your AF distro and rebuild the docker image. Note that the speedup is substantial, but jackhmmer still often uses only two threads, no matter how many threads you configured it with. It is nevertheless so much more satisfying to see jackhmmer and hhblits run in parallel, with at least hhblits using a good amount of CPU. Petr
Based on the dependencies among the data, the MSA tool execution is divided into three parts (Jackhmmer on uniref90, Jackhmmer on mgnify, and HHblits on BFD), and the tools are called asynchronously to execute them in parallel. Execution time may be reduced if sufficient CPU, memory, and I/O performance are available.

The implementation uses `concurrent.futures.ThreadPoolExecutor`, and `max_workers` can be specified with the `--n_parallel_msa` flag. If `--n_parallel_msa` is 1, the execution is not parallelized. A value of 3 is the maximum and potentially the fastest.

Example
The following is a partial log of a T1041 run with 28 CPU cores and 235 GB of RAM.
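The asynchronous dispatch pattern described in this PR can be sketched as follows. This is a minimal illustration of submitting three independent MSA tool runs to a `ThreadPoolExecutor`, not the actual AlphaFold pipeline code; the runner functions are placeholders standing in for the real jackhmmer and hhblits calls.

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder runners; in the real pipeline each would invoke an
# external MSA tool and return its hits.
def run_jackhmmer_uniref90():
    return "uniref90_hits"

def run_jackhmmer_mgnify():
    return "mgnify_hits"

def run_hhblits_bfd():
    return "bfd_hits"

def run_msa_tools(n_parallel_msa: int = 1):
    tasks = [run_jackhmmer_uniref90, run_jackhmmer_mgnify, run_hhblits_bfd]
    # n_parallel_msa == 1 degenerates to sequential execution;
    # 3 lets all three independent tools run concurrently.
    with ThreadPoolExecutor(max_workers=n_parallel_msa) as executor:
        futures = [executor.submit(task) for task in tasks]
        # Results are collected in submission order.
        return [f.result() for f in futures]

print(run_msa_tools(n_parallel_msa=3))
```

With `n_parallel_msa=3` the three tool processes start at the same time, which matches the behavior reported earlier in this thread.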