Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix/issue 56 #57

Merged
merged 19 commits into from
Jun 27, 2023
Merged

Fix/issue 56 #57

merged 19 commits into from
Jun 27, 2023

Conversation

mmolari
Copy link
Collaborator

@mmolari mmolari commented Jun 19, 2023

Fixes issue 56, due to an edge-case in the block-merging procedure. It also:

  • makes pangraph deterministic, by introducing unique random seeds for each graph mergers (fixes random names of blocks) and fixing the order of mutations/insertions/deletions using ordered dictionaries instead of simple dictionaries
  • introduces the -v flag. If activated consistency checks are performed at each merger, and it is verified that all of the input sequences can be exactly reconstructed from the graph.

@mmolari mmolari linked an issue Jun 19, 2023 that may be closed by this pull request
@vercel
Copy link

vercel bot commented Jun 19, 2023

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Updated (UTC)
pangraph ❌ Failed (Inspect) Jun 19, 2023 2:52pm

@vercel vercel bot temporarily deployed to Preview June 19, 2023 14:52 Inactive
@ivan-aksamentov
Copy link
Member

ivan-aksamentov commented Jun 23, 2023

@mmolari Just a few thoughts (perhaps you've already considered some of these):

The -v flag is most commonly used in command-line interfaces to show the version of the executable (alias for --version), or sometimes to request verbose mode (so that the tool prints more output; alias for --verbose). In order to avoid breaking of the public interface of the software in the future, we should probably consider another letter for the argument requesting checks. Maybe --consistency-checks and -C would work better?

It is also customary to expose random seeds as CLI arguments (or as function parameters, in libraries). This way you don't need to invent any internal magic, such as deducing seeds from input data or from current weather, and users requiring deterministic, reproducible results could simply provide the same value through CLI (e.g. --random-seed=12345) consistently for each run.

From my experience, in order to achieve reproducible results, it is essential for multi-threaded applications (where thread scheduling is determined by the operating system and is outside of control of the application) to either ensure consistent ordering when joining results (by e.g. sorting them according to the order in the original inputs) or also by allowing users to opt-out from multi-threading (with something like --jobs=1).

If you opt for sorting or similar technique, it sometimes also makes sense to allow the user to toggle the sorting on and off, with something like --ensure-order (speed vs accuracy trade-off).

In general, offloading decisions to the user often makes code simpler and reduces the feeling of magic when using a piece of software, let alone that it reduces amount of work for you :). Also, advanced users often appreciate more control.

Although this is not without downsides - CLI args is a public interface to maintain forever, and when there's many flags they may make the interface more difficult to use for beginners (or to use it correctly). So sane defaults are very important.

@mmolari
Copy link
Collaborator Author

mmolari commented Jun 24, 2023

Thanks @ivan-aksamentov! All of these suggestions are much appreciated.

Concerning the flag, I originally chose -v because I wanted it to be sort of a verbose mode, in which other than performing consistency checks also some more explicit logging of the process would be performed. But you are right, it's better to separate the two. I changed it to -t and --test, since -c was already taken.

For the random seeding, I was thinking of setting always the same random seed, and not give the user the option to control it, since the only thing that this seed will control is the random name of blocks. It should not impact the graph structure in any other way. And I do the seeding in a way that is robust to parallelization. Irrespective of number of threads and scheduling of operations, the results are always the same and saved in the same order. For these reasons I was thinking of setting a standard random seed in the code and not exposing any interface to change it. But do you think it's better to still give the user the option?

@mmolari mmolari merged commit 54c5019 into master Jun 27, 2023
1 check passed
@mmolari mmolari deleted the fix/issue-56 branch June 27, 2023 15:02
@mmolari
Copy link
Collaborator Author

mmolari commented Jun 27, 2023

This pull request:

  • fixes Inaccurate output for Klebsiella pneumonia dataset #56
  • makes the build command deterministic. The -r option can be used to set a random seed.
  • build and merge commands have a -v flag. When set the graph undergoes optional consistency checks.
  • fasta input files are checked for duplicated records, and white lines between records are tolerated

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Inaccurate output for Klebsiella pneumonia dataset
2 participants