Fix/issue 56 #57

mmolari · 2023-06-19T14:50:21Z

Fixes issue 56, which was caused by an edge case in the block-merging procedure. It also:

makes pangraph deterministic, by introducing a unique random seed for each graph merger (which fixes the random names of blocks) and by fixing the order of mutations/insertions/deletions using ordered dictionaries instead of plain dictionaries
introduces the -v flag. If activated, consistency checks are performed at each merger, and it is verified that all of the input sequences can be exactly reconstructed from the graph.
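
To illustrate what "exactly reconstructed from the graph" can mean, here is a minimal sketch in Python (used only for illustration; it is not pangraph's code). It assumes a toy graph in which each block stores a single sequence and each genome is a path of (block name, strand) pairs. The names `reconstruct` and `verify_graph` are hypothetical, and the sketch ignores per-genome mutations, insertions and deletions.

```python
# Toy consistency check: every input sequence must be recoverable from its path.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def reconstruct(blocks: dict, path: list) -> str:
    """Concatenate block sequences along a path of (name, is_forward) pairs."""
    parts = []
    for name, is_forward in path:
        seq = blocks[name]
        parts.append(seq if is_forward else reverse_complement(seq))
    return "".join(parts)


def verify_graph(blocks: dict, paths: dict, inputs: dict) -> list:
    """Return the names of inputs that their path fails to reproduce exactly."""
    return [
        name
        for name, seq in inputs.items()
        if reconstruct(blocks, paths[name]) != seq
    ]


blocks = {"B1": "ACGT", "B2": "GGA"}
paths = {
    "genome_a": [("B1", True), ("B2", True)],
    "genome_b": [("B2", False), ("B1", True)],
}
inputs = {"genome_a": "ACGTGGA", "genome_b": "TCCACGT"}

failed = verify_graph(blocks, paths, inputs)
print("inconsistent genomes:", failed)
```

Running such a check after every merger is more expensive than running it once at the end, but it points at the merger that introduced the inconsistency.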

vercel · 2023-06-19T14:50:24Z

The latest updates on your projects. Learn more about Vercel for Git

Name	Status	Preview	Updated (UTC)
pangraph	❌ Failed (Inspect)		Jun 19, 2023 2:52pm

ivan-aksamentov · 2023-06-23T17:04:23Z

@mmolari Just a few thoughts (perhaps you've already considered some of these):

The -v flag is most commonly used in command-line interfaces to show the version of the executable (alias for --version), or sometimes to request verbose mode (so that the tool prints more output; alias for --verbose). In order to avoid breaking the public interface of the software in the future, we should probably consider another letter for the argument requesting checks. Maybe --consistency-checks and -C would work better?

It is also customary to expose random seeds as CLI arguments (or as function parameters, in libraries). This way you don't need to invent any internal magic, such as deducing seeds from input data or from current weather, and users requiring deterministic, reproducible results could simply provide the same value through CLI (e.g. --random-seed=12345) consistently for each run.
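
As a sketch of the two suggestions above, this is how the flags could look with Python's standard `argparse` module. The program name `example-tool`, the default seed of 0, and the helper `random_block_name` are made up for illustration; this is not pangraph's actual interface.

```python
import argparse
import random
import string


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with an explicit check flag and a user-supplied seed."""
    parser = argparse.ArgumentParser(prog="example-tool")
    parser.add_argument(
        "-C", "--consistency-checks",
        action="store_true",
        help="run consistency checks (kept separate from -v/--verbose/--version)",
    )
    parser.add_argument(
        "--random-seed",
        type=int,
        default=0,  # assumed default; a fixed value keeps runs reproducible
        help="seed for the random number generator",
    )
    return parser


def random_block_name(rng: random.Random, length: int = 10) -> str:
    """Draw a random upper-case name from the given generator."""
    return "".join(rng.choice(string.ascii_uppercase) for _ in range(length))


args = build_parser().parse_args(["--random-seed=12345", "-C"])
rng = random.Random(args.random_seed)
print(args.consistency_checks, args.random_seed, random_block_name(rng))
```

Passing the same `--random-seed` value on every run gives the same sequence of names, without the tool having to derive a seed from its inputs.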

From my experience, in order to achieve reproducible results, it is essential for multi-threaded applications (where thread scheduling is determined by the operating system and is outside the control of the application) either to ensure consistent ordering when joining results (e.g. by sorting them according to the order of the original inputs) or to allow users to opt out of multi-threading (with something like --jobs=1).
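
A minimal sketch of the first option, in Python with the standard `concurrent.futures` module: results are collected in whatever order the threads finish, then sorted back into input order before joining. The function `process` is a stand-in for real work, and `jobs` mirrors the hypothetical `--jobs` option.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process(record: str) -> str:
    """Stand-in for real per-record work."""
    return record.upper()


def run_ordered(records: list, jobs: int = 4) -> list:
    """Process records in parallel and return results in input order."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # Remember the input index of every submitted task.
        futures = {pool.submit(process, rec): i for i, rec in enumerate(records)}
        # Completion order depends on scheduling and is not reproducible.
        indexed = [(futures[f], f.result()) for f in as_completed(futures)]
    # Sorting by input index removes the dependence on scheduling.
    indexed.sort(key=lambda pair: pair[0])
    return [result for _, result in indexed]


print(run_ordered(["a", "b", "c"], jobs=4))
```

With `jobs=1` the same function runs effectively single-threaded, which covers the opt-out case with no extra code path.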

If you opt for sorting or a similar technique, it sometimes also makes sense to allow the user to toggle the sorting on and off, with something like --ensure-order (speed vs accuracy trade-off).

In general, offloading decisions to the user often makes code simpler and reduces the feeling of magic when using a piece of software, not to mention that it reduces the amount of work for you :). Also, advanced users often appreciate more control.

This is not without downsides, though: CLI args are a public interface to maintain forever, and when there are many flags they can make the interface more difficult for beginners to use (or to use correctly). So sane defaults are very important.

mmolari · 2023-06-24T11:52:36Z

Thanks @ivan-aksamentov! All of these suggestions are much appreciated.

Concerning the flag, I originally chose -v because I wanted it to be a sort of verbose mode which, in addition to performing consistency checks, would also log the process more explicitly. But you are right, it's better to separate the two. I changed it to -t and --test, since -c was already taken.

For the random seeding, I was thinking of always setting the same random seed and not giving the user the option to control it, since the only thing this seed controls is the random names of blocks. It should not impact the graph structure in any other way. I also do the seeding in a way that is robust to parallelization: irrespective of the number of threads and the scheduling of operations, the results are always the same and are saved in the same order. For these reasons I was thinking of setting a standard random seed in the code and not exposing any interface to change it. But do you think it's better to still give the user the option?
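
One common way to make seeding independent of thread scheduling is to give every task its own generator, seeded from a base seed plus a stable task identifier. The Python sketch below shows that idea only; it is an assumption about one possible approach, not a description of how this PR implements it, and `task_rng`, `block_name` and the `merge-N` identifiers are hypothetical.

```python
import hashlib
import random
import string


def task_rng(base_seed: int, task_id: str) -> random.Random:
    """Derive an independent generator from the base seed and a stable task id."""
    digest = hashlib.sha256(f"{base_seed}:{task_id}".encode()).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


def block_name(base_seed: int, task_id: str, length: int = 10) -> str:
    """Name generated for a task; depends only on (base_seed, task_id)."""
    rng = task_rng(base_seed, task_id)
    return "".join(rng.choice(string.ascii_uppercase) for _ in range(length))


merges = ["merge-0", "merge-1", "merge-2"]
# Simulate two different execution orders of the same set of tasks.
in_order = {m: block_name(0, m) for m in merges}
reordered = {m: block_name(0, m) for m in reversed(merges)}
print(in_order == reordered)
```

Because no generator is shared between tasks, the name a task produces does not depend on which tasks ran before it or on how many threads were used.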

mmolari · 2023-06-27T15:09:51Z

This pull request:

fixes Inaccurate output for Klebsiella pneumonia dataset #56
makes the build command deterministic. The -r option can be used to set a random seed.
adds a -v flag to the build and merge commands. When set, the graph undergoes optional consistency checks.
checks fasta input files for duplicated records, and tolerates blank lines between records
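
The fasta behaviour in the last point can be sketched as follows. This is illustrative Python, not the code in this PR; `read_fasta` is a hypothetical helper, and the sketch assumes the record identifier is the first whitespace-delimited word of the header line.

```python
def read_fasta(text: str) -> dict:
    """Parse fasta text into {name: sequence}, skipping blank lines
    and rejecting duplicated record names."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines between (or inside) records are tolerated
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("empty fasta header")
            name = fields[0]
            if name in records:
                raise ValueError(f"duplicated record: {name}")
            records[name] = []
        elif name is None:
            raise ValueError("sequence data before the first header")
        else:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


example = ">a first record\nAC\n\nGT\n\n>b\nTT\n"
print(read_fasta(example))
```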

mmolari added 7 commits June 7, 2023 18:18

feat: deterministic block name and pair ordering

9f065a1

feat: deterministic graph construction

450e987

fix: ins-map consistency

ed8366c

fix for missing insertions

af5f804

chore: modified changelog

95a984a

feat: added flag to enforce consistency checks

e4e4fa8

feat: activate consistency check at trace

f7c0ae2

mmolari linked an issue Jun 19, 2023 that may be closed by this pull request

Inaccurate output for Klebsiella pneumonia dataset #56

Closed

chore: updated version and changelog

10ba8e6

vercel bot temporarily deployed to Preview June 19, 2023 14:52 Inactive

mmolari added 6 commits June 19, 2023 17:04

chore: added -v flag to cli-tests

59112a5

feat: OrderedDict in unmarshall

a2fe79e

feat: implement deep copy for graphs

88b1411

feat: added consistency check in marginalize

a9439fb

fix: copy -> copy_graph

067789f

chore: changelog and cli-tests

08ef580

mmolari added 2 commits June 24, 2023 13:40

feat: add gap positions consistency check

53bc71b

chore: consistency check flag v -> t

c461450

mmolari added 3 commits June 27, 2023 16:14

feat: added random seed in CLI

9f7f3c1

chore: changed docs

e1c5775

chore: small changelog correction

9c0af7b

mmolari merged commit 54c5019 into master Jun 27, 2023
1 check passed

mmolari deleted the fix/issue-56 branch June 27, 2023 15:02
