-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support CIGARs longer than 65535 operations #263
Support CIGARs longer than 65535 operations #263
Conversation
Codecov Report
@@ Coverage Diff @@
## master #263 +/- ##
==========================================
+ Coverage 88.79% 88.85% +0.06%
==========================================
Files 78 78
Lines 6406 6444 +38
Branches 454 454
==========================================
+ Hits 5688 5726 +38
Misses 264 264
Partials 454 454
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for working on this feature 👍
As this is related to existing basic BAM I/O, we need to guarantee that these changes don't significantly affect the performance.
- Add a benchmarking script
- Share a comparison between master and the PR
- Update your patch until the performance diff gets small enough
b2a70e6
to
5d1b980
Compare
I added benchmark scripts for BAM I/O and did a run. Benchmark cljam/bench/cljam/io/sam_bench.clj Lines 9 to 22 in f98e865
cljam/bench/cljam/io/sam_bench.clj Lines 24 to 37 in f98e865
cljam/bench/cljam/io/sam_bench.clj Lines 39 to 49 in f98e865
cljam/bench/cljam/io/sam_bench.clj Lines 51 to 59 in f98e865
master encode-alignment-short-bench (sam_bench.clj:9)
["test-resources/sam/test.sam"]
time: 77.576443 µs, sd: 0.002337 ns
["test-resources/sam/medium.sam"]
time: 200.517613 ms, sd: 10.751187 µs
encode-alignment-long-bench (sam_bench.clj:24)
["/Users/tmatsuda/clj/cljam-master/cljam/.cavia/large.bam"]
time: 21.834391 sec, sd: 338.138984 µs
decode-bam-alignment-short-bench (sam_bench.clj:38)
["test-resources/bam/test.bam" false]
time: 65.173949 µs, sd: 0.001506 ns
["test-resources/bam/test.bam" true]
time: 67.371902 µs, sd: 0.000932 ns
["test-resources/bam/medium.bam" false]
time: 33.901350 ms, sd: 196.473189 ns
["test-resources/bam/medium.bam" true]
time: 45.329754 ms, sd: 196.549624 ns
decode-bam-alignment-long-bench (sam_bench.clj:50)
["/Users/tmatsuda/clj/cljam-master/cljam/.cavia/large.bam"]
time: 22.927089 sec, sd: 28.104160 ms PR ["test-resources/sam/test.sam"]
time: 77.841573 µs, sd: 0.004062 ns
["test-resources/sam/medium.sam"]
time: 202.487821 ms, sd: 47.342113 µs
encode-alignment-long-bench (sam_bench.clj:24)
["/Users/tmatsuda/clj/cljam_original/cljam/.cavia/large.bam"]
time: 21.959993 sec, sd: 7.933135 ms
decode-bam-alignment-short-bench (sam_bench.clj:39)
["test-resources/bam/test.bam" false]
time: 68.806201 µs, sd: 0.004801 ns
["test-resources/bam/test.bam" true]
time: 69.716472 µs, sd: 0.000884 ns
["test-resources/bam/medium.bam" false]
time: 34.506268 ms, sd: 249.058201 ns
["test-resources/bam/medium.bam" true]
time: 46.024133 ms, sd: 254.015404 ns
decode-bam-alignment-long-bench (sam_bench.clj:51)
["/Users/tmatsuda/clj/cljam_original/cljam/.cavia/large.bam"]
time: 23.349393 sec, sd: 65.608371 ms |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for adding bench scripts 👍
I think we need to reorganize commits so please force-push to this branch after editing.
- Rebase to master. This PR contains an irrelevant commit.
- Reorder commits. Add benchmarks first, then modify implementations so that you can go back and forth between two implementations without changing benchmark scripts.
Please keep in mind that we need to update the benchmark results when you changed the scripts.
5d1b980
to
9caba03
Compare
Thanks for the review! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for revising the patch 👍
I added two suggestions and I found there's a performance degradation of about >40% when these changes are applied.
As this is unacceptable, would you investigate the cause of this problem and try to improve the performance?
While editing the code, I noticed that in some of the cljam/test/cljam/test_common.clj Line 143 in 843e30b
P.S. The implementation is now capable of supporting blank( |
9caba03
to
912535f
Compare
912535f
to
c80d76e
Compare
1014eda
to
f39806d
Compare
I have updated the commits and 'PR' benchmark results : #263 (comment). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check the warnings raised by linters
https://github.com/chrovis/cljam/actions/runs/3393192271
f39806d
to
157df00
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, please check the unresolved previous comments:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this feature! I've added some comments as far as I can check so far.
src/cljam/io/bam/encoder.clj
Outdated
(defn encode-alignment [wrtr aln refs] | ||
(let [aln (update aln :seq #(if (= % "*") "" %))] | ||
(let [aln (update aln :seq #(if (= % "*") "" %)) | ||
cigar-count-op (cigar/count-op (:cigar aln)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cigar-count-op
is a little bit tricky naming and it's somewhat hard to read it as "the number of the CIGAR ops".
How about naming it cigar-ops-count
or n-cigar-ops
?
src/cljam/io/bam/encoder.clj
Outdated
(let [aln (update aln :seq #(if (= % "*") "" %))] | ||
(let [aln (update aln :seq #(if (= % "*") "" %)) | ||
cigar-count-op (cigar/count-op (:cigar aln)) | ||
[cigar->placeholder? cigar-count-op* op*] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this function, the word op
is used for two different meanings (the one for a CIGAR op and the one for an option), which is a little confusing. How about renaming the latter to opt
/opts
for readability?
test/cljam/io/bam/encoder_test.clj
Outdated
@@ -112,3 +114,105 @@ | |||
0x00 0x00 0x00 0x00 | |||
0x00 0x00 0xb8 0x40 | |||
0xff 0xff 0x7f 0x7f]))) | |||
|
|||
(deftest add-cigar-to-options-test | |||
(are [cigar expected-option-list sample-option-list] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you put the expected value in the middle, we would hardly tell which value is which at a glance, so I think
[cigar sample-option-list expected-option-list]
or
[expected-option-list cigar sample-option-list]
would be preferable.
Also, to be consistent with other test cases, these placeholder names should be prefixed with ?
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for updating. We're almost done 👍
I added some comments that I missed to point out last time 🙇
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your quick response. The patch looks good now 👍
Please squash your commits after getting approved by all reviewers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for patiently improving the PR! It looks good now 👍
Let me know when the commits are ready for merging after squashing.
3544797
to
1183ebd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, this PR slipped my mind! I'm merging it.
This PR resolves #142 .
This PR enables decoding and encoding support for SAM/BAM files containing CIGARs with operation longer than 65535 operations.
According to Sequence Alignment/Map Format Specification, if the CIGAR in the BAM file alignment has an operation longer than 65535, the CIGAR value is replaced by a placeholder of the form 'kSmN' and the original CIGAR value is moved into the options.
Previously, cljam's decoding and encoding process did not support this behavior.
In particular, I would like you to pay attention to the following items in your review.