Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support CIGARs longer than 65535 operations #263

Merged

Conversation

matsutomo81
Copy link
Contributor

This PR resolves #142 .
This PR enables decoding and encoding support for SAM/BAM files containing CIGARs with operation longer than 65535 operations.

According to Sequence Alignment/Map Format Specification, if the CIGAR in the BAM file alignment has an operation longer than 65535, the CIGAR value is replaced by a placeholder of the form 'kSmN' and the original CIGAR value is moved into the options.

Previously, cljam's decoding and encoding process did not support this behavior.

In particular, I would like you to pay attention to the following items in your review.

  • Are function and variable names reasonable?
  • Should comments be added to make the code easier to understand?
  • Does this modification significantly reduce processing speed?

@codecov
Copy link

codecov bot commented Sep 1, 2022

Codecov Report

Merging #263 (6255a16) into master (843e30b) will increase coverage by 0.06%.
The diff coverage is 91.66%.

❗ Current head 6255a16 differs from pull request most recent head 1183ebd. Consider uploading reports for the commit 1183ebd to get more accurate results

@@            Coverage Diff             @@
##           master     #263      +/-   ##
==========================================
+ Coverage   88.79%   88.85%   +0.06%     
==========================================
  Files          78       78              
  Lines        6406     6444      +38     
  Branches      454      454              
==========================================
+ Hits         5688     5726      +38     
  Misses        264      264              
  Partials      454      454              
Impacted Files Coverage Δ
src/cljam/io/bam/encoder.clj 70.70% <78.57%> (+3.67%) ⬆️
src/cljam/io/bam/decoder.clj 93.06% <95.00%> (+0.04%) ⬆️
src/cljam/io/sam/util/cigar.clj 94.62% <100.00%> (+0.95%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

Copy link
Member

@alumi alumi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working on this feature 👍
As this is related to existing basic BAM I/O, we need to guarantee that these changes don't significantly affect the performance.

  • Add a benchmarking script
  • Share a comparison between master and the PR
  • Update your patch until the performance diff gets small enough

@matsutomo81 matsutomo81 force-pushed the support_CIGARs_longer_than_65535_operations branch 4 times, most recently from b2a70e6 to 5d1b980 Compare September 28, 2022 02:23
@matsutomo81
Copy link
Contributor Author

matsutomo81 commented Sep 28, 2022

I added benchmark scripts for BAM I/O and did a run.
Code and results have been updated in response to the review.

Benchmark

(defbench encode-alignment-short-bench
(are [f]
(util/with-temp-dir [d "encode-alignment-short-bench"]
(with-open [r (sam/reader f)
w (sam/writer (cio/file d "out.bam"))]
(let [header (sam/read-header r)
xs (vec (sam/read-alignments r))]
(sam/write-header w header)
(sam/write-refs w header)
(doseq [x xs o (:options x)] o)
(c/quick-bench
(sam/write-alignments w xs header)))))
tcommon/test-sam-file
tcommon/medium-sam-file))

(defbench encode-alignment-long-bench
(tcommon/with-before-after {:before (tcommon/prepare-cavia!)}
(are [f]
(util/with-temp-dir [d "encode-alignment-long-bench"]
(with-open [r (sam/reader f)
w (sam/writer (cio/file d "out.bam"))]
(let [header (sam/read-header r)
xs (vec (take 2000000 (sam/read-alignments r)))]
(sam/write-header w header)
(sam/write-refs w header)
(doseq [x xs o (:options x)] o)
(c/quick-bench
(sam/write-alignments w xs header)))))
tcommon/large-bam-file)))

(defbench decode-bam-alignment-short-bench
(are [f decode-opts?]
(c/quick-bench
(with-open [r (sam/reader f)]
(doseq [a (sam/read-alignments r)
opts (when decode-opts? (:options a))]
opts)))
tcommon/test-bam-file false
tcommon/test-bam-file true
tcommon/medium-bam-file false
tcommon/medium-bam-file true))

(defbench decode-bam-alignment-long-bench
(tcommon/with-before-after {:before (tcommon/prepare-cavia!)}
(are [f]
(c/quick-bench
(with-open [r (sam/reader f)]
(doseq [a (vec (take 10000000 (sam/read-alignments r)))
opts (:options a)]
opts)))
tcommon/large-bam-file)))

master

encode-alignment-short-bench (sam_bench.clj:9)

  ["test-resources/sam/test.sam"]
  time: 77.576443 µs, sd: 0.002337 ns

  ["test-resources/sam/medium.sam"]
  time: 200.517613 ms, sd: 10.751187 µs

encode-alignment-long-bench (sam_bench.clj:24)

  ["/Users/tmatsuda/clj/cljam-master/cljam/.cavia/large.bam"]
  time: 21.834391 sec, sd: 338.138984 µs

decode-bam-alignment-short-bench (sam_bench.clj:38)

  ["test-resources/bam/test.bam" false]
  time: 65.173949 µs, sd: 0.001506 ns

  ["test-resources/bam/test.bam" true]
  time: 67.371902 µs, sd: 0.000932 ns

  ["test-resources/bam/medium.bam" false]
  time: 33.901350 ms, sd: 196.473189 ns

  ["test-resources/bam/medium.bam" true]
  time: 45.329754 ms, sd: 196.549624 ns

decode-bam-alignment-long-bench (sam_bench.clj:50)

  ["/Users/tmatsuda/clj/cljam-master/cljam/.cavia/large.bam"]
  time: 22.927089 sec, sd: 28.104160 ms

PR

  ["test-resources/sam/test.sam"]
  time: 77.841573 µs, sd: 0.004062 ns

  ["test-resources/sam/medium.sam"]
  time: 202.487821 ms, sd: 47.342113 µs

encode-alignment-long-bench (sam_bench.clj:24)

  ["/Users/tmatsuda/clj/cljam_original/cljam/.cavia/large.bam"]
  time: 21.959993 sec, sd: 7.933135 ms

decode-bam-alignment-short-bench (sam_bench.clj:39)

  ["test-resources/bam/test.bam" false]
  time: 68.806201 µs, sd: 0.004801 ns

  ["test-resources/bam/test.bam" true]
  time: 69.716472 µs, sd: 0.000884 ns

  ["test-resources/bam/medium.bam" false]
  time: 34.506268 ms, sd: 249.058201 ns

  ["test-resources/bam/medium.bam" true]
  time: 46.024133 ms, sd: 254.015404 ns

decode-bam-alignment-long-bench (sam_bench.clj:51)

  ["/Users/tmatsuda/clj/cljam_original/cljam/.cavia/large.bam"]
  time: 23.349393 sec, sd: 65.608371 ms

Copy link
Member

@alumi alumi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for adding bench scripts 👍

I think we need to reorganize commits so please force-push to this branch after editing.

  • Rebase to master. This PR contains an irrelevant commit.
  • Reorder commits. Add benchmarks first, then modify implementations so that you can go back and forth between two implementations without changing benchmark scripts.

Please keep in mind that we need to update the benchmark results when you changed the scripts.

bench/cljam/io/bam/decoder_bench.clj Outdated Show resolved Hide resolved
bench/cljam/io/bam/decoder_bench.clj Outdated Show resolved Hide resolved
bench/cljam/io/bam/encoder_bench.clj Outdated Show resolved Hide resolved
bench/cljam/io/bam/decoder_bench.clj Outdated Show resolved Hide resolved
bench/cljam/io/bam/encoder_bench.clj Outdated Show resolved Hide resolved
bench/cljam/io/bam/encoder_bench.clj Outdated Show resolved Hide resolved
bench/cljam/io/bam/decoder_bench.clj Outdated Show resolved Hide resolved
@matsutomo81 matsutomo81 force-pushed the support_CIGARs_longer_than_65535_operations branch from 5d1b980 to 9caba03 Compare September 30, 2022 09:45
@matsutomo81
Copy link
Contributor Author

Thanks for the review!
Code and benchmark results have been updated in response to the review.

Copy link
Member

@alumi alumi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for revising the patch 👍
I added two suggestions and I found there's a performance degradation of about >40% when these changes are applied.
As this is unacceptable, would you investigate the cause of this problem and try to improve the performance?

bench/cljam/io/sam_bench.clj Outdated Show resolved Hide resolved
src/cljam/io/bam/decoder.clj Outdated Show resolved Hide resolved
@matsutomo81
Copy link
Contributor Author

matsutomo81 commented Oct 5, 2022

While editing the code, I noticed that in some of the medium-bam-file alignments, the cigar was blank (:cigar "").
Is such a case possible in the actual data? Should we support this kind of pattern?

(def medium-bam-file "test-resources/bam/medium.bam")

P.S. The implementation is now capable of supporting blank(:cigar "").

@matsutomo81 matsutomo81 force-pushed the support_CIGARs_longer_than_65535_operations branch from 9caba03 to 912535f Compare October 11, 2022 10:20
@matsutomo81 matsutomo81 force-pushed the support_CIGARs_longer_than_65535_operations branch from 912535f to c80d76e Compare October 12, 2022 02:00
src/cljam/io/bam/decoder.clj Outdated Show resolved Hide resolved
src/cljam/io/bam/decoder.clj Outdated Show resolved Hide resolved
src/cljam/io/sam/util/cigar.clj Outdated Show resolved Hide resolved
src/cljam/io/bam/decoder.clj Outdated Show resolved Hide resolved
src/cljam/io/bam/encoder.clj Outdated Show resolved Hide resolved
test/cljam/io/bam/decoder_test.clj Show resolved Hide resolved
test/cljam/io/bam/decoder_test.clj Outdated Show resolved Hide resolved
test/cljam/io/bam/decoder_test.clj Outdated Show resolved Hide resolved
test/cljam/io/bam/decoder_test.clj Outdated Show resolved Hide resolved
src/cljam/io/bam/decoder.clj Outdated Show resolved Hide resolved
@alumi alumi marked this pull request as ready for review October 13, 2022 13:30
@alumi alumi requested a review from a team as a code owner October 13, 2022 13:30
@alumi alumi requested review from athos and removed request for a team October 13, 2022 13:30
@matsutomo81 matsutomo81 force-pushed the support_CIGARs_longer_than_65535_operations branch 2 times, most recently from 1014eda to f39806d Compare November 4, 2022 10:51
@matsutomo81
Copy link
Contributor Author

I have updated the commits and 'PR' benchmark results : #263 (comment).

Copy link
Member

@alumi alumi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check the warnings raised by linters
https://github.com/chrovis/cljam/actions/runs/3393192271

@matsutomo81 matsutomo81 force-pushed the support_CIGARs_longer_than_65535_operations branch from f39806d to 157df00 Compare November 7, 2022 12:19
Copy link
Member

@alumi alumi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

src/cljam/io/bam/encoder.clj Outdated Show resolved Hide resolved
src/cljam/io/bam/encoder.clj Outdated Show resolved Hide resolved
src/cljam/io/sam/util/cigar.clj Outdated Show resolved Hide resolved
test/cljam/io/bam/decoder_test.clj Outdated Show resolved Hide resolved
src/cljam/io/bam/encoder.clj Outdated Show resolved Hide resolved
test/cljam/io/bam/decoder_test.clj Outdated Show resolved Hide resolved
test/cljam/io/bam/encoder_test.clj Outdated Show resolved Hide resolved
test/cljam/io/bam/encoder_test.clj Outdated Show resolved Hide resolved
test/cljam/io/bam/encoder_test.clj Outdated Show resolved Hide resolved
test/cljam/io/bam/encoder_test.clj Outdated Show resolved Hide resolved
Copy link
Member

@athos athos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this feature! I've added some comments as far as I can check so far.

(defn encode-alignment [wrtr aln refs]
(let [aln (update aln :seq #(if (= % "*") "" %))]
(let [aln (update aln :seq #(if (= % "*") "" %))
cigar-count-op (cigar/count-op (:cigar aln))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cigar-count-op is a little bit tricky naming and it's somewhat hard to read it as "the number of the CIGAR ops".
How about naming it cigar-ops-count or n-cigar-ops?

(let [aln (update aln :seq #(if (= % "*") "" %))]
(let [aln (update aln :seq #(if (= % "*") "" %))
cigar-count-op (cigar/count-op (:cigar aln))
[cigar->placeholder? cigar-count-op* op*]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this function, the word op is used for two different meanings (the one for a CIGAR op and the one for an option), which is a little confusing. How about renaming the latter to opt/opts for readability?

@@ -112,3 +114,105 @@
0x00 0x00 0x00 0x00
0x00 0x00 0xb8 0x40
0xff 0xff 0x7f 0x7f])))

(deftest add-cigar-to-options-test
(are [cigar expected-option-list sample-option-list]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you put the expected value in the middle, we would hardly tell which value is which at a glance, so I think

[cigar sample-option-list expected-option-list]

or

[expected-option-list cigar sample-option-list]

would be preferable.

Also, to be consistent with other test cases, these placeholder names should be prefixed with ?.

Copy link
Member

@alumi alumi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for updating. We're almost done 👍
I added some comments that I missed to point out last time 🙇

src/cljam/io/bam/encoder.clj Outdated Show resolved Hide resolved
test-resources/sam/long_cigar_operations.sam Outdated Show resolved Hide resolved
src/cljam/io/bam/decoder.clj Outdated Show resolved Hide resolved
Copy link
Member

@alumi alumi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your quick response. The patch looks good now 👍
Please squash your commits after getting approved by all reviewers.

Copy link
Member

@athos athos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for patiently improving the PR! It looks good now 👍
Let me know when the commits are ready for merging after squashing.

Copy link
Member

@athos athos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, this PR slipped my mind! I'm merging it.

@athos athos merged commit 2cfa13e into chrovis:master Dec 13, 2022
@matsutomo81 matsutomo81 deleted the support_CIGARs_longer_than_65535_operations branch July 25, 2024 08:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support CIGARs longer than 65535 operations
3 participants