Heatmap for Per Sequence Quality Scores is an extra base long #812

Closed
MattBashton opened this issue Aug 6, 2018 · 7 comments
Labels
bug: core Bug in the main MultiQC code

Comments

@MattBashton

Description of bug:
The Per Sequence Quality Scores heatmap is incorrect on the final base (which is actually non-existent). My reads are 100bp long. The individual sample report shows positions numbered 1 to 100, with the final base (100) having sane levels of A, T, C and G, i.e. 24-26% in my various input files. The heatmap, however, produces an image with positions numbered 0-100 (i.e. 101bp in length), and here the final base is always 100% G; this data point is bogus. I have confirmed the actual length of my input reads to be 100bp using a text editor (emacs with column numbering mode enabled). Additionally, these two plots should ideally both be either 1-indexed or 0-indexed, not a mixture of the two.

MultiQC Error log:
None generated

Files that triggers the error:
Case_001_R2_fastqc.zip
Case_001_R1_fastqc.zip

MultiQC run details (please complete the following):

  • Command used to run MultiQC: multiqc <dir>
  • MultiQC Version: MultiQC v1.6
  • Operating System: Ubuntu 16.04.5 LTS
  • Python Version: 3.5.2
  • Method of MultiQC installation: pip install --user

Additional context
FastQC v0.11.7

@ewels ewels added the bug: core Bug in the main MultiQC code label Aug 17, 2018
@ewels
Member

ewels commented Aug 17, 2018

Hi @MattBashton,

Thanks for the detailed report. I guess you are specifically talking about the hover text when moving the cursor over the heatmap? The colouring of the heatmap looks fine to me when I make a report with your files, but I see that the text does report 0-100 with 100% on base 100 as you say.

I've just had a look at the code and this hover text is generated dynamically by taking the cursor position and reverse-engineering the colour code underneath it. The 100% G is because the browser is incorrectly reporting rgb(0,0,0) at the far extreme of the plot. I don't think I can do anything about this as it's due to the core browser canvas rendering. The base position 0 is because it's doing Math.floor of the current x position, so when you get to less than 1bp it reports 0. I've just switched this for Math.round and added a check to not allow it to report 0bp. This should now be available in MultiQC v1.7dev.
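To make the off-by-one concrete, here is a minimal sketch of the two behaviours described above. It is not the actual MultiQC code: the function names, the pixel-to-base scaling and the plot width are assumptions for illustration, and `math.floor(x + 0.5)` stands in for JavaScript's `Math.round`.

```python
import math


def hover_base_floor(x_px, plot_width_px, read_length):
    """Old behaviour as described: floor of the scaled cursor position."""
    return math.floor(x_px / plot_width_px * read_length)


def hover_base_rounded(x_px, plot_width_px, read_length):
    """New behaviour as described: round to the nearest base, never report 0bp."""
    # floor(x + 0.5) rounds halves upwards, as JavaScript's Math.round does
    pos = math.floor(x_px / plot_width_px * read_length + 0.5)
    return max(pos, 1)


# Hypothetical 1000px wide plot of 101bp reads, cursor 3px from the left edge
print(hover_base_floor(3, 1000, 101))    # 0
print(hover_base_rounded(3, 1000, 101))  # 1
```

With flooring, any cursor position that scales to less than 1bp reports base 0; rounding plus the lower bound keeps the readout at 1 or above.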

Note that it can be a little tricky to get precisely the correct base numbers because of the way that FastQC reports this data. It does so with a base position range, which can vary:

94-95	23.746249462902973	26.6517600021941	26.067876874506705	23.534113660396216
96-97	23.967289520566574	26.26382242694542	25.883346696990806	23.885541355497196
98-99	23.938792540213242	26.38431566079754	26.243589803749835	23.43330199523938
100-101	24.42109582916103	26.361364398196013	25.320439466630525	23.897100306012433

The data behind the MultiQC plot needs to have single data points (for the line graph), so I get around this by just picking the average value of the range and using that.
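As a rough illustration of that averaging step (a sketch only, with hypothetical function names, not the MultiQC parser itself), a range label such as `98-99` can be reduced to a single x value like this:

```python
def base_range_midpoint(label):
    """Average position of a FastQC base label such as '98-99' or '7'."""
    parts = [int(p) for p in label.split("-")]
    return sum(parts) / len(parts)


def parse_content_row(line):
    """Split one tab-separated row into (label, midpoint, four percentages)."""
    fields = line.rstrip("\n").split("\t")
    label = fields[0]
    values = [float(v) for v in fields[1:5]]
    return label, base_range_midpoint(label), values


# Last row of the data quoted above
row = "100-101\t24.42109582916103\t26.361364398196013\t25.320439466630525\t23.897100306012433"
label, midpoint, values = parse_content_row(row)
print(label, midpoint)  # 100-101 100.5
```

The midpoint of a two-base range is not a whole base position, which is one reason per-base readouts near the end of the read need care.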

In short - it's best to be a little careful with either a MultiQC or FastQC report if you really need per-base accuracy, especially at the end of reads.

Interestingly the FastQC HTML reports themselves say that your reads are 101bp long! Maybe a bug in FastQC..?

Phil

@ewels ewels closed this as completed in 26f894e Aug 17, 2018
@MattBashton
Author

Hi Phil, just to confirm: will the fix prevent the 100% G base being reported on mouse-over? I feel this is important because checking for poor calls at the end of reads is one of the main use cases of both FastQC and MultiQC prior to trimming, so eliminating this bug is a good thing. Given the behaviour at present, one would incorrectly conclude that their reads were all 100% G at position 100 (which is of course a bogus position 101 if we assume 1-based indexing, which is reasonable). Misinterpreting this readout could cause the end user to consider trimming base 100, which would be unnecessary. You state that you can't prevent the 100% G being reported? This is a critical issue IMO.

Another problem with this plot is that the averages (bar the final position, 100) appear to be reported in bins of two, so positions 0-1 share one value, as do 1-2 and 2-3. There is a further issue: every base is reported twice. Moving left to right, the mouse-over shows stats at the top of the plot for positions 0, 1, 1, 2, 2, but each successive report of base frequencies differs. For read 1, bp 2 is initially reported as 34, 21, 22, 23%, but move 1 pixel to the left and the readout for bp 2 is now 26, 24, 25, 25. This is somewhat broken: how is the end user supposed to determine the correct value for the 2nd base pair when two are reported? I fully understand that FastQC reports the underlying data in 2bp bins, which is why this happens, but the heatmap needs to be transparent about this, otherwise misinterpretation can occur. Perhaps you should render the heatmap at 2bp resolution, rather than reporting every data point twice with different values?

Yes, I had noticed that the current FastQC reports these reads to be 101bp in length, although its plots don't run to 101bp. I suspect this is also a bug in FastQC, although as it is downstream input it's one you might want to be aware of.

@MattBashton
Author

I've attached my report output to show what I mean.

multiqc_report_error.html.zip

@MattBashton
Author

I should update this: the reads are indeed 101bp long. I forgot that emacs uses 0-based column numbering whilst checking. However, base 101 being reported as 100% G is clearly not correct.

ewels added a commit that referenced this issue Aug 17, 2018
Show the original base pair range label when hovering on the sequence content line plot data. see #812.
ewels added a commit that referenced this issue Aug 17, 2018
Use the original data instead of calculating it from the visible pixels. See issue #812
@ewels
Member

ewels commented Aug 17, 2018

I was a little concerned that implementing the changes you suggest would involve rewriting a tonne of the code. However, after having a poke around and refreshing myself with how the plot is created, I realised that most of the pieces were already in place. I've refactored the code for both the heatmap and the line plot to show the original FastQC base pair range. I've also updated the heatmap to show the raw base percentages instead of back-converting from the colours (though with no decimal places showing it's unlikely that this will make a detectable difference). The side effect of pulling the original data in this way is that it's fixed the 100% G bug 🎉
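The idea of reading the hover values from the original data rather than from the pixels can be sketched roughly as follows. This is an illustration under assumptions, not the committed code: the `BINS` structure and `bin_for_position` are hypothetical names, and the percentages are the rows quoted earlier in this thread, rounded to one decimal place.

```python
import bisect

# (range label, first base, last base, four percentages), rounded from the rows above
BINS = [
    ("94-95", 94, 95, (23.7, 26.7, 26.1, 23.5)),
    ("96-97", 96, 97, (24.0, 26.3, 25.9, 23.9)),
    ("98-99", 98, 99, (23.9, 26.4, 26.2, 23.4)),
    ("100-101", 100, 101, (24.4, 26.4, 25.3, 23.9)),
]


def bin_for_position(bins, pos):
    """Return the original bin covering base `pos`, or None if no bin does."""
    starts = [b[1] for b in bins]
    i = bisect.bisect_right(starts, pos) - 1  # last bin starting at or before pos
    if i < 0 or pos > bins[i][2]:
        return None
    return bins[i]


# Both bases of a bin give the same label and the same raw percentages
print(bin_for_position(BINS, 98)[0])   # 98-99
print(bin_for_position(BINS, 99)[0])   # 98-99
print(bin_for_position(BINS, 101)[0])  # 100-101
```

Because the readout comes straight from the stored bin, a base can no longer show two different sets of values depending on which pixel is under the cursor, and no value is derived from the colour at the plot edge.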

I quite enjoyed fixing this one! Please give it a try and confirm that it works as expected now.

ewels added a commit that referenced this issue Aug 17, 2018
Added 3px x offset for picking up the base position when hovering on the sequence content heatmap. Done empirically...! See #812
@MattBashton
Author

Hey, glad to hear you enjoyed fixing it :) I've re-run it at my end, and the output looks flawless with respect to this issue. So glad this got fixed, which was my motivation for reporting it! Attached is the output.
multiqc_report_fixed.html.zip

@ewels
Member

ewels commented Aug 17, 2018

Great! Thanks for reporting and testing 👍
