Memory issue #25

Open
shernadi opened this issue Jan 29, 2021 · 11 comments

@shernadi

Hi, I am trying to cluster some ONT cDNA data using RATTLE; however, I am getting the following error:

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)

I guess it is a memory allocation issue. I am running RATTLE on an AWS instance with 2 TB of RAM and 128 cores.
It seems RATTLE crashes while using only around 150 GB of RAM. I tried running it with fewer cores (64, 32) but got the same error. My sample contains 14 million reads, but I also tried downsampling it to 2.5 million and still ended up with the same error.
Would you have any suggestions as to what the problem might be?

my example code:
./rattle cluster -I 2.5.million.fastq -t 64 -o out/ --fastq --iso

Many thanks,

Szabolcs

@novikk
Collaborator

novikk commented Jan 30, 2021

Hello,

Yes, it will probably have problems with 14M reads (either time or memory problems), but it should be fine with the 2.5M reads. Does the problem happen at the very beginning?

Can you check whether there are reads shorter than the k-mer size? (We normally filter out reads shorter than 100–150 bp before running RATTLE anyway.)
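
For example, a minimal awk sketch for that length filter, assuming standard 4-line fastq records (file names here are just placeholders):

  awk 'NR%4==1{h=$0} NR%4==2{s=$0} NR%4==3{p=$0} NR%4==0{if (length(s) >= 150) print h"\n"s"\n"p"\n"$0}' reads.fastq > reads.150filter.fastq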

Best,
Ivan

@jamlover

jamlover commented Mar 2, 2021

I am experiencing a memory issue too, perhaps because of the amount of data I am dealing with. I have 56M ONT reads, about 108 GB of data. The sequences were filtered to include only reads at least 150 bases long. I have tried reducing the number of threads and the number of input reads, but during clustering I keep getting either the bad_alloc error or a segmentation fault (the latter first appearing after I reduced my input to about 1/4 of the total reads). As I write, I am running roughly 1/16 of my 56M reads on 24 threads, waiting to see what happens.

My computer has 56 available threads and 512 GB of RAM. I just attempted the same 1/16 input (3.5M reads) on 48 threads and got the segmentation fault. When I first tested clustering with a single 4,000-read fastq file, clustering worked. I have 14,157 such 4,000-read files.

  1. I'd appreciate any help in getting the program to run properly for me.
  2. If splitting my input data into smaller sets that I run individually turns out to be the way to get clustering to work, what would be the appropriate workflow to merge the clustering results and continue? Or, more generally, what overall workflow would you suggest to produce the final set of transcript isoforms I am after? (The sketch after this list shows how I am splitting the input.)
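
For reference, I am splitting the fastq with GNU split, keeping the line count a multiple of 4 so records stay intact (14,000,000 lines = 3.5M reads; file names are placeholders):

  split -l 14000000 -d --additional-suffix=.fastq all.150filter.fastq part_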

FYI, here is my most recent failed command with error message.

rattle cluster -i all.14Mlines.150filter.fastq -t 48 --fastq --iso -o rattle14M_output
RNA mode: 0
Reading fasta file... Done
Segmentation fault ] 1/3536141 (2.82794e-05%)

Thanks,
John Martinson

@jamlover

jamlover commented Mar 3, 2021

An update to my previous comment. I thought that by reducing the input to about 1.77M reads I had finally succeeded, as the cluster command ran for over six hours, but then a seg fault kicked in again...

rattle cluster -i all.7Mlines.150filter.fastq -t 48 --fastq --iso -o rattle7M_output
RNA mode: 0
Reading fasta file... Done
[================================================================================] 1768069/1768069 (100%)
[================================================================================] 200218/200218 (100%)
Iteration 0.35 complete
Segmentation fault ] 1/121301 (0.000824396%)

I have cut the input in half again, hoping it succeeds; if it does, I would again appreciate advice on how to merge my results.

John Martinson

@EduEyras
Member

EduEyras commented Mar 3, 2021 via email

@jamlover

jamlover commented Mar 3, 2021

Thanks for the response. I will definitely look into your suggestions and see whether some of them help. At present, though, it looks as if my current clustering attempt will almost certainly complete successfully; it uses 1/64 of my total available input reads. If I am able to complete clustering this way, as 64 subsets of my input, would it make sense to run each 1/64th through the subsequent correction and polishing steps separately and then merge the 64 final sets of isoforms afterwards, or would you suggest a better approach (perhaps using some of the suggestions you made)?

Thanks again,
John

Update: Clustering failed on one of my 1/64th fractions; seg fault again. I am now focusing on your suggestion of removing long reads (I had already removed those shorter than 150 bases from my input).

@jamlover

Eduardo,

I am providing another update on the progress of my clustering. Dividing my input data into fourths, trimming adapters with porechop, and keeping only sequences in the range of 150 bases to 40 kb appears to have worked. I also used the parameters you suggested above. Two of the four clustering runs have completed, and the other two should finish in the next day or two. Each run takes about 5–6 days in total, with approximately 12–13 million input reads per set, using 24 threads.
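
For the record, my preprocessing was along these lines (porechop with mostly default options; the awk filter assumes 4-line fastq records; file names are placeholders):

  porechop -i part1.fastq -o part1.trimmed.fastq --threads 24
  awk 'NR%4==1{h=$0} NR%4==2{s=$0} NR%4==3{p=$0} NR%4==0{if (length(s) >= 150 && length(s) <= 40000) print h"\n"s"\n"p"\n"$0}' part1.trimmed.fastq > part1.filtered.fastq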

I said "appears to have worked" for the following reason. The output from the clustering command for one of my data sets said the following: "144861 gene clusters found". The .csv clusters summary file produced however seems to show about 4.5 times as many clusters, and when I extract the clusters there are 645,571 cluster fastq files created. Can you explain the discrepancy?

Thanks,
John

@EduEyras
Member

EduEyras commented Mar 26, 2021 via email

@jamlover

Some more fodder for the discussion. Regarding your comment about consensus sequences, the sequences in the cluster sequence files and the clusters in the summary file do not include consensus sequences; both sum exactly to the number of input reads, 12,415,585. I then thought that the 144,861 figure I reported above might reflect the number of clusters with at least X reads, so I tried a few values. X=4 came close but was not quite right: there are 142,459 clusters with >= 4 members.
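
For what it's worth, this is how I counted them, assuming the summary CSV has one row per read with the cluster ID in the second column (adjust the column index, delimiter, and any header line to match the actual file; the file name is a placeholder):

  awk -F',' '{c[$2]++} END{n=0; for (k in c) if (c[k] >= 4) n++; print n}' clusters_summary.csv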

John

@EduEyras
Member

EduEyras commented Mar 27, 2021 via email

@novikk
Collaborator

novikk commented Mar 27, 2021

Hi @jamlover!

Glad you managed to run the clustering step on those big datasets. Regarding your question, it looks like you ran RATTLE with the --iso option. This creates isoform clusters instead of gene clusters; internally, RATTLE first generates gene clusters, which are then split into isoform clusters. So the number you see is the number of gene clusters RATTLE found internally, which were later split into 645,571 isoform clusters.

From here, you have several options. If you just want to work with the clusters, you are ready to go, but you should probably filter out those that contain a low number of reads (e.g. <= 5 reads). You can do this with the cluster summary file, as sketched below.
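
For example, assuming the summary is a CSV with the read ID in the first column and the cluster ID in the second (check your file's header; names here are placeholders), a two-pass awk keeps only the reads from clusters with more than 5 reads:

  awk -F',' 'NR==FNR{c[$2]++; next} c[$2] > 5' clusters_summary.csv clusters_summary.csv > clusters_filtered.csv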

If you want to continue the pipeline, the next step is to correct the reads using the rattle correct option. This will generate a file with the corrected reads (by default, only those reads that belong to a cluster with >5 reads are corrected, but you can set this as a parameter). It will also generate a file with the uncorrected reads, and another file (consensi.fq) that contains one consensus sequence for each of the corrected clusters.

Once you have a consensi file from each part of your dataset, you can merge them and run a rattle polish step on the merged file. This will generate a final "transcriptome.fq" file with the final transcripts from your dataset. If the polish step takes too long, you might want to do another round of cluster + correct with the merged consensi files.
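
Schematically, the per-part flow is the sketch below. The exact flag names here are assumptions, so check rattle correct --help and rattle polish --help on your build before running it:

  # correct each part against its own clustering output (flags assumed)
  rattle correct -i part1.filtered.fastq -c part1_out/clusters.out -o part1_out -t 24
  # ...repeat for the remaining parts, then merge the per-part consensus files
  cat part*_out/consensi.fq > consensi.all.fq
  # polish the merged consensi into the final transcriptome.fq (flags assumed)
  rattle polish -i consensi.all.fq -o final_out -t 24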

If you need the actual quantification of the whole dataset, you might need to do something a bit different. If that's the case, contact me via email when you have all the consensi.fq files and I will help you with that, since it's not implemented yet in the main RATTLE program.

Ivan

@jamlover

jamlover commented Mar 27, 2021 via email
