
Clair3 cannot produce gvcf file (?) #88

Closed

tuannguyen8390 opened this issue Apr 3, 2022 · 11 comments
@tuannguyen8390

Hi Clair3 team,

I recently needed to use Clair3's gVCF output instead of the regular VCF. On our system, we have a population-scale dataset of hundreds of ONT samples. I recall the pipeline ran OK before I inserted the --gvcf flag.
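For context, the invocation looks roughly like this; the paths, model directory, and thread count below are placeholders rather than the exact command I ran:

# sketch of the Clair3 run with gVCF output enabled (placeholder paths)
run_clair3.sh \
    --bam_fn=sample.bam \
    --ref_fn=reference.fasta \
    --threads=24 \
    --platform="ont" \
    --model_path=/path/to/ont_model \
    --output=/path/to/output \
    --gvcf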

I wonder if disk space is the issue, similar to what was highlighted in #48.

Happy to send over the log files if need be.

Best,

Tuan

[INFO] 7/7 Merge pileup VCF and full-alignment VCF
parallel: Error: Output is incomplete.
parallel: Error: Cannot append to buffer file in /tmp.
parallel: Error: Is the disk full?
parallel: Error: Change $TMPDIR with --tmpdir or use --compress.
Warning: unable to close filehandle properly: Cannot allocate memory during global destruction.
real	9m49.897s
user	9m17.384s
sys	0m24.180s


[WARNING] No vcf file found, output empty vcf file
[WARNING] Copying pileup.vcf.gz to /group/dairy/Tuan/Recessive_lethal/Nextflow/results/SNP/Daisy/merge_output.vcf.gz
[INFO] Removing intermediate files in /group/dairy/Tuan/Recessive_lethal/Nextflow/results/SNP/Daisy/tmp

[INFO] Finish calling, output file: /group/dairy/Tuan/Recessive_lethal/Nextflow/results/SNP/Daisy/merge_output.vcf.gz

real	2593m55.463s
user	40179m37.328s
sys	1029m34.529s
@zhengzhenxian
Collaborator

Hi,

It looks like a disk space issue, as parallel cannot write its buffer files (parallel: Error: Cannot append to buffer file in /tmp.). The disk is full, so no files can be written. Since you have reached the last step (STEP 7), you might only need to free up more disk space to finish the final VCF and GVCF merging steps.
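For example, you could check how much space is left and, if /tmp itself is small, point temporary files at a roomier location before rerunning (the scratch path below is a placeholder):

df -h /tmp                           # space left where parallel buffers its output
df -h /group/dairy                   # space left on the drive holding the Clair3 output
export TMPDIR=/path/to/big/scratch   # parallel uses $TMPDIR for its buffer files (see the error above)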

@tuannguyen8390
Author

Yes, it was 100% a disk space issue. My latest two samples (run standalone) did not hit this error. I believe the original failures were due to too many samples being deployed on the queue at the same time.

I'm closing the ticket for now,

Many thanks,

Tuan

@tuannguyen8390
Author

Sorry for re-opening the issue, but after fixing the disk space problem, the error below popped up in the log. Any suggestion as to what this is about?

Cheers,

Tuan

/home/vicsuwd/anaconda3/envs/clair3/bin/scripts/clair3.sh: line 300: 156800 Bus error               (core dumped) ${PYPY} ${CLAIR3} SortVcf --input_dir ${TMP_FILE_PATH}/merge_output --vcf_fn_prefix "merge" --vcf_fn_suffix ".gvcf" --output_fn ${OUTPUT_FOLDER}/merge_output.gvcf --sampleName ${SAMPLE} --ref_fn ${REFERENCE_FILE_PATH} --contigs_fn ${TMP_FILE_PATH}/CONTIGS

@tuannguyen8390 reopened this Apr 4, 2022
@aquaskyline
Member

Hi, this is the first time we have received a report of a bus error. A bus error is raised by the hardware to notify the operating system of invalid memory access, so the problem is more likely at the operating-system level (or perhaps with the Python or PyPy installation; we are just guessing). If the problem were in Clair3, a segmentation fault, which also indicates invalid memory access but is caught by the operating system, would be more likely.

@tuannguyen8390
Author

Hi Ruibang,

Thanks for your reply. I currently run Clair3 with 24 CPUs and 64 GB of RAM, but I'm able to increase the RAM if need be.

I'm rerunning the test samples again, now with a slightly modified pipeline.

As mentioned previously, it seems that dumping multiple runs into the same disk location caused some issues on our institute's server (the original issue). I now direct the Clair3 output to a separate scratch location (each compute node is attached to its own temp drive), and only when Clair3 finishes are the output files moved back to the shared drive, as sketched below.
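Roughly, the pattern looks like this (the scratch path and run_clair3.sh arguments are placeholders for our Slurm setup):

# node-local temp drive attached to each compute node (placeholder path)
SCRATCH=/scratch/$SLURM_JOB_ID
FINAL=/group/dairy/Tuan/Recessive_lethal/Nextflow/results/SNP/Daisy

# run Clair3 with its output folder (and hence its intermediate files) on scratch
run_clair3.sh --bam_fn=sample.bam --ref_fn=reference.fasta --threads=24 \
    --platform="ont" --model_path=/path/to/ont_model \
    --output="$SCRATCH"/clair3_out --gvcf

# copy the results back to the shared drive only once Clair3 has finished
cp -r "$SCRATCH"/clair3_out "$FINAL"/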

Happy to clarify if any of the above doesn't make sense,

Tuan

@aquaskyline
Member

The intermediate files for GVCF output are not small, and their size depends on depth and sequencing quality, so hosting them in a larger space makes sense. Thanks, and keep us updated.
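If you want to see how much space they take on your data, checking the tmp folder shown in your earlier log before it is cleaned up should be enough:

# size of Clair3's intermediate files (removed automatically at the end of a run)
du -sh /group/dairy/Tuan/Recessive_lethal/Nextflow/results/SNP/Daisy/tmp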

@tuannguyen8390
Author

Hi @aquaskyline,

Here are some interesting stats I gathered from my last runs, where I gave Clair3 48 cores and 600.00 GB of RAM.

1st run - a normal run without --gvcf (please disregard the FAILED exit code; my cp command copying back from scratch to local storage was wrong - Clair3 itself finished fine in this mode). The stats look pretty nice:

Job ID: 12540740
Cluster: basc
User/Group: vicsuwd/vicsuwd_g
State: FAILED (exit code 1)
Nodes: 1
Cores per node: 48
CPU Utilized: 15-06:46:54
CPU Efficiency: 85.64% of 17-20:18:24 core-walltime
Job Wall-clock time: 08:55:23
Memory Utilized: 22.26 GB
Memory Efficiency: 3.71% of 600.00 GB

2nd run - a run with --gvcf; the memory utilized is a bit... erm...

Job ID: 12540735
Cluster: basc
User/Group: vicsuwd/vicsuwd_g
State: OUT_OF_MEMORY (exit code 0)
Nodes: 1
Cores per node: 48
CPU Utilized: 12-17:54:17
CPU Efficiency: 71.05% of 17-22:32:00 core-walltime
Job Wall-clock time: 08:58:10
Memory Utilized: 504.14 GB
Memory Efficiency: 84.02% of 600.00 GB

I chased up the log file (see below). Everything looks OK from steps 1 to 6, then the bus error message pops up, along with the oom-kill event(s).

[INFO] 7/7 Merge pileup VCF and full-alignment VCF

real	4m23.991s
user	0m0.256s
sys	0m0.638s
/home/vicsuwd/anaconda3/envs/clair3/bin/scripts/clair3.sh: line 287: 131277 Bus error               (core dumped) ${PYPY} ${CLAIR3} SortVcf --input_dir ${TMP_FILE_PATH}/merge_output --vcf_fn_prefix "merge" --output_fn ${OUTPUT_FOLDER}/merge_output.vcf --sampleName ${SAMPLE} --ref_fn ${REFERENCE_FILE_PATH} --contigs_fn ${TMP_FILE_PATH}/CONTIGS
cp: target ‘/group/dairy/Tuan/Recessive_lethal/Nextflow/results/SNP/Cow2181’ is not a directory
slurmstepd: error: Detected 5 oom-kill event(s) in step 12540735.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

My guess is that something blows up the memory in the parallel merging of the gVCF files. I'm trying two things at the moment, though I'm unsure if either will work to be honest...

1 - modifying --chunk_size=1000000. I personally don't think this will work, but my naive assumption is that a smaller chunk size means smaller gVCF files?
2 - lowering the CPU count while keeping the RAM (see the sketch below). At the moment I run with 48 CPUs & 600 GB, meaning each CPU has ~12.5 GB of RAM to work with in step 7. Slurm also has a flag called --mem-per-cpu that can allocate each CPU a fixed amount, say 32 GB of RAM, which might be useful.
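The job script I'm planning to test for option 2 looks roughly like this; the resource numbers and paths are just my guesses, not a recommendation:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --cpus-per-task=16     # fewer CPUs than the 48 used above
#SBATCH --mem-per-cpu=32G      # pin RAM per CPU instead of one flat allocation

# same Clair3 invocation as before, with --threads matched to the lower CPU count
run_clair3.sh --bam_fn=sample.bam --ref_fn=reference.fasta --threads=16 \
    --platform="ont" --model_path=/path/to/ont_model \
    --output=/scratch/$SLURM_JOB_ID/clair3_out --gvcf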

My BAM file is ~35 GB, by the way. If you think of any probable solution, please let me know and I will try it out on our system. We also have nodes with 1+ TB of RAM, but they are limited in number and I don't want to go that route, as it would take ages for jobs to be scheduled on them.

Many thanks,

Tuan Nguyen

@aquaskyline
Member

aquaskyline commented Apr 7, 2022

Clair3 is not supposed to use that much memory; either a bug in Clair3 or a glitch in the input is possible. Could you archive the log folder in your output folder and send the archive to my email, rbluo at cs dot hku dot hk? We will get back to you after some inspection. BTW, given the OOM kills, the bus error now makes sense.
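Something like the following should be enough to bundle it (the output path is taken from your earlier log):

# archive Clair3's log folder from the output directory for emailing
tar -czf clair3_logs.tar.gz \
    -C /group/dairy/Tuan/Recessive_lethal/Nextflow/results/SNP/Daisy log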

@tuannguyen8390
Author

tuannguyen8390 commented Apr 15, 2022

Addressed with r11-Minor 1

@santoshe1

Hello team, I am seeing a runtime increase of more than 50% when running a sample with gVCF mode turned on. I also don't see an official release of r11-minor. Could you please push this version to git?

@aquaskyline
Member

All installation options and the code already include the r11 minor fixes. And yes, 50% additional time for gVCF output is somewhat expected.
