Feature request cache = file_only #4791

Open
lindenb opened this issue Mar 3, 2024 · 2 comments

Comments

@lindenb
Contributor

lindenb commented Mar 3, 2024

New feature

Hi all, I'd like to see a new cache type where the presence of a file is the only criterion for the checksum calculation (ignoring the file size, the file timestamp, etc.). I understand such a cache must be used carefully.

Usage scenario

This kind of cache would be used in a linear (many steps, without branching) workflow where a file needs to be updated.

Example 1

For example, a pipeline creates a SQLite3 database, and each process then updates that database in place (without making a copy of the DB itself).

ch1 = CREATE_DATABASE()
ch2 = UPDATE_DATABASE_STEP1(ch1.db)
ch3 = UPDATE_DATABASE_STEP2(ch1.db, ch2.output)
ch4 = UPDATE_DATABASE_STEP3(ch1.db, ch3.output)
ch5 = DUMP_DB(ch1.db, ch4.output)

Example 2

It could be a way to implement a workflow that needs to delete intermediate files, until #452 is in production.

For example, a VCF is annotated in several steps. Each time a new annotation is added, the previous file in the workflow is deleted and re-created as a zero-size file using touch, so the storage required does not grow with each annotation step.

ch1 = CREATE_SUB_VCF(vcf, intervals)
ch2 = ADD_ANNOTATION1(ch1.vcf)
ch3 = ADD_ANNOTATION2(ch2.vcf)
ch4 = ADD_ANNOTATION3(ch3.vcf)
ch5 = ADD_ANNOTATION4(ch4.vcf)
etc...

with something like:

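process ADD_ANNOTATION2 {
    // "file_only" is the cache mode proposed in this issue; it does not exist yet
    cache "file_only"

    input:
    path(invcf)

    output:
    path("out.vcf"), emit: vcf

    script:
    """
    mytool ${invcf} > out.vcf

    # replace the consumed input with a zero-size placeholder to free disk space
    rm ${invcf.toRealPath()}
    touch ${invcf.toRealPath()}
    """
}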

Suggest implementation

well it must be somewhere in modules/nf-commons/src/main/nextflow/util/CacheHelper.java :-P
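
To make the request concrete, here is a minimal sketch of the idea in plain Java. It is not taken from CacheHelper and does not use any Nextflow API: the class `FileKeySketch` and the methods `metadataKey` and `fileOnlyKey` are hypothetical names, and SHA-256 is just a convenient stand-in for whatever hash the real code uses. It compares a key that mixes in file size and timestamp with a key that only looks at the path and the presence of the file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical illustration only; not Nextflow's CacheHelper. */
final class FileKeySketch {

    /** Key built from path, size and last-modified time (sensitive to metadata). */
    static String metadataKey(Path file) throws IOException {
        String material = file.toAbsolutePath() + "|" + Files.size(file)
                + "|" + Files.getLastModifiedTime(file).toMillis();
        return sha256(material);
    }

    /** Proposed "file_only" key: the path plus whether the file exists. */
    static String fileOnlyKey(Path file) {
        String material = file.toAbsolutePath() + "|" + Files.exists(file);
        return sha256(material);
    }

    /** Hex-encoded SHA-256 of a string. */
    private static String sha256(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(s.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a mandatory algorithm on every Java platform
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path vcf = Files.createTempFile("input", ".vcf");
        Files.write(vcf, "##fileformat=VCFv4.2\n".getBytes(StandardCharsets.UTF_8));
        String metaBefore = metadataKey(vcf);
        String onlyBefore = fileOnlyKey(vcf);

        // Equivalent of `rm` followed by `touch`: same path, zero size
        Files.delete(vcf);
        Files.createFile(vcf);

        System.out.println("metadata key unchanged:  " + metaBefore.equals(metadataKey(vcf)));
        System.out.println("file_only key unchanged: " + onlyBefore.equals(fileOnlyKey(vcf)));
        Files.delete(vcf);
    }
}
```

After the rm/touch step the metadata-based key changes (the size drops to zero), while the file_only key stays the same as long as the path still exists. That is the behaviour requested here, and also the reason such a cache must be used carefully.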

@robsyme
Collaborator

robsyme commented Mar 3, 2024

It seems to me that both examples would make resuming the workflow impossible, right? If a task failed and the pipeline had to be re-run, there would be no way of skipping the first n steps because the intermediate files would have been removed or overwritten. Is that correct?

@lindenb
Contributor Author

lindenb commented Mar 3, 2024

@robsyme

> It seems to me that both examples would make resuming the workflow impossible, right?

yes that's why I said "I understand such cache must be used carefully" :-P

if the workflow must be resumed, the first file in the workflow should be deleted.
