Automatically delete files marked as temp as soon as not needed anymore #452

Open
andreas-wilm opened this issue Sep 15, 2017 · 55 comments · Fixed by #2135 · May be fixed by #3849

Comments

@andreas-wilm
Contributor

andreas-wilm commented Sep 15, 2017

To reduce the footprint of larger workflows, it would be very useful if temporary files (which are marked as such) could be automatically deleted once they are no longer used. Yes, this breaks reruns, but for files that are easily recomputed, or very large ones (footprint), it makes sense. Using scratch (see #230) is not always possible or wanted (e.g. for very large files and small scratch space). It's also not always possible for a user to delete those files (except at the very end of the workflow), because multiple downstream processes running at different times might require them. This feature is implemented in Snakemake, for example, but maybe it's easily done there because the DAG is computed in advance?

Note that this is different from issue #165, where the goal was to remove non-declared files. That issue nevertheless contains a useful discussion of the topic.

Andreas

@pditommaso
Member

I'm adding this for reference. I agree that intermediate file handling needs to be improved, but it will require some internal refactoring. Need to investigate. cc @joshua-d-campbell

@pditommaso
Member

pditommaso commented Aug 28, 2018

I've brainstormed a bit more about this issue, and it should actually be possible to remove intermediate output files without compromising the resume feature.

First problem, the runtime-generated DAG: though the execution graph is only generated at runtime, it's generally fully resolved immediately after the workflow execution starts. Therefore it would be enough to defer output deletion until the full resolution of the execution DAG, i.e. just after the run invocation and before termination.

The second problem is how to identify tasks eligible for output removal. This could be done by intercepting a task's (successful) completion event: infer the upstream tasks in the DAG (easy), and if ALL dependent tasks have been successfully completed, clean up the upstream task's work directory (note that each task can have more than one downstream task). Finally, a task whose outputs have been removed must be marked with a special flag, e.g. cached=true, in the trace record.
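To make this concrete, here's a minimal, self-contained Groovy sketch of that eligibility check (the class and its fields are illustrative stand-ins, not actual Nextflow internals; "cleanup" is modelled as adding the producer to a set rather than deleting its work directory):

```groovy
// Toy model of the DAG bookkeeping needed for eager cleanup.
class MiniDag {
    Map<String, Set<String>> downstream = [:].withDefault { [] as Set }
    Map<String, Set<String>> upstream   = [:].withDefault { [] as Set }
    Set<String> completed = [] as Set
    Set<String> cleaned   = [] as Set

    void addEdge(String producer, String consumer) {
        downstream[producer] << consumer
        upstream[consumer] << producer
    }

    // Invoked on each successful task completion event.
    void onTaskComplete(String taskId) {
        completed << taskId
        upstream[taskId].each { producer ->
            // Clean up a producer only when ALL of its downstream
            // tasks have completed successfully.
            if (downstream[producer].every { completed.contains(it) }) {
                cleaned << producer
                // the real implementation would delete the work dir here
                // and record e.g. cached=true in the trace record
            }
        }
    }
}

def dag = new MiniDag()
dag.addEdge('align', 'sort')    // sort consumes align's output
dag.addEdge('align', 'stats')   // so does stats
dag.onTaskComplete('sort')
assert !dag.cleaned.contains('align')   // stats is still pending
dag.onTaskComplete('stats')
assert dag.cleaned.contains('align')    // all consumers done
```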

Third, the resume process needs to be re-implemented to take this logic into consideration. Currently, when the -resume flag is specified, the pipeline is just re-executed from the beginning, skipping the processes for which the output files already exist; all (dataflow) output channels are still created by binding the existing output files to those channels.

With the new approach this is no longer possible, because the files are deleted; execution therefore has to be skipped up to the first successfully executed task for which the (above) cached flag is not true. This means the output files of the last executed task can be picked up and re-injected into the dataflow network to restart it.

This may require introducing a new resume command (#544). It could also be used to implement a kind of dry-run feature, as suggested in #844. Finally, this could also solve #828.

@lucacozzuto

lucacozzuto commented Oct 2, 2018

My two cents: if you can use a flag for indexing the processes (e.g. the sample name), you can define a terminal process that, once completed, triggers deletion of the folders connected to that ID.
I'm imagining a situation like this:
[sampleID][PROCESS 1] = COMPLETED
[sampleID][PROCESS 2] = COMPLETED
[sampleID][PROCESS 3] = COMPLETED
[sampleID][PROCESS 4 / TERMINAL] = COMPLETED

remove the folders of PROCESS 1 / 2 / 3 / 4 for that [sampleID]

In case you need to resume the pipeline, these samples will be re-run if the data are still in the input folder.
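A rough DSL2 sketch of that idea, assuming each per-sample process also emits its task.workDir so the terminal step knows which folders to remove (the process name and channel wiring are purely illustrative):

```nextflow
// Hypothetical terminal step: it only runs once PROCESS 1-4 have
// completed for a given sampleID, because their outputs feed into it.
process CLEANUP_SAMPLE {
    input:
    tuple val(sampleID), val(work_dirs)   // work dirs collected upstream

    script:
    """
    # every per-sample process is done at this point, so this is safe
    for d in ${work_dirs.join(' ')}; do
        rm -rf "\$d"
    done
    """
}
```

In practice each upstream process would emit something like tuple val(sampleID), val(task.workDir), and those channels would be grouped by sampleID before feeding this process.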

@lucacozzuto

Quick comment: with this feature it would be possible to keep an instance of Nextflow running (with watchPath) without running into storage problems.

@PeteClapham

So at a high level, I think I'm missing something. If the state data remains in files, the removal of old items is a good thing to do, but will this increase filesystem IO contention and locking as we increase the scale of analysis?

@pditommaso
Member

Since each Nextflow task has its own work directory, and those directories would only be deleted when the data is no longer needed (read: accessed), I don't see why there should be IO contention on those files. Am I missing something?

@lucacozzuto

lucacozzuto commented Sep 20, 2019

I was thinking that a directive allowing the removal of input files when a process finishes would reduce the amount of space needed by a workflow.
This should make it possible to remove the whole folders containing the input files, so that we reduce the number of folders too.

Of course, this will not work if these files are needed by other processes.

Maybe with the new DSL2, where you have to make the graph explicit, this can be achieved. If the cleanup conflicts with a workflow or process, an error could be triggered.

@stale

stale bot commented Apr 27, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@fmorency

I like the ideas in this thread. Automatic removal of "intermediate process files" would be great.

@olavurmortensen

This feature would be a game changer. As an example, one of our pipelines processes ~10 GB of data but produces ~100 GB of temporary data, so the bottleneck is not CPU or memory but disk space. This severely limits the throughput of our lab and results in poor utilization of processing power.

@jvivian-atreca

jvivian-atreca commented Nov 23, 2020

I'm running into this with a pipeline that has similar characteristics to what @olavurmortensen is describing — the temporary files produced by one tool are very large, so while this workflow's output is maybe a couple hundred gigs, it will need something like 7,000+ GB of disk space during execution.

That said, is there any reason that temporary file cleanup isn't the purview of the process's script? There are several ways to delete anything that doesn't match a specific pattern in bash, thereby removing all temporary files except the known inputs/outputs.
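For example, a process could end its script by deleting everything in the task directory that is not a declared input or output (big_tool and the file names here are made up; hidden files are excluded so Nextflow's .command.* control files survive):

```nextflow
process BIG_TOOL {
    input:
    path reads

    output:
    path 'result.bam'

    script:
    """
    big_tool --in ${reads} --out result.bam
    # remove every non-hidden regular file that is not a declared
    # input or output; .command.* and .exitcode are left alone
    find . -maxdepth 1 -type f ! -name 'result.bam' ! -name '${reads}' ! -name '.*' -delete
    """
}
```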

@jfy133
Contributor

jfy133 commented Aug 18, 2022

> cleanup = true
>
> Hi, where do you put cleanup = true? In the config file? What branch? Thanks

https://www.nextflow.io/docs/latest/config.html?highlight=cleanup#miscellaneous
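For reference, a minimal nextflow.config showing where the setting lives (the workDir path is just a placeholder). Note that cleanup = true only deletes the work directory after a successful run; it doesn't remove files as soon as they stop being needed, which is what this issue is asking for:

```nextflow
// nextflow.config
workDir = '/scratch/my-run/work'   // placeholder path

// delete all files in the work directory on successful completion
cleanup = true

process {
    // named config scopes follow as usual
}
```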

@emosyne

emosyne commented Aug 18, 2022

thanks a lot.

@bobamess

I usually put
cleanup = true
near the top of the config file, after defining things like taskName, workspace, and workDir, and before any named blocks.
However, in my experience it only deletes the files in the subdirectories of workDir, and not the subdirectories themselves, which would be nice. Although the last time I checked this was with an older version of Nextflow.

@jgarces02

@bentsherman's solution seems very attractive (clean_work_files.sh). Is there any possibility of including it in upcoming versions? (Or how can I tweak sarek to include it?)

@bentsherman
Member

@jgarces02 for now you have to wire the dependencies yourself, see the GEMmaker pipeline script for an example. I'm currently trying to automate this behavior by specifying e.g. temporary: true on a path process output. I'm nearly at the point of understanding the codebase well enough to actually know how to do it. 😅
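For anyone who can't dig through GEMmaker itself, the manual wiring is roughly: each consumer of a file emits a "done" signal, those signals are combined with the file, and a cleanup process then empties the file while leaving the path in place for Nextflow's bookkeeping. A heavily simplified sketch with illustrative names (not GEMmaker's actual code):

```nextflow
process CLEAN_WORK_FILES {
    input:
    // this tuple only arrives once every consumer of the file has
    // finished, because their done signals are joined with it upstream
    tuple path(file_to_clean), val(done_signals)

    script:
    """
    # empty the real file behind the staged symlink instead of deleting
    # it, so the path still exists; pair this with a suitable cache
    # mode so -resume is not broken by the changed file
    truncate --size 0 \$(readlink -f ${file_to_clean})
    """
}
```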

@spficklin

Wonderful @bentsherman !

@hw538

hw538 commented Dec 10, 2022

any exciting news about this feature request? :)

@bentsherman
Member

This feature has been on the backburner this year due to other pressing efforts, but we're finally beginning to make some headway. I'm currently working on a PR (#3463) that will allow Nextflow to track the full task graph, which will comprise the "first half" of this feature (but it's also useful for other things like provenance).

The second half will be to use the task graph to figure out when an output file can be deleted, something like:

  1. process outputs can be marked as temporary: path(bam_file, temporary: true)
  2. given a temporary output file F, delete F when all consumers of F are complete
  3. on a resumed run, mark F as cached if all consumers of F are cached

Still kinda fuzzy about point (3), but I think there are a number of possible ways to do it.
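As a concrete illustration of point (1), the marking would look something like this (a design sketch only, not released Nextflow syntax; the process and tool names are made up):

```nextflow
process ALIGN {
    input:
    path reads

    output:
    // proposed: safe to delete once all consumers have completed
    path 'aligned.bam', temporary: true

    script:
    """
    aligner --in ${reads} --out aligned.bam
    """
}
```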

@spvensko

I am currently working on a blog post, hopefully published either later this week or early next week, that goes over examples of implementing GEMmaker's clean_work_files.sh strategy. The blog post covers syntactical considerations of the implementation and a few pitfalls I encountered. I realize issue #3463 and associated future work will hopefully make this issue obsolete, but I think it's worth having a tutorial to help those who want to implement a solution in the meantime.

We've implemented this in our rather large neoantigen workflow (LENS) and it appears it will save us tons of storage.

@bentsherman
Member

@spvensko that's great, I agree it would be good to have a general example for people to reference in the meantime. I'd like to have such an example for the Nextflow patterns website (or wherever that content ends up in the website revamp), but I never got around to writing it myself. Looking forward to your blog post.

@mribeirodantas
Member

Please share it when it's done, @spvensko 😄

@spvensko

Blog post is available now: https://pirl.unc.edu/blog/tricking-nextflows-caching-system-to-drastically-reduce-storage-usage

I'm going to be on PTO for the rest of the year, so hopefully there aren't any major issues with it. 😅

@bentsherman
Member

Folks, it's happening: #3818

Basically a minimal implementation of GEMmaker's "clean work files" approach directly into Nextflow. Several caveats and limitations to consider, but even this piece should be enough to make production pipelines much more storage efficient. Testing and feedback are appreciated! Feel free to message me on Slack if you don't want to clog up this issue.

@stevekm
Contributor

stevekm commented Feb 2, 2024

@bentsherman just wanted to follow up: is this feature 100% complete? I wasn't sure since this issue is still marked as Open. Thanks.

@bentsherman
Member

The automatic cleanup works but the resumability still has some issues. I had to focus on other things for a while but I have picked up this effort again, hope to finish the resumability in the next few months. See #3849 for updates.

If there are lots of people who don't care about the resumability piece, I could push to have the basic cleanup merged ASAP and complete the resumability in a separate effort. That would mean that for now, if you enable automatic cleanup and e.g. your pipeline fails half-way through due to some bug, you might not be able to resume because some task outputs will have been deleted.

cc @pditommaso @marcodelapierre for their thoughts

@ewels
Member

ewels commented Feb 2, 2024

I would be in favour of getting the automatic cleanup feature in ASAP, with or without resumability 👍🏻

For quite a few people this can make the difference between being able to run a pipeline at all or not, at which point being able to resume it is purely a nicety.

We should definitely aim to have the full cake, but getting in the basic cleanup quickly would be very nice.

@lescai

lescai commented Feb 3, 2024

I agree with @ewels.
When you analyse large datasets you might not have a choice, and ideally you'd be in production, where any source of failure would at least not be pipeline-related.

@lucacozzuto

I also agree. Resuming a pipeline is important, but in some contexts you cannot even run the pipeline for lack of space.

@pinin4fjords
Contributor

Agreed. Getting the footprint down to the point of feasibility, where it's currently lacking, is a worthwhile goal, even if it means sacrificing resumability in the short term.

@pditommaso
Member

Disagree. The resumability of pipelines is not a feature that can be compromised.

@spvensko

spvensko commented Feb 5, 2024

It's worth noting that Stephen Ficklin's (@spficklin) solution allows intermediate file deletion (either in line with the workflow or at the end, depending on how it's coded) and resumability. There are limitations and it's relatively tedious to implement manually, but it's completely possible for us to have our cake and enjoy a slice or two.

@bentsherman
Member

@spvensko agree, it's just a matter of whether the resumability should be tied to the core cleanup if it will take longer to implement.

Paolo and I have discussed. I have prepared a PR (#4713) with only the cleanup piece for him to compare.

@bentsherman
Member

Folks, since the automatic cleanup is not going to make it into core Nextflow until the resumability is implemented, I found a way to provide the basic cleanup functionality in a plugin:

https://github.com/bentsherman/nf-boost

The README has everything you need to use it. I will also publish it to the plugins index soon.
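In short, enabling it should just be a couple of config lines like these (defer to the README for the authoritative syntax):

```nextflow
// nextflow.config
plugins {
    id 'nf-boost'
}

boost {
    cleanup = true
}
```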

Please feel free to use it, just keep in mind that resume isn't supported yet and the cleanup itself is experimental. I haven't tested it on very large pipelines; I believe it is robust, but it might still need some performance tuning.

I would love to get some testing feedback from anyone who is interested. If you run into any problems, you can submit an issue on the nf-boost repo and I'll work with you to resolve it. Any fixes / improvements we make over there will make it into the final implementation here.

I'll keep working to get resume to work correctly, but since I don't know how long it will be until it's merged into Nextflow, I wanted to give you guys a stopgap solution based on what I have so far.

Happy cleanup! 🧹
