Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevents files from being staged twice #43

Merged
merged 7 commits into from
Jul 25, 2023

Conversation

BWMac
Copy link
Contributor

@BWMac BWMac commented Jul 20, 2023

This PR implements a fix for the bug first reported by @adamjtaylor regarding a name collision error in Nextflow Tower runs of nf-dcqc. The error occurred because files were being staged twice in the Nextflow process work directory. I identified the error as having to do with one failed Batch job attempt leading to a successful re-try which was presumable happening after the file was initially staged. So, after the second attempt succeeded there were two copies of the target file in the work directory for the Nextflow process.

In order to prevent this from happening in this and other edge cases, I have added the file.already_staged method, which looks in the temporary directory to see if the file has already been staged in a dcqc-staged-* pattern-matched folder. If it has already been staged more than once somehow, it throws an error ahead of the inevitable Nextflow name collision error. If the file has been staged 1 or 0 times, already_staged returns a list of file paths.

In file.stage, if this list of file paths is empty the file has yet to be staged and the function continues as it has, staging the file. If it has already been staged, the Path to the previously staged file is returned.

This change has been tested in local and tower runs of nf-dqc. Unfortunately, because of the inconsistent and non-reproducible nature of the batch error which prompted this work it is difficult to be certain if it will prevent future issues.

@swarmia
Copy link

swarmia bot commented Jul 20, 2023

@BWMac BWMac changed the title adds double dile staging prevention Prevents files from being staged twice Jul 20, 2023
@codecov
Copy link

codecov bot commented Jul 21, 2023

Codecov Report

Merging #43 (6b852fa) into main (778f228) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##              main       #43   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           23        23           
  Lines         1150      1164   +14     
  Branches       189       192    +3     
=========================================
+ Hits          1150      1164   +14     
Files Changed Coverage Δ
src/dcqc/file.py 100.00% <100.00%> (ø)

@BWMac BWMac marked this pull request as ready for review July 21, 2023 20:07
@BWMac BWMac requested a review from thomasyu888 July 21, 2023 20:08
src/dcqc/file.py Show resolved Hide resolved
src/dcqc/file.py Show resolved Hide resolved
Copy link
Contributor

@thomasyu888 thomasyu888 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥 I'm going to pre-approve, this looks great. Just some comments.

@@ -62,6 +64,27 @@ def test_that_a_local_temporary_path_is_created_when_staging_a_remote_file(test_
assert remote_file.local_path == staged_path


def test_that_error_is_raised_when_a_file_has_been_staged_multiple_times(test_files):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we need a test for when the file already exists?

Copy link
Contributor Author

@BWMac BWMac Jul 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and I also realized that now that we are preventing files from being staged more than once, that outcome is occurring after the first successful execution of stage in a test for a remote file (or when I create fake staged files in a test. I.e. we are not testing what we used to be testing after that first test. I will add a helper function to the tests that clears out the dcqc-staged-* directories to be used where necessary.

src/dcqc/file.py Show resolved Hide resolved
@BWMac BWMac merged commit 088fac5 into main Jul 25, 2023
12 checks passed
@BWMac BWMac deleted the bwmac/orca-258/fix_double_file_staging branch July 25, 2023 16:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants