
V2.2.0 convenient downloadscripts #446

Open · wants to merge 17 commits into main

Conversation

cmeesters

Hi,

I tried to add a few features to the download scripts, as a remedy for some potentially annoying issues that cause support tickets, and to speed up the download and decompression steps.

Specifically:

  • when using rsync on a multi-user system (e.g. an HPC cluster), some sites do not allow direct internet access from head nodes but require a proxy instead. In this case, rsync runs into a timeout. With this pull request, that case is caught and a hint about setting the RSYNC_PROXY variable is printed; see the first sketch after this list. (might save you some issue tickets)
  • as downloads might take quite some time, errors cannot be ruled out, and users are then forced to make a new attempt. With this pull request, users are asked whether they want to proceed; if yes, the script removes the ROOT_DIR first, otherwise it cowardly refuses to proceed. Why? Because a download script triggered in error would otherwise operate on the existing data again; see the second sketch after this list. (might yield some less-annoyed users)
  • some files are rather large, so when pigz is found in the PATH, decompression with pigz is attempted. The parallelism is NOT in the decompression itself; however, since the file handling is separated from the decompression step, a minor speed-up can be achieved.
  • in particular, the uncompress step of the mmCIF download script has a line like find "${RAW_DIR}/" -type f -iname "*.gz" -exec gunzip {} +, which takes ages to complete. Switching to find "${RAW_DIR}/" -type f -iname "*.gz" -print0 | xargs -0 -P2 "${uncompress_cmd}" yields a speed-up of roughly a factor of 2; see the last sketch after this list. The hardcoded -P2 is a bit unfortunate, yet I do not know whether it makes sense to figure out what parallelism is available to the user (e.g. reading the number of processors and the cgroup limits and taking the minimum), because much will depend on the file system and the load it is currently under.
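For the first point, a minimal sketch of how the timeout hint could look; the rsync options and the SOURCE_URL/RAW_DIR variable names are illustrative, not necessarily the ones used in the scripts:

```bash
# Illustrative sketch, not the exact code in this PR: run rsync with a
# connection timeout and print a hint about RSYNC_PROXY when it times out.
rsync --recursive --links --contimeout=60 "${SOURCE_URL}" "${RAW_DIR}"
status=$?
if [[ ${status} -eq 30 || ${status} -eq 35 ]]; then   # rsync timeout exit codes
  echo "rsync timed out. If your site only allows outbound connections" >&2
  echo "through a proxy, set the RSYNC_PROXY environment variable, e.g.:" >&2
  echo "  export RSYNC_PROXY=proxy.example.org:3128" >&2
fi
[[ ${status} -eq 0 ]] || exit "${status}"
```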
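For the second point, a sketch of the confirmation prompt; ROOT_DIR stands for whichever target directory the respective script uses:

```bash
# Illustrative sketch: ask before deleting a previous, possibly partial download.
if [[ -d "${ROOT_DIR}" ]]; then
  read -r -p "${ROOT_DIR} already exists. Remove it and download again? [y/N] " answer
  case "${answer}" in
    [Yy]*) rm -rf "${ROOT_DIR}" ;;
    *)     echo "Cowardly refusing to touch ${ROOT_DIR} - exiting." >&2; exit 1 ;;
  esac
fi
mkdir --parents "${ROOT_DIR}"
```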
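For the last two points, a sketch of the pigz detection and the parallel decompression step; uncompress_cmd corresponds to the variable mentioned above, the rest is illustrative:

```bash
# Illustrative sketch: use unpigz when available, gunzip otherwise, and let
# xargs run two decompression processes in parallel on the .gz files.
if command -v unpigz > /dev/null 2>&1 ; then
  uncompress_cmd="unpigz"
else
  uncompress_cmd="gunzip"
fi

# -print0 / -0 keeps unusual file names safe; -P2 matches the hardcoded
# parallelism discussed above.
find "${RAW_DIR}/" -type f -iname "*.gz" -print0 \
  | xargs -0 -P2 "${uncompress_cmd}"
```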

Your comments are most appreciated. I hope you find my contribution worth considering.

Best regards
Christian Meesters

  • …irectory - allowing for faster decompression
  • possible to use unpigz
  • fixed error, due to pushing into the root dir and then trying to 'mv /<file> ...
  • …error when pushing to and trying to gunzip /
  • …tion to parallel uncompress (at least a bit)
cmeesters (Author) commented Apr 27, 2022

Regarding the last two commits: a user report made me notice that only the executing user had read permissions on the downloaded files. This, of course, needed to be fixed for a multi-user system.

Essentially, I did a chmod 444 - this, however, is questionable. If the database is to be versioned, which could be accomplished with a grep on a versioned setup script (see #447 - a git describe --tags --abbrev=0 is not possible for non-git directories such as extracted release downloads), then read-only files should be ensured. Otherwise, the files need to be user-writable, too.
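For comparison, a sketch of the world-readable but owner-writable alternative to chmod 444; ROOT_DIR is again the download target, and this is not part of the current commits:

```bash
# Illustrative sketch: data readable by every user, writable only by the owner.
# Directories need the execute bit so other users can traverse them.
find "${ROOT_DIR}" -type d -exec chmod 755 {} +
find "${ROOT_DIR}" -type f -exec chmod 644 {} +
```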

Opinions?

cmeesters (Author) commented:

Hi,

I hoped to at least spark a bit of a discussion, as the mentioned issues still persist on multi-user systems. Whether the work with pigz is appreciated is admittedly perhaps not worth a discussion. Yet downloading the database and getting the user permissions right should not be an issue, right?
