Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

S3 with Jzarr #89

Merged
merged 13 commits into from
Jun 29, 2021
Merged

S3 with Jzarr #89

merged 13 commits into from
Jun 29, 2021

Conversation

joshmoore
Copy link
Contributor

@joshmoore joshmoore commented Mar 17, 2021

Built on top of #80 following suggestions from bcdev/jzarr#16

Example:

$ gradle build installDist
$ build/install/bioformats2raw/bin/bioformats2raw --output-options s3fs_path_style_access=true  a.fake s3://${USER}:${PASS}@s3.embassy.ebi.ac.uk/idr-upload/josh/bfpr89/4

note: there are a number of other forks of the original s3fs library including https://github.com/nextflow-io/nextflow-s3fs

Adds a `--remote` option which if present will trigger
the creation of an alternative `FileSystem` based on
the URI prefix.

see: https://jzarr.readthedocs.io/en/latest/amazonS3.html
@joshmoore
Copy link
Contributor Author

Note: looks like the 404 is a minio/permissions issue. The nginx logs show an attempt to access / when this user is limited to /idr-upload/josh. Likely something that needs to be dealt with in a lower library.

Taking minio out of the middle, bioformats2raw runs to completion but prints:

2021-03-17 13:29:22,365 [pool-1-thread-2] WARN  c.a.s.s.i.S3AbortableInputStream - Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.

@joshmoore joshmoore mentioned this pull request Mar 17, 2021
@joshmoore
Copy link
Contributor Author

While we wait for #80 to go in: a few questions:

  • would another CLI argument be preferable? --endpoint etc.
  • would reading these options from a file help? (to keep passwords safe, take environment options for the FS, etc.)
  • do we want to start up minio in this repo for tests?
  • and/or go whole hog and add in an interface to wrap some of the logic?

@joshmoore
Copy link
Contributor Author

Green after merging in origin/master.

@joshmoore
Copy link
Contributor Author

Merged in the mainline to solve the conflict. @chris-allan, can you see getting this into 0.3.0 or would you rather see this externally in a wrapping layer?

@chris-allan
Copy link
Member

Merged in the mainline to solve the conflict. @chris-allan, can you see getting this into 0.3.0 or would you rather see this externally in a wrapping layer?

Definitely not a priority for us for 0.3.0.

would another CLI argument be preferable? --endpoint etc.

Personally, I like the style of just using a URI as the target and not having --remote or similar command line option. If we want to also be pedantic and add file:///... that works for me too.

would reading these options from a file help? (to keep passwords safe, take environment options for the FS, etc.)

The use of the AWS_* environment variables and .aws environment used by aws-cli is pretty ubiquitous now. Seems like the most sane option. URL encoding and parsing keys is probably a bad idea.

@joshmoore
Copy link
Contributor Author

Pushed a commit to drop the --remote option. Usage is now:

build/install/bioformats2raw/bin/bioformats2raw a.fake --output-options s3fs_path_style_access=true s3://$USER:$PASS@$HOST/$BUCKET/$REST

I haven't tested the AWS_* properties and support ingnon-path-style access will require more work. Trying to test against minio, I needed to use --output-options s3fs_path_style_access=true,s3fs_protocol=HTTP, but then I ran into a NPE deep within the S3FileSystem implementation. I started looking into new implementations which led me to nextflow-io/nextflow-s3fs#21

@joshmoore
Copy link
Contributor Author

@chris-allan : I think this has been properly sanitized:

@chris-allan
Copy link
Member

Remove dependency on the S3 filesystem?

I'm okay with including it. There's probably a substantial argument for bioformats2raw being "batteries included."

Any suggestions for https://github.com/glencoesoftware/bioformats2raw/pull/89/files#diff-156407914fc2609796db393053797aa5ce0446652d302b8bb7a6885b6adfa642R1550? (looks to be working)

For the use of Path? setId() only takes a string at present unfortunately. We'd probably have to look to a completely different strategy if we were going to maintain that functionality for S3. Since that method is legacy and I don't think used by anything anymore it is probably safe to just remove it entirely in this PR.

@chris-allan
Copy link
Member

Needs to have master merged in again and conflicts resolved. Anything else you want to discuss here @joshmoore?

@joshmoore
Copy link
Contributor Author

Conflict resolved and saveExtraImage method removed. Nothing else from my side if this stays green.

@chris-allan
Copy link
Member

👍

Just needs the style fixes and we'll be green and ready to merge I think.

@joshmoore
Copy link
Contributor Author

whew

@carlspring
Copy link

HI there,

A bit late to the party, but if you're interested in an S3 implementation for Java, we've made a spin-off of the Upplication/Amazon-S3-FileSystem-NIO2 on which nextflow-s3fs was based and have released it to Maven Central. We've called it s3fs-nio, cut a 1.0.0 release and it's available via Maven Central (here).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants