Setting up Google Cloud Storage

Google Cloud Storage (GCS) is a popular object storage backend that is supported by CloudVolume. Follow these step-by-step directions to get started using CloudVolume and Neuroglancer with GCS. The following instructions assume a Unix- or Linux-based operating system.

CloudVolume also supports AWS S3, the BOSS, and your local file system. Instructions for these systems are not covered in this article.

Setting up a CloudVolume Bucket

In order to use CloudVolume with GCS, you'll need to create a bucket, a namespace within Google's system that holds your files. Then you'll make a service account with read/write permissions to the bucket so that CloudVolume can access it programmatically. You can also access the bucket with the gsutil command line tool and through the web interface.

1. Creating Your Account

If you don't already have a GCS account, you'll need to set one up and link a credit or debit card to it. Be advised that as of this writing (Sept. 2018), the default storage cost is $0.026/GiB/mo, or in connectomics terms, about $312 per TiB per year. There are cheaper options, Nearline and Coldline, but they have strings attached. Transferring data out of GCS (termed "egress") to your local system or another provider costs between $0.08 and $0.12 per GB, so you may wish to consider the vendor lock-in implications. AWS S3 has a similar cost structure, though its actual prices may differ significantly.

Once you've decided to move forward, follow the steps here.

2. Create a GCS Bucket

Follow the instructions here. Bucket names are globally (as in worldwide on Earth) unique, so you'll probably want to pick something like "$MYLAB" or "$MYDATASET".

3. Create Service Account Keys

Only authorized users can write to your GCS bucket. In order for CloudVolume to read from or write to your bucket, you'll need to create a Service Account. There are many different authorization schemes in use in the world, but the Service Account is a simple one: you create a programmatic user with a secret token that can be revoked. You give the token to CloudVolume, and it will act as that special user.

  1. Go to this page (you may have to select your project on the top-left to see anything).
  2. Click "Create Service Account" (near the top).
  3. Name the service account and click "Create".
  4. On the next page, you'll be able to grant it roles. Give it "Storage Object Admin", "Storage Object Creator", and "Storage Object Viewer" and click "Continue".
  5. On the next page, grant access to the service account to any users who will be using CloudVolume.
  6. On the same page, click "Create Key" and select the JSON option and download this private key to your local machine. This key grants access to your bucket! Protect it!

The key looks like this:

{
  "type": "service_account",
  "project_id": "$PROJECT_ID",
  "private_key_id": "$DIGITS",
  "private_key": "-----BEGIN PRIVATE KEY-----\n$LOTSOFRANDOMCHARACTERS----END PRIVATE KEY-----\n",
  "client_email": "$SERVICEACCOUNTNAME@$PROJECTNAME.iam.gserviceaccount.com",
  "client_id": "$DIGITS",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/$SERVICEACCOUNTNAME%$PROJECTNAME.iam.gserviceaccount.com"
}

4. Grant Service Account Permissions

  1. Go to the page that displays the buckets in your account.
  2. On the far right-hand side of your bucket's row is an options icon (three vertical dots). Click it and select "Change Bucket Permissions".
  3. Under the "Add Principal" section, type in the name of the service account you created in section 3.
  4. Under "Select a role", go to "Storage Legacy" and add "Storage Legacy Object Owner" and "Storage Legacy Object Reader"

5. Configure CloudVolume with Secrets

You'll need the service account secret you downloaded in section 3.

  1. Open your terminal and cd to your home directory.
  2. Create the CloudVolume secrets directory: mkdir -p ~/.cloudvolume/secrets/
  3. Move the key into place: mv $NAME_OF_SERVICE_ACCOUNT_SECRET ~/.cloudvolume/secrets/google-secret.json

google-secret.json will be the default secret used when accessing all GCS buckets. If you are collaborating with another laboratory on a particular bucket, you can name it $BUCKETNAME-google-secret.json and that service account secret will be used for that bucket only.
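
Once the secret is in place, CloudVolume will pick it up automatically when it sees a gs:// path. As a quick sanity check of the credentials, you can point CloudVolume at a layer that already exists in your bucket and read a small cutout. This is a minimal sketch; the bucket, dataset, and layer names are placeholders for your own:

from cloudvolume import CloudVolume

# Placeholder path; point it at a layer that already exists in your bucket.
# CloudVolume reads ~/.cloudvolume/secrets/google-secret.json automatically.
vol = CloudVolume('gs://$MYBUCKET/$DATASET/$LAYER')

print(vol.info)                  # the layer's metadata (its "info" file)
cutout = vol[0:64, 0:64, 0:16]   # downloads a small cutout as a numpy array
print(cutout.shape)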

Configuring the Bucket for Neuroglancer

Most people using CloudVolume do so in part because it is compatible with Neuroglancer, a browser-based 3D image visualization app.

1. Install gsutil

gsutil is a helpful command-line tool that lets you interact with your bucket. You can use it to list, copy, and otherwise manipulate your files. You'll also need it to set the CORS headers on your bucket.

Follow the installation instructions here.

Once installed, you'll be able to interact with your bucket with commands like:

gsutil -m ls gs://bucketname # -m means use multiple threads and processes
gsutil cp file.json gs://bucketname/

2. Set CORS Headers

Neuroglancer is a web app, which means it is bound by browser security restrictions. One of these restrictions is called the "same origin policy", which causes browsers to deny requests for resources on an external domain unless that domain responds with a special header, "Access-Control-Allow-Origin: *" (the * means all other domains). This policy is intended to guard against cookie theft, e.g. via hidden iframes.

However, we have no such problem here; we want Neuroglancer to read from our buckets, which are located on another domain.

On your desktop, create a file named cors.json with the following contents. It means that any website can pull public data from your bucket and display it; notably, non-public data still won't be able to be pulled. If this policy isn't set, Neuroglancer won't be able to display your data, but programs like curl will, as they are not subject to the "same origin policy". To support a next-generation file format called "sharded", we also include the "Range" response header.

[
    {
      "origin": ["*"],
      "responseHeader": ["Content-Type", "Range"],
      "method": ["GET", "HEAD"],
      "maxAgeSeconds": 3600
    }
]

Then run the following command in your terminal:

gsutil cors set ~/Desktop/cors.json gs://$BUCKETNAME

To visualize the data using Neuroglancer, you also need to make the bucket publicly readable:

gsutil iam ch allUsers:objectViewer gs://your-bucket
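
If you want to double-check that the CORS policy and public access took effect, you can simulate a cross-origin request yourself. This is a rough sketch using the Python requests library; the object URL is a placeholder pointing at any file that exists in your bucket (for example, a layer's info file):

import requests

# Placeholder URL; substitute your bucket and an object that exists in it.
url = 'https://storage.googleapis.com/$MYBUCKET/$DATASET/$LAYER/info'

# Send the Origin header the way a browser would for a cross-origin request.
resp = requests.get(url, headers={'Origin': 'https://example.org'})
print(resp.status_code)                                  # 200 if the object is publicly readable
print(resp.headers.get('Access-Control-Allow-Origin'))   # '*' if the CORS policy applied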

You can also use this sample script that automates some of the steps.

./setup_bucket.sh gs://bucket_name
#!/bin/bash

BUCKET=$1
CORS=/tmp/cors-$(date +"%s").json
USER=my-service-account@my-project.iam.gserviceaccount.com

echo "Setting up $BUCKET"

# Write the CORS policy to a temporary file.
echo "[
    {
      \"origin\": [\"*\"],
      \"responseHeader\": [\"Content-Type\", \"Range\"],
      \"method\": [\"GET\", \"HEAD\"],
      \"maxAgeSeconds\": 3600
    }
]" > $CORS

# Apply the CORS policy, make the bucket publicly readable,
# and grant the service account read/write access to its objects.
gsutil cors set $CORS $BUCKET
gsutil iam ch allUsers:objectViewer $BUCKET
gsutil iam ch serviceAccount:$USER:objectAdmin serviceAccount:$USER:legacyBucketReader $BUCKET

rm $CORS

Ingesting Data

Now you're ready to ingest some data! The CloudVolume library, which this wiki documents, can do that for you. It accepts numpy arrays and uploads them to the cloud. Follow the README.md on the front page to learn how.

We also have a sample upload script on this wiki that you can riff off of.
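
As a rough illustration of what an ingest looks like, here is a minimal sketch modeled on the README flow; the cloud path, data type, resolution, and volume size below are placeholder values you would replace with your own:

import numpy as np
from cloudvolume import CloudVolume

# Placeholder path; replace with your own bucket and dataset names.
cloudpath = 'gs://$MYBUCKET/$DATASET/image'

# Describe the new layer. These values are illustrative only.
info = CloudVolume.create_new_info(
    num_channels=1,
    layer_type='image',          # 'image' or 'segmentation'
    data_type='uint8',
    encoding='raw',
    resolution=[4, 4, 40],       # nanometers per voxel
    voxel_offset=[0, 0, 0],
    volume_size=[1024, 1024, 128],
    chunk_size=[64, 64, 64],
)

vol = CloudVolume(cloudpath, info=info)
vol.commit_info()  # upload the info file to the bucket

# Upload a numpy array covering the whole (small) volume.
data = np.zeros((1024, 1024, 128), dtype=np.uint8)  # stand-in for real image data
vol[:, :, :] = data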