Getting Started

Using Datalab in Google Cloud

The easiest way to use Google Cloud Datalab is on Google Cloud Platform: head over to the Datalab site and deploy an instance into a Cloud Project. Note, however, that this approach is being deprecated in favor of running Datalab locally or on your own GCE instance, as described below.

Using Datalab locally

Datalab is built and packaged as a Docker container, so you will need Docker configured and running locally. If you're on Mac or Windows, the easiest way to get Docker is the Docker Toolbox. Download and install it, then open the Kitematic app it installs; Kitematic will create and start a 'default' VM and start the Docker server.

Add a port mapping to the 'default' VM so you can reach Datalab via localhost:

VBoxManage modifyvm default --natpf1 "datalab,tcp,,8081,,8081"
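
To confirm the forwarding rule took effect, you can list the VM's NAT rules (this assumes the VM is named 'default', as above):

VBoxManage showvminfo default --machinereadable | grep Forwarding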

If you want to build from source (e.g. to customize the packages in the container, or to get the most up-to-date version), clone the Datalab repo, then build and run it:

git clone https://github.com/GoogleCloudPlatform/datalab.git
cd datalab
source ./tools/initenv.sh
rm -rf build/
cd sources/
./build.sh
cd ../containers/datalab
./build.sh
# Replace the MyProjectID value in the next line with your project ID
PROJECT_ID=MyProjectID ./run.sh

Note: if you want to customize the container, the place to start is containers/datalab/Dockerfile.in.
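
As a minimal sketch, assuming you want an extra Python package baked into the image, you might append a line like the following to Dockerfile.in before rebuilding (the package is just an illustration):

# Hypothetical example: install an extra Python package in the image
RUN pip install seaborn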

Alternatively, you can use a prebuilt version of the container, although you may not get the latest changes. On Linux or a Mac, use:

CONTENT=~  # Set this to the place you will keep your notebooks
ID='MyProjectID'  # Set this to an appropriate project ID.

eval $(docker-machine env default)  # This sets the Docker environment variables
docker run -it -p 8081:8080 -v "$CONTENT:/content" -e "PROJECT_ID=$ID" gcr.io/cloud-datalab/datalab:local
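
A quick way to verify the container came up, assuming the port mapping above, is to check Docker and probe the mapped port:

docker ps  # The datalab container should be listed as running
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8081  # 000 means not reachable; any HTTP status means Datalab is serving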

On Windows in a command shell window (cmd.exe, not PowerShell) use:

REM Set the Docker environment variables
FOR /f "tokens=*" %%i IN ('docker-machine.exe env --shell cmd default') DO %%i
SET _home=/c%HOMEPATH:\=/%
REM Set the directory below to where you keep your notebooks
SET _content=%_home%/Documents
REM Replace the MyProjectID in the line below with your project ID
SET _id=MyProjectID
docker run -it -p 8081:8080 -v "%_content%:/content" -e "PROJECT_ID=%_id%" gcr.io/cloud-datalab/datalab:local

You can save the above to a text file with a .cmd extension to make a script. Don't forget to run the VirtualBox port-mapping command above on Windows too; without it you won't be able to use localhost:8081 and will have to use the dynamic IP that Docker allocates.

Then open your browser to http://localhost:8081.

Note that to use any Google Cloud functionality you will need to set the project ID. You can do this through the -e argument to docker run as shown above, or by calling:

set_project_id('myproject')

within a cell in your notebook.

The first time you open Datalab you will have to accept the EULA. You may also get prompted to authenticate with your Google account. Once you have done this you should not need to do it again, unless you change the volume where your notebooks are stored; your access tokens will be stored in a file in that volume (as opposed to the usual ~/.config/gcloud location).

If you want to use the same token file that gcloud creates (and thus avoid having to authenticate in Datalab at all, provided you have done a gcloud auth login on your local machine), you can point the GOOGLE_APPLICATION_CREDENTIALS environment variable at the credentials file, as long as that file is inside the mapped-in volume. For example, on Mac/Linux, with CONTENT=~ as above so that your home directory is mounted at /content, you would change the docker run command to:

docker run -it -p 8081:8080 -v "$CONTENT:/content" -e "PROJECT_ID=$ID" -e "GOOGLE_APPLICATION_CREDENTIALS=/content/.config/gcloud/credentials" gcr.io/cloud-datalab/datalab:local

On Windows this is a bit trickier. You need to map your home directory in as the content volume, not your Documents folder, because the gcloud credentials file is not stored under Documents. In this case, since your home directory is mounted at /content, you could use:

docker run -it -p 8081:8080 -v "%_home%:/content" -e "PROJECT_ID=%_id%" -e "GOOGLE_APPLICATION_CREDENTIALS=/content/AppData/Roaming/gcloud/credentials" gcr.io/cloud-datalab/datalab:local

Using Datalab on GCE

First, build the container and push it to a container registry (see the previous section for building; for an example of how to push the container, look at the stage.sh script rather than run.sh).
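
As a rough sketch, assuming the local image is tagged datalab, PROJECT_ID is set as in the next step, and you are pushing to Google Container Registry (stage.sh is the authoritative example):

docker tag datalab gcr.io/$PROJECT_ID/datalab
gcloud docker push gcr.io/$PROJECT_ID/datalab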

Make sure to set the project to the one you want to use for Datalab:

PROJECT_ID=my-datalab-project  # Set an appropriate ID here
gcloud config set project $PROJECT_ID

You need to create a persistent disk for storing your notebooks. You can do that with:

ZONE=us-central1-a  # pick a zone here
SIZE=500GB  # Choose an appropriate size
DISKNAME=my-datalab-disk  # Choose an appropriate name

gcloud compute disks create --size=$SIZE --zone=$ZONE $DISKNAME
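
You can confirm the disk was created with:

gcloud compute disks describe $DISKNAME --zone $ZONE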

Then create a containers.yaml file. In the example below, replace IMAGE with the path to the container on GCR, Docker Hub or wherever you publish your container, and replace DISKNAME with the disk name from above.

apiVersion: v1
kind: Pod
metadata:
  name: datalab
spec:
  containers:
    - name: datalab
      image: IMAGE
      command: ['/datalab/run.sh']
      imagePullPolicy: IfNotPresent
      ports:
        - containerPort: 8080
          hostPort: 8080
      env:
      - name: DATALAB_ENV
        value: GCE
      volumeMounts:
        - mountPath: /content
          name: datalab-volume
  volumes:
  - name: datalab-volume
    gcePersistentDisk:
      pdName: DISKNAME
      fsType: ext4

You can then use the YAML file to create a running compute instance on GCE:

LABEL=my-datalab  # pick a name here
ZONE=us-central1-a  # pick a zone here
MACHINE_TYPE=n1-highmem-2  # pick a machine type here; we recommend decent memory but not many cores

gcloud compute instances create $LABEL --image container-vm --metadata-from-file google-container-manifest=containers.yaml --zone $ZONE --machine-type $MACHINE_TYPE --scopes cloud-platform

Expose the port through SSH tunneling:

gcloud compute ssh --zone $ZONE --ssh-flag="-L" --ssh-flag="8082:localhost:8080" $LABEL

This will open a tunnel from localhost:8082 to the GCE instance. The tunnel will stay open until you close the SSH shell. If you have just created the instance, it has to do a docker pull to fetch the container, which can take about four minutes on first run. You can check whether Datalab is ready by running:

sudo docker ps

in the SSH shell. You should see one or two containers running; once there are two, Datalab is ready. You can now connect to http://localhost:8082 on your local machine to start using Datalab.
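
Alternatively, you can probe the port from within the SSH shell; this assumes Datalab answers on port 8080 once it is up:

curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080  # 000 means not ready yet; any HTTP status means Datalab is serving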

Note that on first execution you will (i) have to click through a EULA page, and (ii) may have to enable the Resource Manager API for your project if it is not yet enabled; this is needed so we can get the project ID for the service account (and enable the new project-listing APIs that are used for that).

When you no longer need the GCE instance, you can delete it:

gcloud compute instances delete $LABEL --zone $ZONE

The persistent disk will still have your notebooks, and you can create a new compute instance later and reuse the disk. If you want to delete the disk, including your notebooks, use:

gcloud compute disks delete $DISKNAME --zone $ZONE

You can create periodic snapshots of the disk for backup purposes; see the Compute Engine documentation on disk snapshots for details.
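
For example, a one-off snapshot might look like this (the snapshot name is just an illustration):

gcloud compute disks snapshot $DISKNAME --zone $ZONE --snapshot-names datalab-backup-$(date +%Y%m%d)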

Using MathJax

The MathJax package is not loaded by default in Datalab, due to its size. If you need to use it, you can load it with:

%extension mathjax
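
Once the extension is loaded, LaTeX in markdown cells should render. For example, putting the following in a markdown cell displays Euler's identity typeset:

$$ e^{i\pi} + 1 = 0 $$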