Commit
Merge remote-tracking branch 'origin/main' into local-telemetry-agent
jlewitt1 committed Sep 5, 2024
2 parents 9a53277 + 70c7c87 commit 2f52ff7
Showing 65 changed files with 1,072 additions and 4,678 deletions.
1 change: 0 additions & 1 deletion MANIFEST.in
@@ -1,2 +1 @@
include runhouse/builtins/*
include runhouse/resources/hardware/sagemaker/*
1 change: 0 additions & 1 deletion README.md
@@ -110,7 +110,6 @@ Please reach out (first name at run.house) if you don't see your favorite comput
- Amazon Web Services (AWS)
- EC2 - **Supported**
- EKS - **Supported**
- SageMaker - **Supported**
- Lambda - **Alpha**
- Google Cloud Platform (GCP)
- GCE - **Supported**
147 changes: 3 additions & 144 deletions docs/api/python/cluster.rst
@@ -1,9 +1,9 @@
Cluster
=======
A Cluster is a Runhouse primitive used for abstracting a particular hardware configuration.
This can be either an :ref:`on-demand cluster <OnDemandCluster Class>` (requires valid cloud credentials), a
:ref:`BYO (bring-your-own) cluster <Cluster Class>` (requires IP address and ssh creds), or a
:ref:`SageMaker cluster <SageMakerCluster Class>` (requires an ARN role).
This can be either an :ref:`on-demand cluster <OnDemandCluster Class>` (requires valid cloud credentials or a
local Kube config if launching on Kubernetes), or a
:ref:`BYO (bring-your-own) cluster <Cluster Class>` (requires IP address and ssh creds).

A cluster is assigned a name, through which it can be accessed and reused later on.
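
For example, a minimal sketch (the arguments shown are illustrative, not a full reference) of creating a named
cluster and reloading it later by name:

.. code-block:: python

    import runhouse as rh

    # Launch (or bring up if not already running) an on-demand cluster with a stable name
    cluster = rh.ondemand_cluster(
        name="rh-cpu",
        instance_type="CPU:2+",
        provider="aws",
    ).up_if_not()

    # Later, the saved cluster can be reloaded by name
    same_cluster = rh.cluster(name="rh-cpu")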

@@ -14,8 +14,6 @@ Cluster Factory Methods

.. autofunction:: runhouse.ondemand_cluster

.. autofunction:: runhouse.sagemaker_cluster

Cluster Class
~~~~~~~~~~~~~

@@ -75,141 +73,6 @@ See the `SkyPilot docs <https://skypilot.readthedocs.io/en/latest/cloud-setup/cl
for more details on configuring a VPC.


SageMakerCluster Class
~~~~~~~~~~~~~~~~~~~~~~
.. note::

SageMaker support is in alpha and under active development. Please report any bugs or let us know of any
feature requests.

A SageMakerCluster is a cluster that uses a SageMaker instance under the hood.

Runhouse currently supports two core usage paths for SageMaker clusters:

- **Compute backend**: You can use SageMaker as a compute backend, just as you would a
:ref:`BYO (bring-your-own) <Cluster Class>` or an :ref:`on-demand cluster <OnDemandCluster Class>`.
Runhouse will handle launching the SageMaker compute and creating the SSH connection
to the cluster.

- **Dedicated training jobs**: You can use a SageMakerCluster class to run a training job on SageMaker compute.
To do so, you will need to provide an
`estimator <https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html>`__.
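
For example, a minimal sketch of both paths (parameter names are illustrative; the role ARN is the example
used below in :ref:`SageMaker Hardware Setup`):

.. code-block:: python

    import runhouse as rh

    # Compute backend: launch a SageMaker instance and connect to it over SSH
    sm_cluster = rh.sagemaker_cluster(
        name="sm-cluster",
        role="arn:aws:iam::123456789:role/service-role/AmazonSageMaker-ExecutionRole-20230717T192142",
    ).up_if_not()

    # Dedicated training job: pass a SageMaker estimator, which must itself
    # contain the execution role ARN
    # estimator = sagemaker.pytorch.PyTorch(entry_point="train.py", role=..., ...)
    # sm_train = rh.sagemaker_cluster(name="sm-train", estimator=estimator)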

.. note::

Runhouse requires an AWS IAM role (either name or full ARN) whose credentials have adequate permissions to
create SageMaker endpoints and access AWS resources.

Please see :ref:`SageMaker Hardware Setup` for more specific instructions and
requirements for providing the role and setting up the cluster.

.. autoclass:: runhouse.SageMakerCluster
:members:
:exclude-members:

.. automethod:: __init__

SageMaker Hardware Setup
------------------------

IAM Role
^^^^^^^^

SageMaker clusters require the `AWS CLI V2 <https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-welcome.html>`__ and
a SageMaker IAM role configured with the
`AWS Systems Manager <https://docs.aws.amazon.com/systems-manager/latest/userguide/ssm-agent.html>`__.


In order to launch a cluster, you must grant SageMaker the necessary permissions with an IAM role, which
can be provided either by name or by full ARN. You can also specify a profile explicitly or
with the :code:`AWS_PROFILE` environment variable.

For example, let's say your local :code:`~/.aws/config` file contains:

.. code-block:: ini

    [profile sagemaker]
    role_arn = arn:aws:iam::123456789:role/service-role/AmazonSageMaker-ExecutionRole-20230717T192142
    region = us-east-1
    source_profile = default

There are several ways to provide the necessary credentials when :ref:`initializing the cluster <Cluster Factory Methods>`:

- Providing the AWS profile name: :code:`profile="sagemaker"`
- Providing the AWS Role ARN directly: :code:`role="arn:aws:iam::123456789:role/service-role/AmazonSageMaker-ExecutionRole-20230717T192142"`
- Environment Variable: setting :code:`AWS_PROFILE` to :code:`"sagemaker"`
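
For example, a minimal sketch (only one of these is needed; the parameter names follow the options above):

.. code-block:: python

    import runhouse as rh

    # Option 1: reference the AWS profile from ~/.aws/config
    sm = rh.sagemaker_cluster(name="sm-cluster", profile="sagemaker")

    # Option 2: pass the role ARN directly
    # sm = rh.sagemaker_cluster(
    #     name="sm-cluster",
    #     role="arn:aws:iam::123456789:role/service-role/AmazonSageMaker-ExecutionRole-20230717T192142",
    # )

    # Option 3: rely on the AWS_PROFILE environment variable
    # (e.g. `export AWS_PROFILE=sagemaker` before running, then omit profile/role)
    # sm = rh.sagemaker_cluster(name="sm-cluster")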

.. note::

If no role or profile is provided, Runhouse will try using the :code:`default` profile. Note that if this default AWS
identity is not a role, you will need to provide the :code:`role` or :code:`profile` explicitly.

.. tip::

If you are providing an estimator, you must provide the role ARN explicitly as part of the estimator object.
More info on estimators `here <https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html>`__.

Please see the `AWS docs <https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html>`__ for further
instructions on creating and configuring an ARN Role.


AWS CLI V2
^^^^^^^^^^

The SageMaker SDK uses AWS CLI V2, which must be installed on your local machine. Installing it requires one of two steps:

- `Migrate from V1 to V2 <https://docs.aws.amazon.com/cli/latest/userguide/cliv2-migration-instructions.html#cliv2-migration-instructions-migrate>`_

- `Install V2 <https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html>`_


To confirm the installation succeeded, run ``aws --version`` in the command line. You should see something like:

.. code-block::

    $ aws-cli/2.13.8 Python/3.11.4 Darwin/21.3.0 source/arm64 prompt/off

If you are still seeing the V1 version, first try uninstalling V1 in case it is still present
(e.g. ``pip uninstall awscli``).

You may also need to add the V2 executable to the PATH of your python environment. For example, if you are using conda,
it’s possible the conda env will try using its own version of the AWS CLI located at a different
path (e.g. ``/opt/homebrew/anaconda3/bin/aws``), while the system-wide installation of the AWS CLI is located somewhere
else (e.g. ``/opt/homebrew/bin/aws``).

To find the global AWS CLI path:

.. code-block::

    $ which aws
To ensure that the global AWS CLI version is used within your python environment, you’ll need to adjust the
PATH environment variable so that it prioritizes the global AWS CLI path.

.. code-block::

    $ export PATH=/opt/homebrew/bin:$PATH

SSM Setup
^^^^^^^^^
The AWS Systems Manager service is used to create SSH tunnels with the SageMaker cluster.

To install the AWS Session Manager Plugin, please see the `AWS docs <https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-working-with-install-plugin.html>`_
or `SageMaker SSH Helper <https://github.com/aws-samples/sagemaker-ssh-helper#step-4-connect-over-ssm>`__. The SSH Helper package
simplifies the process of creating SSH tunnels with SageMaker clusters. It is installed by default if
you are installing Runhouse with the SageMaker dependency: :code:`pip install runhouse[sagemaker]`.

You can also install the Session Manager by running the CLI command:

.. code-block::

    $ sm-local-configure

To configure your SageMaker IAM role with the AWS Systems Manager, please
refer to `these instructions <https://github.com/aws-samples/sagemaker-ssh-helper/blob/main/IAM_SSM_Setup.md>`__.


Cluster Authentication & Verification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Runhouse provides a couple of options to manage the connection to the Runhouse API server running on a cluster.
@@ -228,10 +91,6 @@ be started on the cluster on port :code:`32300`.
- ``none``: Does not use any port forwarding or enforce any authentication. Connects to the cluster with HTTP by
default on port :code:`80`. This is useful when connecting to a cluster within a VPC, or creating a tunnel manually
on the side with custom settings.
- ``aws_ssm``: Uses the
`AWS Systems Manager <https://docs.aws.amazon.com/systems-manager/latest/userguide/what-is-systems-manager.html>`__ to
create an SSH tunnel to the cluster, by default on port :code:`32300`. *Note: this is currently only relevant
for SageMaker Clusters.*
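
A minimal sketch, assuming the ``server_connection_type`` argument (as seen in the saved cluster configs
elsewhere in these docs), of selecting a connection type when creating a cluster:

.. code-block:: python

    import runhouse as rh

    cluster = rh.ondemand_cluster(
        name="rh-cpu",
        instance_type="CPU:2+",
        provider="aws",
        server_connection_type="ssh",  # or "none" when already inside the VPC
    ).up_if_not()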


.. note::
3 changes: 1 addition & 2 deletions docs/docker-setup.rst
@@ -17,8 +17,7 @@ is automatically built and set up remotely on the cluster. The Runhouse
server will start directly inside the remote container.

**NOTE:** This guide details the setup and usage for on-demand clusters
only. Docker container is also supported for Sagemaker clusters, and it
is not yet supported for static clusters.
only. It is not yet supported for static clusters.
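
For example, a minimal sketch (assuming the ``image_id`` argument with a ``docker:`` prefix, as seen in the
cluster configs elsewhere in these docs) of launching an on-demand cluster inside a Docker image:

.. code-block:: python

    import runhouse as rh

    cluster = rh.ondemand_cluster(
        name="pytorch-docker-cluster",
        instance_type="CPU:2+",
        provider="aws",
        image_id="docker:nvcr.io/nvidia/pytorch:23.10-py3",  # "docker:" prefix selects a container image
    ).up_if_not()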

Cluster & Docker Setup
----------------------
1 change: 0 additions & 1 deletion docs/requirements.txt
@@ -4,7 +4,6 @@ pint==0.20.1
pyarrow==9.0.0
pydata-sphinx-theme==0.13.3
ray>=2.2.0
sagemaker
sentry-sdk==1.28.1
sphinx-book-theme==1.0.1
sphinx-click==4.3.0
2 changes: 1 addition & 1 deletion docs/tutorials/api-clusters.rst
@@ -95,7 +95,7 @@ remotely on your AWS instance.
On-Demand Clusters within Existing Cloud VPC
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you would like to launch on-demand clusters using existing VPCs,
you can easily set it up by configuring SkyPilot. Without setting a VPC,
we launch in the default VPC in the region of the cluster. If you do
10 changes: 5 additions & 5 deletions docs/tutorials/api-resources.rst
@@ -276,16 +276,16 @@ to notify them.
INFO | 2024-08-18 06:51:39.797150 | Saving config for aws-cpu-ssh-secret to Den
INFO | 2024-08-18 06:51:39.972763 | Saving secrets for aws-cpu-ssh-secret to Vault
INFO | 2024-08-18 06:51:40.190996 | Saving config to RNS: {'name': '/jlewitt1/aws-cpu_default_env', 'resource_type': 'env', 'resource_subtype': 'Env', 'provenance': None, 'visibility': 'private', 'env_vars': {}, 'env_name': 'aws-cpu_default_env', 'compute': {}, 'reqs': ['ray==2.30.0'], 'working_dir': None}
INFO | 2024-08-18 06:51:40.368442 | Saving config to RNS: {'name': '/jlewitt1/aws-cpu', 'resource_type': 'cluster', 'resource_subtype': 'OnDemandCluster', 'provenance': None, 'visibility': 'private', 'ips': ['3.14.144.103'], 'server_port': 32300, 'server_connection_type': 'ssh', 'den_auth': False, 'ssh_port': 22, 'client_port': 32300, 'creds': '/jlewitt1/aws-cpu-ssh-secret', 'api_server_url': 'https://api.run.house', 'default_env': '/jlewitt1/aws-cpu_default_env', 'instance_type': 'CPU:2+', 'provider': 'aws', 'open_ports': [], 'use_spot': False, 'image_id': 'docker:nvcr.io/nvidia/pytorch:23.10-py3', 'region': 'us-east-2', 'stable_internal_external_ips': [('172.31.5.134', '3.14.144.103')], 'sky_kwargs': {'launch': {'retry_until_up': True}}, 'launched_properties': {'cloud': 'aws', 'instance_type': 'm6i.large', 'region': 'us-east-2', 'cost_per_hour': 0.096, 'docker_user': 'root'}, 'autostop_mins': -1}
INFO | 2024-08-18 06:51:40.190996 | Saving config to RNS: {'name': '/jlewitt1/aws-cpu_default_env', 'resource_type': 'env', 'resource_subtype': 'Env', 'visibility': 'private', 'env_vars': {}, 'env_name': 'aws-cpu_default_env', 'compute': {}, 'reqs': ['ray==2.30.0'], 'working_dir': None}
INFO | 2024-08-18 06:51:40.368442 | Saving config to RNS: {'name': '/jlewitt1/aws-cpu', 'resource_type': 'cluster', 'resource_subtype': 'OnDemandCluster', 'visibility': 'private', 'ips': ['3.14.144.103'], 'server_port': 32300, 'server_connection_type': 'ssh', 'den_auth': False, 'ssh_port': 22, 'client_port': 32300, 'creds': '/jlewitt1/aws-cpu-ssh-secret', 'api_server_url': 'https://api.run.house', 'default_env': '/jlewitt1/aws-cpu_default_env', 'instance_type': 'CPU:2+', 'provider': 'aws', 'open_ports': [], 'use_spot': False, 'image_id': 'docker:nvcr.io/nvidia/pytorch:23.10-py3', 'region': 'us-east-2', 'stable_internal_external_ips': [('172.31.5.134', '3.14.144.103')], 'sky_kwargs': {'launch': {'retry_until_up': True}}, 'launched_properties': {'cloud': 'aws', 'instance_type': 'm6i.large', 'region': 'us-east-2', 'cost_per_hour': 0.096, 'docker_user': 'root'}, 'autostop_mins': -1}
INFO | 2024-08-18 06:51:40.548233 | Sharing cluster credentials, which enables the recipient to SSH into the cluster.
INFO | 2024-08-18 06:51:40.551277 | Saving config for aws-cpu-ssh-secret to Den
INFO | 2024-08-18 06:51:40.728345 | Saving secrets for aws-cpu-ssh-secret to Vault
INFO | 2024-08-18 06:51:41.150745 | Saving config to RNS: {'name': '/jlewitt1/aws-cpu_default_env', 'resource_type': 'env', 'resource_subtype': 'Env', 'provenance': None, 'visibility': 'private', 'env_vars': {}, 'env_name': 'aws-cpu_default_env', 'compute': {}, 'reqs': ['ray==2.30.0'], 'working_dir': None}
INFO | 2024-08-18 06:51:41.150745 | Saving config to RNS: {'name': '/jlewitt1/aws-cpu_default_env', 'resource_type': 'env', 'resource_subtype': 'Env', 'visibility': 'private', 'env_vars': {}, 'env_name': 'aws-cpu_default_env', 'compute': {}, 'reqs': ['ray==2.30.0'], 'working_dir': None}
INFO | 2024-08-18 06:51:42.006030 | Saving config for aws-cpu-ssh-secret to Den
INFO | 2024-08-18 06:51:42.504070 | Saving secrets for aws-cpu-ssh-secret to Vault
INFO | 2024-08-18 06:51:42.728653 | Saving config to RNS: {'name': '/jlewitt1/aws-cpu_default_env', 'resource_type': 'env', 'resource_subtype': 'Env', 'provenance': None, 'visibility': 'private', 'env_vars': {}, 'env_name': 'aws-cpu_default_env', 'compute': {}, 'reqs': ['ray==2.30.0'], 'working_dir': None}
INFO | 2024-08-18 06:51:42.906615 | Saving config to RNS: {'name': '/jlewitt1/aws-cpu', 'resource_type': 'cluster', 'resource_subtype': 'OnDemandCluster', 'provenance': None, 'visibility': 'private', 'ips': ['3.14.144.103'], 'server_port': 32300, 'server_connection_type': 'ssh', 'den_auth': False, 'ssh_port': 22, 'client_port': 32300, 'creds': '/jlewitt1/aws-cpu-ssh-secret', 'api_server_url': 'https://api.run.house', 'default_env': '/jlewitt1/aws-cpu_default_env', 'instance_type': 'CPU:2+', 'provider': 'aws', 'open_ports': [], 'use_spot': False, 'image_id': 'docker:nvcr.io/nvidia/pytorch:23.10-py3', 'region': 'us-east-2', 'stable_internal_external_ips': [('172.31.5.134', '3.14.144.103')], 'sky_kwargs': {'launch': {'retry_until_up': True}}, 'launched_properties': {'cloud': 'aws', 'instance_type': 'm6i.large', 'region': 'us-east-2', 'cost_per_hour': 0.096, 'docker_user': 'root'}, 'autostop_mins': -1}
INFO | 2024-08-18 06:51:42.728653 | Saving config to RNS: {'name': '/jlewitt1/aws-cpu_default_env', 'resource_type': 'env', 'resource_subtype': 'Env', 'visibility': 'private', 'env_vars': {}, 'env_name': 'aws-cpu_default_env', 'compute': {}, 'reqs': ['ray==2.30.0'], 'working_dir': None}
INFO | 2024-08-18 06:51:42.906615 | Saving config to RNS: {'name': '/jlewitt1/aws-cpu', 'resource_type': 'cluster', 'resource_subtype': 'OnDemandCluster', 'visibility': 'private', 'ips': ['3.14.144.103'], 'server_port': 32300, 'server_connection_type': 'ssh', 'den_auth': False, 'ssh_port': 22, 'client_port': 32300, 'creds': '/jlewitt1/aws-cpu-ssh-secret', 'api_server_url': 'https://api.run.house', 'default_env': '/jlewitt1/aws-cpu_default_env', 'instance_type': 'CPU:2+', 'provider': 'aws', 'open_ports': [], 'use_spot': False, 'image_id': 'docker:nvcr.io/nvidia/pytorch:23.10-py3', 'region': 'us-east-2', 'stable_internal_external_ips': [('172.31.5.134', '3.14.144.103')], 'sky_kwargs': {'launch': {'retry_until_up': True}}, 'launched_properties': {'cloud': 'aws', 'instance_type': 'm6i.large', 'region': 'us-east-2', 'cost_per_hour': 0.096, 'docker_user': 'root'}, 'autostop_mins': -1}
1 change: 1 addition & 0 deletions requirements.txt
@@ -15,3 +15,4 @@ wheel
apispec
httpx
pydantic >=2.5.0
pynvml
6 changes: 1 addition & 5 deletions runhouse/__init__.py
@@ -1,5 +1,4 @@
from runhouse.resources.asgi import Asgi, asgi
from runhouse.resources.blobs import blob, Blob, file, File
from runhouse.resources.envs import conda_env, CondaEnv, env, Env
from runhouse.resources.folders import Folder, folder, GCSFolder, S3Folder
from runhouse.resources.functions.aws_lambda import LambdaFunction
@@ -12,8 +11,6 @@
kubernetes_cluster,
ondemand_cluster,
OnDemandCluster,
sagemaker_cluster,
SageMakerCluster,
)

# WARNING: Any built-in module that is imported here must be capitalized followed by all lowercase, or we will
@@ -26,7 +23,6 @@
package,
Package,
)
from runhouse.resources.provenance import capture_stdout, Run, run, RunStatus, RunType
from runhouse.resources.resource import Resource
from runhouse.resources.secrets import provider_secret, ProviderSecret, Secret, secret

@@ -63,4 +59,4 @@ def __getattr__(name):
raise AttributeError(f"module {__name__!r} has no attribute {name!r}")


__version__ = "0.0.33"
__version__ = "0.0.34"
15 changes: 14 additions & 1 deletion runhouse/constants.py
@@ -10,6 +10,7 @@
LOCALHOST: str = "127.0.0.1"
LOCAL_HOSTS: List[str] = ["localhost", LOCALHOST]
TUNNEL_TIMEOUT = 5
NUM_PORTS_TO_TRY = 10

LOGS_DIR = ".rh/logs"
RH_LOGFILE_PATH = Path.home() / LOGS_DIR
@@ -73,11 +74,23 @@
# Constants for the status check
DOUBLE_SPACE_UNICODE = "\u00A0\u00A0"
BULLET_UNICODE = "\u2022"
SECOND = 1
MINUTE = 60
HOUR = 3600
DEFAULT_STATUS_CHECK_INTERVAL = 1 * MINUTE
INCREASED_STATUS_CHECK_INTERVAL = 1 * HOUR
STATUS_CHECK_DELAY = 1 * MINUTE
GPU_COLLECTION_INTERVAL = 5 * SECOND

# We collect GPU stats every GPU_COLLECTION_INTERVAL seconds,
# meaning that in one minute we collect (MINUTE / GPU_COLLECTION_INTERVAL) GPU samples.
# Currently, we save GPU info for the last 10 minutes or less.
MAX_GPU_INFO_LEN = (MINUTE / GPU_COLLECTION_INTERVAL) * 10

# If we only collect the GPU stats (and do not send them to Den), the gpu_info dictionary *will not* be reset by the servlets.
# Therefore, we need to cap the gpu_info size so that it doesn't consume too much cluster memory.
# Currently, we reduce the size by half, meaning we only keep the last (MAX_GPU_INFO_LEN / 2) entries of gpu_info.
REDUCED_GPU_INFO_LEN = MAX_GPU_INFO_LEN / 2
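# Illustration (not in the original source): with the defaults above,
# MAX_GPU_INFO_LEN = (60 / 5) * 10 = 120 samples (10 minutes of stats collected every 5 seconds),
# and REDUCED_GPU_INFO_LEN = 120 / 2 = 60 samples (roughly the last 5 minutes).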


# Constants Surfacing Logs to Den
DEFAULT_LOG_SURFACING_INTERVAL = 2 * MINUTE
21 changes: 19 additions & 2 deletions runhouse/main.py
@@ -366,14 +366,13 @@ def _print_envs_info(
total_gpu_memory = math.ceil(
float(env_gpu_info.get("total_memory")) / (1024**3)
)
gpu_util_percent = round(float(env_gpu_info.get("utilization_percent")), 2)
used_gpu_memory = round(
float(env_gpu_info.get("used_memory")) / (1024**3), 2
)
gpu_memory_usage_percent = round(
float(used_gpu_memory / total_gpu_memory) * 100, 2
)
gpu_usage_summery = f"{DOUBLE_SPACE_UNICODE}GPU: {gpu_util_percent}% | Memory: {used_gpu_memory} / {total_gpu_memory} Gb ({gpu_memory_usage_percent}%)"
gpu_usage_summery = f"{DOUBLE_SPACE_UNICODE}GPU Memory: {used_gpu_memory} / {total_gpu_memory} Gb ({gpu_memory_usage_percent}%)"
console.print(gpu_usage_summery)

resources_in_env = [
Expand Down Expand Up @@ -408,6 +407,8 @@ def _print_status(status_data: dict, current_cluster: Cluster) -> None:
if "name" in cluster_config.keys():
console.print(cluster_config.get("name"))

has_cuda: bool = cluster_config.get("has_cuda")

# print headline
daemon_headline_txt = (
"\N{smiling face with horns} Runhouse Daemon is running \N{Runner}"
@@ -420,6 +421,22 @@
# Print relevant info from cluster config.
_print_cluster_config(cluster_config)

# print general cpu and gpu utilization
cluster_gpu_utilization: float = status_data.get("server_gpu_utilization")

# cluster_gpu_utilization can be None if the cluster was not using its GPU at the moment cluster.status() was invoked.
if cluster_gpu_utilization is None and has_cuda:
cluster_gpu_utilization: float = 0.0

cluster_cpu_utilization: float = status_data.get("server_cpu_utilization")

server_util_info = (
f"CPU Utilization: {round(cluster_cpu_utilization, 2)}% | GPU Utilization: {round(cluster_gpu_utilization,2)}%"
if has_cuda
else f"CPU Utilization: {round(cluster_cpu_utilization, 2)}%"
)
console.print(server_util_info)

# print the environments in the cluster, and the resources associated with each environment.
_print_envs_info(env_servlet_processes, current_cluster)

2 changes: 0 additions & 2 deletions runhouse/resources/blobs/__init__.py

This file was deleted.

