[BEAM-13314]Revise recommendations to manage Python pipeline dependencies. #16938

Merged
merged 12 commits into from
Mar 29, 2022

Conversation

AnandInguva
Contributor

Please add a meaningful description for your change here


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests

See CI.md for more information about GitHub Actions CI.

@AnandInguva
Contributor Author

R: @tvalentyn

@AnandInguva changed the title from "[Beam 13314]Revise recommendations to manage Python pipeline dependencies." to "[BEAM-13314]Revise recommendations to manage Python pipeline dependencies." on Feb 25, 2022
@@ -46,7 +46,7 @@ Beam [SDK container images](https://hub.docker.com/search?q=apache%2Fbeam&type=i

1. **[Writing a new](#writing-new-dockerfiles) Dockerfile based on a released container image**. This is sufficient for simple additions to the image, such as adding artifacts or environment variables.
2. **[Modifying](#modifying-dockerfiles) a source Dockerfile in [Beam](https://github.com/apache/beam)**. This method requires building from Beam source but allows for greater customization of the container (including replacement of artifacts or base OS/language versions).

3. **[Build](#modify-existing-base-image) an existing container image to make it compatible with Apache Beam Runners**. This method is used when users start from an existing image, and configure the image to be compatible with Apache Beam Runners.
Contributor @tvalentyn, Mar 4, 2022

Suggested change
3. **[Build](#modify-existing-base-image) an existing container image to make it compatible with Apache Beam Runners**. This method is used when users start from an existing image, and configure the image to be compatible with Apache Beam Runners.
3. **[Modifying](#modify-existing-base-image) an existing container image to make it compatible with Apache Beam Runners**. This method is used when users start from an existing image, and configure the image to be compatible with Apache Beam Runners.

Also: one of three ways above.

Contributor Author

Done

1. Copy necessary artifacts from Apache Beam base image to your image.
```
# This can be any container image,
FROM python:3.8-slim
Contributor

mismatch between py3.8 and py3.7 below

Contributor Author

Thanks for catching

@@ -45,6 +45,19 @@ If your pipeline uses public packages from the [Python Package Index](https://py
The runner will use the `requirements.txt` file to install your additional dependencies onto the remote workers.

**Important:** Remote workers will install all packages listed in the `requirements.txt` file. Because of this, it's very important that you delete non-PyPI packages from the `requirements.txt` file, as stated in step 2. If you don't remove non-PyPI packages, the remote workers will fail when attempting to install packages from sources that are unknown to them.
> **NOTE**: An alternative to `pip check` is to use a library like [pip-tools](https://github.com/jazzband/pip-tools) to compile the `requirements.txt` with all the dependencies required for the pipeline.
Contributor

pip freeze, not pip check

you can explain:

`...to compile the requirements.txt with all transitive dependencies from a smaller set of requirements.`
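
To illustrate the "delete non-PyPI packages" step quoted in the hunk above, here is a minimal sketch that filters `pip freeze` style lines down to entries that look like plain PyPI requirements. The helper names and the matching rules are assumptions made for illustration; they are a heuristic, not how pip or Beam classifies requirements.

```python
def is_pypi_requirement(line: str) -> bool:
    """Heuristic (assumption): True if the line looks like a plain PyPI requirement."""
    entry = line.strip()
    if not entry or entry.startswith("#"):
        return False  # blank lines and comments carry no requirement
    if entry.startswith(("-e", "--", "-r", "-c", ".", "/")):
        return False  # editable installs, pip options, includes, local paths
    if "://" in entry or " @ " in entry:
        return False  # direct URLs and VCS references
    return True


def keep_pypi_only(lines):
    """Return the stripped lines that pass the heuristic above."""
    return [ln.strip() for ln in lines if is_pypi_requirement(ln)]


# Hypothetical `pip freeze` output, for illustration only.
frozen = [
    "apache-beam==2.35.0",
    "-e git+https://example.com/me/private.git#egg=private",
    "my-local-pkg @ file:///home/me/my-local-pkg",
    "numpy==1.21.5",
]
print(keep_pypi_only(frozen))
# ['apache-beam==2.35.0', 'numpy==1.21.5']
```

The filtered list still needs a manual review before it is written back to `requirements.txt`, since a heuristic like this can miss unusual entries.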

COPY <path to requirements.txt> /tmp/requirements.txt
RUN python -m pip download -r /tmp/requirements.txt

**Note:** [Different approaches](https://beam.apache.org/documentation/runtime/environments/#writing-new-dockerfiles) to build the container images that would be compatible with Apache Beam Runners.
Contributor

I don't think this is relevant here

Contributor Author

I thought referencing how to use a custom container might be useful, but thinking about it, you are right.

# Add these lines with the path to the requirements.txt to the Dockerfile

COPY <path to requirements.txt> /tmp/requirements.txt
RUN python -m pip download -r /tmp/requirements.txt
Contributor

why pip download and not pip install ?


You can pass a [container](https://hub.docker.com/search?q=apache%2Fbeam&type=image) image with all the dependencies that are needed for the pipeline instead of `requirements.txt`. [Follow the instructions on how to run pipeline with Custom Container images](https://beam.apache.org/documentation/runtime/environments/#running-pipelines).

1. If you are passing a custom container image, `--sdk_container_image` at runtime and specify `--requirements_file` option, we recommend you to install the dependencies from the `--requirements_file` when building your container image. In this case, you would reduce the pipeline startup time and do not need to pass `--requirements_file` option at runtime.
Contributor

If you are using a custom container image, we recommend that you install the dependencies from the --requirements_file directly into your image at build time. In this case, you do not need to pass the --requirements_file option at runtime, which will reduce the pipeline startup time. For example:...

Contributor Author

Changed it
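
As an illustration of the recommendation in this thread, a minimal sketch of choosing the runtime flags: when a custom image is supplied, the dependencies are assumed to be installed in it already, so `--requirements_file` is left out. The function name and the image URL are hypothetical; only the two flag names come from the discussion.

```python
def runtime_args(sdk_container_image=None, requirements_file=None):
    """Return extra pipeline flags; skip --requirements_file with a custom image."""
    args = []
    if sdk_container_image:
        # Assumption: the image was built with the requirements already installed.
        args.append(f"--sdk_container_image={sdk_container_image}")
    elif requirements_file:
        # No custom image: workers install the requirements at startup.
        args.append(f"--requirements_file={requirements_file}")
    return args


# Hypothetical image name, for illustration only.
print(runtime_args(sdk_container_image="gcr.io/my-project/beam-custom:latest",
                   requirements_file="requirements.txt"))
# ['--sdk_container_image=gcr.io/my-project/beam-custom:latest']
```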

@@ -171,6 +171,49 @@ creates a Java 8 SDK image with appropriate licenses in `/opt/apache/beam/third_

By default, no licenses/notices are added to the docker images.

#### Build an existing container image to make it compatible with Apache Beam Runners {#modify-existing-base-image}
Contributor

@emilymye could you PTAL at this section?

AnandInguva and others added 2 commits March 7, 2022 11:58
Co-authored-by: tvalentyn <tvalentyn@users.noreply.github.com>

In the pre-building step, we install pipeline dependencies on the container image prior to the job submission. This would speed up the pipeline execution.\
To use pre-building the dependencies from `requirements.txt` on the container image. Follow the steps below.
1. Provide the container engine. We support `docker` and `cloud_build`(requires a GCP project with Cloud Build API enabled).
Contributor

local_docker?

1. Provide the container engine. We support `docker` and `cloud_build`(requires a GCP project with Cloud Build API enabled).

--prebuild_sdk_container_enginer <execution_environment>
2. To pass a base image for pre-building dependencies, enable this flag. If not, apache beam's base image would be used.
Contributor

Note that this may not work on an arbitrary base image, the base image should follow the same contract to install dependencies in a setup_only mode as apache beam's base image https://github.com/apache/beam/blob/master/sdks/python/container/boot.go#L49

Contributor @tvalentyn, Mar 7, 2022

  1. prebuild_sdk_container_engine (not enginer)
  2. Good point that the container needs to have the official entry point for this to work. In all the container-customization mechanisms we suggest, we recommend, one way or another, using Beam's boot entry point.
  3. As a part of making prebuilding not experimental, I think we should remove prebuild_sdk_container_base_image and just use the --sdk_container_image flag for this purpose. I don't see the need for two different flags.

Contributor Author

+1 to the point 3

Contributor

@y1chi took care of #3 in #17032. Thanks, @y1chi .

Contributor Author

As @y1chi pointed out, it may not work if the user doesn't follow apache beam's contract. But we do instruct them to follow the contract in some way.

So, I assume we can introduce this section as part of the instruction?

Contributor @tvalentyn, Mar 17, 2022

I would remove #2 now that we don't need a special flag and use the standard --sdk_container_image flag for this purpose.

Contributor Author

I feel like we can keep 2 and update the pipeline option to --sdk_container_image=....

2. To pass a base image for pre-building dependencies, enable this flag. If not, apache beam's base image would be used.

--prebuild_sdk_container_base_image <location_to_base_image>
3. To push the container image, pre-built locally with `Docker` , to a remote repository(eg: docker registry), provide URL to the docker registry by passing
Contributor

local_docker

Contributor Author

Yes it should be local_docker

@@ -171,6 +171,49 @@ creates a Java 8 SDK image with appropriate licenses in `/opt/apache/beam/third_

By default, no licenses/notices are added to the docker images.

#### Build an existing container image to make it compatible with Apache Beam Runners {#modify-existing-base-image}
Contributor

Suggested change
#### Build an existing container image to make it compatible with Apache Beam Runners {#modify-existing-base-image}
#### Build a container image based on an existing image compatible with Apache Beam Runners {#modify-existing-base-image}

@@ -171,6 +171,49 @@ creates a Java 8 SDK image with appropriate licenses in `/opt/apache/beam/third_

By default, no licenses/notices are added to the docker images.

#### Build an existing container image to make it compatible with Apache Beam Runners {#modify-existing-base-image}
Beam offers a way to take a Beam container image and customize it. But if you have an existing base image to be compatible with Apache Beam Runners, use a [multi-stage build](https://docs.docker.com/develop/develop-images/multistage-build/) process to copy over the necessary artifacts from a default Apache Beam base image and provide your custom container image.
Contributor

Suggested change
Beam offers a way to take a Beam container image and customize it. But if you have an existing base image to be compatible with Apache Beam Runners, use a [multi-stage build](https://docs.docker.com/develop/develop-images/multistage-build/) process to copy over the necessary artifacts from a default Apache Beam base image and provide your custom container image.
Beam offers a way to provide your own custom Beam container image. The easiest way to build a new custom image that is compatible with Apache Beam Runners is to use a [multi-stage build](https://docs.docker.com/develop/develop-images/multistage-build/) process. This copies over the necessary artifacts from a default Apache Beam base image to build your custom container image.

# Copy files from official SDK image, including script/dependencies.
COPY --from=apache/beam_python3.7_sdk:2.25.0 /opt/apache/beam /opt/apache/beam

# Perform any addtional customizations if desired
Contributor

Suggested change
# Perform any addtional customizations if desired
# Perform any additional customizations if desired

FROM python:3.8-slim

# Install SDK. (needed for Python SDK)
RUN pip install --no-cache-dir apache-beam[gcp]==2.25.0
Contributor

Can make the version here a build arg https://docs.docker.com/engine/reference/builder/#arg

e.g.

FROM python:3.8-slim
ARG beam_ver=2.25.0
...

RUN pip install --no-cache-dir apache-beam[gcp]==$beam_ver
COPY --from=apache/beam_python3.7_sdk:$beam_ver /opt/apache/beam /opt/apache/beam

Contributor

I think we tried this before and it didn't work.

Contributor Author @AnandInguva, Mar 7, 2022

I built a Docker image with ARG BEAM_VERSION=2.31.0 and RUN pip install --no-cache-dir apache-beam[gcp]==$BEAM_VERSION. This seems to be working. I verified the /opt/apache/beam/boot path.

Can you recall what the error was?

Contributor

I thought using Dockerfile params (ARG) in the FROM statement wasn't possible. That's what I was referring to.

Contributor Author @AnandInguva, Mar 9, 2022

Yes, expansion of ARG in the COPY --from is not supported in Docker. Hence moving forward with the current instructions
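
Since the thread concludes that `ARG` cannot be expanded in `COPY --from`, one possible workaround is to render the Dockerfile from a template before building, so the version is written once and substituted in both places. This is only a sketch under that assumption: `render_dockerfile` is a hypothetical helper, and the `ENTRYPOINT` line is assumed from the `/opt/apache/beam/boot` path mentioned earlier in this conversation.

```python
from string import Template

# Template mirroring the Dockerfile lines quoted in this review.
DOCKERFILE = Template("""\
FROM python:${py}-slim

# Install SDK. (needed for Python SDK)
RUN pip install --no-cache-dir apache-beam[gcp]==${beam}

# Copy files from official SDK image, including script/dependencies.
COPY --from=apache/beam_python${py}_sdk:${beam} /opt/apache/beam /opt/apache/beam

# Assumed entrypoint, based on the boot path mentioned in the thread.
ENTRYPOINT ["/opt/apache/beam/boot"]
""")


def render_dockerfile(python_version: str, beam_version: str) -> str:
    """Substitute one Python version and one Beam version everywhere."""
    return DOCKERFILE.substitute(py=python_version, beam=beam_version)


print(render_dockerfile("3.7", "2.35.0"))
```

Rendering from a single pair of values also keeps the base image and the SDK image on the same Python version, which avoids the 3.8 versus 3.7 mismatch flagged earlier in the review.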

@AnandInguva
Contributor Author

PTAL @tvalentyn @emilymye @y1chi

@@ -171,6 +171,48 @@ creates a Java 8 SDK image with appropriate licenses in `/opt/apache/beam/third_

By default, no licenses/notices are added to the docker images.

#### Modifying an existing container image to make it compatible with Apache Beam Runners {#modify-existing-base-image}
Beam offers a way to provide your own custom container image. The easiest way to build a new custom image that is compatible with Apache Beam Runners is to use a [multi-stage build](https://docs.docker.com/develop/develop-images/multistage-build/) process. This copies over the necessary artifacts from a default Apache Beam base image to build your custom container image.
Contributor

Is this multi-stage build?

Contributor Author @AnandInguva, Mar 10, 2022

Yes. It uses a multi-stage build to copy the required artifacts from Apache Beam's base image into the provided custom image.

RUN pip install --no-cache-dir apache-beam[gcp]==2.35.0

# Copy files from official SDK image, including script/dependencies.
COPY --from=apache/beam_python3.7_sdk:2.35.0 /opt/apache/beam /opt/apache/beam
Contributor

I wonder if this is sufficient. IIRC, /opt/apache/beam only contains the boot program? Aren't all the base_image_requirements in site_packages?

Contributor Author

they are in dist_packages

Contributor

The requirements will be installed at runtime, when we pip-install the staged Apache Beam SDK here:

err := pipInstallPackage(files, workDir, sdkWhlFile, false, false, []string{"gcp"})

@codecov

codecov bot commented Mar 17, 2022

Codecov Report

Merging #16938 (d46bd07) into master (2a45a5b) will increase coverage by 0.94%.
The diff coverage is 26.35%.

❗ Current head d46bd07 differs from pull request most recent head 9ad0ba9. Consider uploading reports for the commit 9ad0ba9 to get more accurate results

@@            Coverage Diff             @@
##           master   #16938      +/-   ##
==========================================
+ Coverage   73.00%   73.95%   +0.94%     
==========================================
  Files         658      669      +11     
  Lines       86706    88159    +1453     
==========================================
+ Hits        63301    65194    +1893     
+ Misses      22405    21853     -552     
- Partials     1000     1112     +112     
Flag Coverage Δ
go 49.35% <26.35%> (+3.69%) ⬆️
python 83.64% <ø> (+0.03%) ⬆️


Impacted Files Coverage Δ
...s/go/pkg/beam/core/graph/window/trigger/trigger.go 59.09% <0.00%> (ø)
sdks/go/pkg/beam/core/graph/window/windows.go 82.60% <ø> (ø)
sdks/go/pkg/beam/core/runtime/exec/util.go 74.28% <ø> (ø)
sdks/go/pkg/beam/core/runtime/harness/harness.go 11.26% <0.00%> (-1.43%) ⬇️
sdks/go/pkg/beam/core/runtime/exec/plan.go 48.61% <22.22%> (-8.80%) ⬇️
sdks/go/pkg/beam/core/runtime/exec/pardo.go 50.45% <25.00%> (ø)
...kg/beam/core/runtime/xlangx/expansionx/download.go 61.53% <61.53%> (+1.16%) ⬆️
sdks/go/pkg/beam/core/runtime/exec/combine.go 60.43% <66.66%> (ø)
sdks/go/pkg/beam/core/runtime/exec/fn.go 69.26% <80.00%> (+0.56%) ⬆️
sdks/go/pkg/beam/core/runtime/harness/datamgr.go 74.41% <100.00%> (ø)
... and 60 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data


## Pre-building SDK container image

In the pre-building step, we install pipeline dependencies on the container image prior to the job submission. This would speed up the pipeline execution.\
Contributor

Something is missing here. Let's add an introductory sentence.

In pipeline execution modes where a Beam runner launches SDK workers in Docker containers, the additional pipeline dependencies (specified via --requirements_file and other runtime options) are installed into the containers at runtime. This can increase the worker startup time. However, it may be possible to pre-build the SDK containers and perform the dependency installation once before the workers start.

## Pre-building SDK container image

In the pre-building step, we install pipeline dependencies on the container image prior to the job submission. This would speed up the pipeline execution.\
To use pre-building the dependencies from `requirements.txt` on the container image. Follow the steps below.
Contributor

Suggested change
To use pre-building the dependencies from `requirements.txt` on the container image. Follow the steps below.
To pre-build the container image before the pipeline submission, follow the steps below.

To use pre-building the dependencies from `requirements.txt` on the container image. Follow the steps below.
1. Provide the container engine. We support `local_docker` and `cloud_build`(requires a GCP project with Cloud Build API enabled).

--prebuild_sdk_container_engine <execution_environment>
Contributor

Suggested change
--prebuild_sdk_container_engine <execution_environment>
--prebuild_sdk_container_engine <container_engine>


In the pre-building step, we install pipeline dependencies on the container image prior to the job submission. This would speed up the pipeline execution.\
To use pre-building the dependencies from `requirements.txt` on the container image. Follow the steps below.
1. Provide the container engine. We support `local_docker` and `cloud_build`(requires a GCP project with Cloud Build API enabled).
Contributor

Suggested change
1. Provide the container engine. We support `local_docker` and `cloud_build`(requires a GCP project with Cloud Build API enabled).
1. Provide the container engine. We support `local_docker` (requires local installation of Docker) and `cloud_build`(requires a GCP project with Cloud Build API enabled).

2. To pass a base image for pre-building dependencies, enable this flag. If not, apache beam's base image would be used.

--sdk_container_image <location_to_base_image>
3. To push the container image, pre-built locally with `local_docker` , to a remote repository(eg: docker registry), provide URL to the remote registry by passing
Contributor

If using local_docker engine, provide a URL for the remote registry to which the image will be pushed by passing...

--sdk_container_image <location_to_base_image>
3. To push the container image, pre-built locally with `local_docker` , to a remote repository(eg: docker registry), provide URL to the remote registry by passing

--docker_registry_push_url <IMAGE_URL>
Contributor

I am confused - what is a sample value of this param? Is it supposed to be the image name+tag or just the registry?

Contributor Author @AnandInguva, Mar 17, 2022

It should be a registry. We generate the image tag. Image name is coded as beam_python_prebuilt_sdk at [1].

Maybe it can be worded as --docker_registry_push_url <registry_URL>

[1]

Contributor

Can we give an example of the expected value? As a user reading this doc it is still not obvious.

Contributor Author

We can add an example. Also, let me see if I can make the wording simpler.
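
To make the registry-versus-image distinction in this thread concrete, here is a sketch of how a full image reference could be composed from a registry URL. Only the image name `beam_python_prebuilt_sdk` comes from the comment above. The `<registry>/<name>:<tag>` layout, the UUID tag, the helper name, and the registry value are assumptions for illustration, not a description of Beam's implementation.

```python
import uuid
from typing import Optional

IMAGE_NAME = "beam_python_prebuilt_sdk"  # image name mentioned in the thread


def prebuilt_image_ref(registry_url: str, tag: Optional[str] = None) -> str:
    """Compose <registry>/<image name>:<tag>; the tag is generated if omitted."""
    # Assumption: any unique string works as a tag for this illustration.
    tag = tag or str(uuid.uuid4())
    return f"{registry_url.rstrip('/')}/{IMAGE_NAME}:{tag}"


# Hypothetical registry value: a registry path, not an image name or tag.
print(prebuilt_image_ref("gcr.io/my-project", tag="example-tag"))
# gcr.io/my-project/beam_python_prebuilt_sdk:example-tag
```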


--docker_registry_push_url <IMAGE_URL>
**NOTE:** `docker_registry_push_url` must be a remote registry.
Contributor Author

@y1chi if the user uses pre-building and doesn't provide docker_registry_push_url, what would happen in this case?

I recall it would fail with an error like "Couldn't find the Docker image". If this is the case, we need to make sure that the user provides a remote registry URL. If the user doesn't provide it, can we fail the pipeline prior to job submission?

@AnandInguva
Contributor Author

PTAL @tvalentyn

R: @pcoet @melap. Can you provide any suggestions/edits that can make the topic in PR more clear? Thanks!!! :)

Collaborator @pcoet left a comment

LGTM. Just a couple of minor suggestions.

```
>**NOTE**: This example assumes necessary dependencies (in this case, Python 3.7 and pip) have been installed on the existing base image. Installing the Apache Beam SDK into the image will ensure that the image has the necessary SDK dependencies and reduce the worker startup time.
>The version specified in the `RUN` instruction must match the version used to launch the pipeline.<br>
>**Users need to make sure that whatever base image they use has the same Python/Java interpreter version that they used to run the pipeline**.
Collaborator

Consider: "Make sure that the Python or Java runtime version specified in the base image is the same as the version used to run the pipeline."
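
The version-matching advice above can be sketched as a small check run on the machine that launches the pipeline. The image naming pattern follows the `apache/beam_python3.7_sdk:2.35.0` example quoted in this review; both helper names are hypothetical.

```python
import sys


def default_sdk_image(beam_version: str, version_info=sys.version_info) -> str:
    """Build the SDK image name for the given (or current) interpreter version."""
    major, minor = version_info[0], version_info[1]
    return f"apache/beam_python{major}.{minor}_sdk:{beam_version}"


def interpreter_matches(image: str, version_info=sys.version_info) -> bool:
    """True if the image name carries the same major.minor Python version."""
    return f"python{version_info[0]}.{version_info[1]}_" in image


print(default_sdk_image("2.35.0", (3, 7)))
# apache/beam_python3.7_sdk:2.35.0
print(interpreter_matches("apache/beam_python3.7_sdk:2.35.0", (3, 8)))
# False
```

A name-based check like this only works for images that follow the naming pattern; for an arbitrary base image, the interpreter version has to be checked inside the image itself.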

@@ -45,6 +45,17 @@ If your pipeline uses public packages from the [Python Package Index](https://py
The runner will use the `requirements.txt` file to install your additional dependencies onto the remote workers.

**Important:** Remote workers will install all packages listed in the `requirements.txt` file. Because of this, it's very important that you delete non-PyPI packages from the `requirements.txt` file, as stated in step 2. If you don't remove non-PyPI packages, the remote workers will fail when attempting to install packages from sources that are unknown to them.
> **NOTE**: An alternative to `pip freeze` is to use a library like [pip-tools](https://github.com/jazzband/pip-tools) to compile the all the dependencies required for the pipeline from a `--requirements_file`, where only top-level dependencies are mentioned.
Collaborator

"compile the all" -> "compile all"


In pipeline execution modes where a Beam runner launches SDK workers in Docker containers, the additional pipeline dependencies (specified via `--requirements_file` and other runtime options) are installed into the containers at runtime. This can increase the worker startup time.
However, it may be possible to pre-build the SDK containers and perform the dependency installation once before the workers start. To pre-build the container image before pipeline submission, provide the pipeline options mentioned below.
1. Provide the container engine. We support `local_docker`(requires local installation of Docker) and `cloud_build`(requires a GCP project with Cloud Build API enabled).
Collaborator

In general, prefer "Beam" to "we", as in "Beam supports..."

1. Provide the container engine. Beam supports `local_docker`(requires local installation of Docker) and `cloud_build`(requires a GCP project with Cloud Build API enabled).

--prebuild_sdk_container_engine=<container_engine>
2. To pass a base image for pre-building dependencies, provide `--sdk_container_image`. If not, Apache beam's base [image](https://hub.docker.com/search?q=apache%2Fbeam&type=image) would be used.
Contributor

As discussed offline, let's remove this and line 156.


**NOTE:** `docker_registry_push_url` must be a remote registry.
> To use Docker, the `--sdk_container_image` should be compatible with Apache Beam Runner. Please follow the [instructions](https://beam.apache.org/documentation/runtime/environments/#building-and-pushing-custom-containers) on how to build a base container image compatible with Apache Beam.

Contributor

Suggestion to add to the notes:

The pre-building feature requires the Apache Beam SDK for Python, version 2.25.0 or later.

The container images created during prebuilding will persist beyond the pipeline runtime.
Once your job is finished or stopped, you can remove the pre-built image from the container registry.

If your pipeline is using a custom container image, most likely you will not benefit from the prebuilding step, as extra dependencies can be preinstalled in the custom image at build time. If you would still like to use prebuilding with custom images, use Apache Beam SDK 2.38.0 or newer and supply your custom image via the --sdk_container_image pipeline option.
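
The pre-building options discussed in this review can be sketched as a small helper that assembles the flags. The flag names and the two engine values come from the thread; `prebuild_args` and the sample registry value are hypothetical. Whether `--docker_registry_push_url` is mandatory with `local_docker` is left open in the discussion, so the sketch does not enforce it.

```python
SUPPORTED_ENGINES = ("local_docker", "cloud_build")  # values named in the review


def prebuild_args(engine, requirements_file, docker_registry_push_url=None):
    """Assemble pipeline flags for pre-building the SDK container image."""
    if engine not in SUPPORTED_ENGINES:
        raise ValueError(
            f"unsupported engine {engine!r}; expected one of {SUPPORTED_ENGINES}")
    args = [
        f"--prebuild_sdk_container_engine={engine}",
        f"--requirements_file={requirements_file}",
    ]
    if docker_registry_push_url:
        # Registry to push the pre-built image to (a registry, not an image name).
        args.append(f"--docker_registry_push_url={docker_registry_push_url}")
    return args


# Hypothetical registry value, for illustration only.
print(prebuild_args("local_docker", "requirements.txt", "gcr.io/my-project"))
```

The resulting list could be appended to the arguments used to launch the pipeline. Validating the engine name up front catches the `docker` versus `local_docker` mix-up raised earlier in the review.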

@tvalentyn
Contributor

Thanks!

@tvalentyn tvalentyn merged commit c84818d into apache:master Mar 29, 2022
5 participants