Skip to content

Reduce build image components templates size #6910

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

hanwen-cluster
Copy link
Contributor

@hanwen-cluster hanwen-cluster commented Jul 9, 2025

Description of changes

ImageBuilder has a limit of 16000 with create component API https://docs.aws.amazon.com/imagebuilder/latest/APIReference/API_CreateComponent.html. This commit reduces the size and add a unit test to prevent the template hitting the limit when parameters with long values are provided

With this PR
Component UpdateOSComponent has size 9251
Component ParallelClusterComponent has size 9699
Component ParallelClusterTagComponent has size 8073
Component ParallelClusterTestComponent has size 13547

Tests

  • Build image on all OSes has passed

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@hanwen-cluster hanwen-cluster requested review from a team as code owners July 9, 2025 17:24
@hanwen-cluster hanwen-cluster force-pushed the developjul9 branch 2 times, most recently from cc68643 to 863bf90 Compare July 9, 2025 17:29
@hanwen-cluster hanwen-cluster added the skip-changelog-update Disables the check that enforces changelog updates in PRs label Jul 9, 2025
@hanwen-cluster hanwen-cluster force-pushed the developjul9 branch 3 times, most recently from 1769061 to 7209724 Compare July 9, 2025 17:48
)
)
assert_that(
len(str(imagebuilder_resources.get(component_name).get("Properties").get("Data")))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here the length of the component will always remain the same.
Because this component has not replaced the parameters that we pass and which are replaced by Cfn Function Sub.So we need to test that by replacing these paramters in the Data.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just keep it as is. Since the test limits the length to under 13600, it leave 2400 for all parameters

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed!

@@ -13,205 +13,91 @@ constants:
phases:
- name: build
steps:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you describe the changes made in parallelcluster.yaml its difficult to follow through and compare without guidance.

Copy link
Contributor Author

@hanwen-cluster hanwen-cluster Jul 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added more comments to describe the changes and hope they will help

# Get Cookbook Url
- name: CookbookUrl
# Initialize system information and URLs
- name: SystemInfo
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really smart!!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! AI did it

# Get Cookbook name
- name: PClusterCookbookVersionName
# Get operating system name for variable chaining
- name: OperatingSystemName
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not combine OperatingSystemName and SystemInfo?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OperatingSystemName is used also in ‎cli/src/pcluster/config/imagebuilder_config.py as part of the parameter

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok got it

Copy link
Contributor

@himani2411 himani2411 Jul 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we set this OS in the image_dna.json when Cookbook could essentially set that for us?
These defaults are used for image_dna.json

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohh thats to validate OS in cookbook!

echo ${!PLATFORM}

# Get input base AMI Architecture
- name: OperatingSystemArchitecture
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we not remove this check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IDK. This commit aims at retaining the logic. So I didn't dig into it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we dont retain this though or did i miss something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines -17 to -24
# Get AWS Region
- name: AWSRegion
action: ExecuteBash
inputs:
commands:
- |
set -v
echo ${AWS::Region}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines -80 to -101
RELEASE='{{ build.OperatingSystemRelease.outputs.stdout }}'

if [ `echo "${!RELEASE}" | grep -w '^amzn\.2'` ]; then
OS='alinux2'
elif [ `echo "${!RELEASE}" | grep -w '^amzn\.2023'` ]; then
OS='alinux2023'
elif [ `echo "${!RELEASE}" | grep '^ubuntu\.22'` ]; then
OS='ubuntu2204'
elif [ `echo "${!RELEASE}" | grep '^ubuntu\.24'` ]; then
OS='ubuntu2404'
elif [ `echo "${!RELEASE}" | grep '^rhel\.8'` ]; then
OS='rhel8'
elif [ `echo "${!RELEASE}" | grep '^rocky\.8'` ]; then
OS='rocky8'
elif [ `echo "${!RELEASE}" | grep '^rhel\.9'` ]; then
OS='rhel9'
elif [ `echo "${!RELEASE}" | grep '^rocky\.9'` ]; then
OS='rocky9'
else
echo "Operating System '${!RELEASE}' is not supported. Failing build."
exit {{ FailExitCode }}
fi
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +23 to +35
. /etc/os-release
RELEASE="${!ID}${!VERSION_ID:+.${!VERSION_ID}}"
case "$RELEASE" in
amzn.2) echo 'alinux2' ;;
amzn.2023) echo 'alinux2023' ;;
ubuntu.22*) echo 'ubuntu2204' ;;
ubuntu.24*) echo 'ubuntu2404' ;;
rhel.8*) echo 'rhel8' ;;
rocky.8*) echo 'rocky8' ;;
rhel.9*) echo 'rhel9' ;;
rocky.9*) echo 'rocky9' ;;
*) echo "Unsupported OS: $RELEASE" && exit {{ FailExitCode }} ;;
esac
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

echo ${!PLATFORM}

# Get input base AMI Architecture
- name: OperatingSystemArchitecture
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IDK. This commit aims at retaining the logic. So I didn't dig into it

Comment on lines -159 to -180
- name: IsOperatingSystemSupported
action: ExecuteBash
inputs:
commands:
- |
set -v
if [ ${CfnParamUpdateOsAndReboot} == false ]; then
RELEASE='{{ build.OperatingSystemRelease.outputs.stdout }}'
if [ `echo "${!RELEASE}" | grep -Ev '^(amzn|ubuntu|rhel|rocky)'` ]; then
echo "This component does not support '${!RELEASE}'. Failing build."
exit {{ FailExitCode }}
fi

# This component only supports aarch64 CPUs on Amazon Linux 2, Ubuntu2204, RHEL8, Rocky8, RHEL9 and Rocky9
ARCH=$(uname -m)
if [[ `echo ${!ARCH}` == 'aarch64' ]]; then
if [ `echo "${!RELEASE}" | grep -Ev '^(amzn\.2|amzn\.2023|ubuntu\.20\.04|ubuntu\.22\.04|ubuntu\.24\.04|rhel\.8|rocky\.8|rhel\.9|rocky\.9)'` ]; then
echo "This component does not support '${!RELEASE}' on ARM64 CPUs. Failing build."
exit {{ FailExitCode }}
fi
fi
fi
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

# This component only supports aarch64 CPUs on Amazon Linux 2, Ubuntu2204, RHEL8, Rocky8, RHEL9 and Rocky9
ARCH=$(uname -m)
if [[ `echo ${!ARCH}` == 'aarch64' ]]; then
if [ `echo "${!RELEASE}" | grep -Ev '^(amzn\.2|amzn\.2023|ubuntu\.20\.04|ubuntu\.22\.04|ubuntu\.24\.04|rhel\.8|rocky\.8|rhel\.9|rocky\.9)'` ]; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why remove these checks?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why were these checks added when previous steps already checked the OS?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was it centos?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines -58 to -71
- name: OperatingSystemRelease
action: ExecuteBash
inputs:
commands:
- |
set -v
FILE=/etc/os-release
if [ -e ${!FILE} ]; then
. ${!FILE}
echo "${!ID}${!VERSION_ID:+.${!VERSION_ID}}"
else
echo "The file '${!FILE}' does not exist. Failing build."
exit {{ FailExitCode }}
fi
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines -105 to -118
- name: OperatingSystemVersion
action: ExecuteBash
inputs:
commands:
- |
set -v
FILE=/etc/os-release
if [ -e ${!FILE} ]; then
. ${!FILE}
echo "${!VERSION_ID}"
else
echo "The file '${!FILE}' does not exist. Failing build."
exit {{ FailExitCode }}
fi
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines -121 to -135
- name: PlatformName
action: ExecuteBash
inputs:
commands:
- |
set -v
OS='{{ build.OperatingSystemName.outputs.stdout }}'

if [ `echo "${!OS}" | grep -E '^(alinux|rhel|rocky)'` ]; then
PLATFORM='RHEL'
elif [ `echo "${!OS}" | grep -E '^ubuntu'` ]; then
PLATFORM='DEBIAN'
fi

echo ${!PLATFORM}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gmarciani
Copy link
Contributor

I like the approach of sourcing system info from a single file rather than retrieving every time from step outputs.
If we need to further reduce the size we can leverage this approach and move the steps to retrieve all the info into a dedicated component to be executed before parallelcluster main component.

apt-get clean
apt-get -y update
apt-get -y install build-essential curl wget jq
fi
exit 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this exit 0 needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not strictly needed. It is the logic is complex. We had a failure from installing curl on al2023. Adding exit 0 makes sure the step pass even if some package installation fail. I know this will cover up some issue. But I don't see serious problem when commands here fail to run

curl --retry 3 -L $CINC_URL | bash -s -- -v {{ ChefVersion }}
[[ -e $CA_CERTS_FILE ]] && {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we are changing the indirect expansion to directly check for the file. IIRC with indirect expansion we check for the existence of the symlink file

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't understand this comment. Can you elaborate?

cd /etc/chef && tar -xzf /etc/chef/aws-parallelcluster-cookbook.tgz --strip-components 1 && rm -f aws-parallelcluster-cookbook.tgz

- name: CreatingChefClientFile
- name: CreateFiles
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NitPick: Can we name it CreateChefFiles, I dont want the name to be same as action

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

PLATFORM='{{ build.PlatformName.outputs.stdout }}'
. /opt/parallelcluster/system_info

# Disable Nouveau driver
/bin/sed -r -i -e 's/GRUB_CMDLINE_LINUX="(.*)"/GRUB_CMDLINE_LINUX="\1 rd.driver.blacklist=nouveau nouveau.modeset=0"/' /etc/default/grub
if [[ ${!PLATFORM} == RHEL ]]; then
Copy link
Contributor

@himani2411 himani2411 Jul 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we do not need the indirect expansion here as we are no longer keeping PLATFORM='{{ build.PlatformName.outputs.stdout }}'

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see we still use indirect expansion in multiple places, please correct the same

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Bash this syntax ${!PLATFORM} leads to indirect expansion where the value of PLATFORM for example if RHEL, then RHEL will be treated as a variable and any value it holds will be used here.
Below is what it will do

PLATFORM='RHEL'

RHEL='OTHER VALUE'

echo "$PLATFORM"             # Prints RHEL


echo "${!PLATFORM}"         # Prints  OTHER VALUE

The syntax ${PLATFORM} or ``$PLATFORM` is same and is direct expansion where it only value of PLATFORM for example if RHEL.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Plz address the use of indirect expansion in these bash scripts as we not longer use {{ build.AWSRegion.outputs.stdout }} for PLATFORM

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought {{ build.AWSRegion.outputs.stdout }} was replaced by jinja but now i am not sure

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behavior in bash is different from CloudFormation

In CloudFormation:

To write a dollar sign and curly braces (${}) literally, add an exclamation point (!) after the open curly brace, such as ${!Literal}. CloudFormation resolves this text as ${Literal}.

 https://docs.aws.amazon.com/AWSCloudFormation/latest/TemplateReference/intrinsic-function-reference-sub.html

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then I would suggest that you make these changes in all the parts of the YAML as I see in some places we are see if [[ ${!PLATFORM} == RHEL ]]; then or if [[ $PLATFORM == RHEL ]]; then

overwrite: true

- name: RemoveKernelPin
- name: Cleanup
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NitPick: Can you keep a name like RemoveKernelPinAndCleanup

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

ImageBuilder has a limit of 16000. This commit reduces the size and add a unit test to prevent the template hitting the limit when parameters with long values are provided

Co-authored-by: Himani Anil Deshpande <himanidp@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
skip-changelog-update Disables the check that enforces changelog updates in PRs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants