AWS ParallelCluster v2.10.0
We're excited to announce the release of AWS ParallelCluster Cookbook 2.10.0.
This is associated with AWS ParallelCluster v2.10.0.
ENHANCEMENTS
- Add support for CentOS 8.
- Add support for instance types with multiple network cards (e.g. p4d.24xlarge).
- Enable FSx Lustre in China regions.
- Add validation step for AMI creation process to fail when using a base AMI created by a different version of ParallelCluster.
- Add validation step for AMI creation process to fail if the selected OS and the base AMI OS are not consistent.
- Add possibility to use a post installation script when building an AMI.
- Install NVIDIA Fabric manager to enable NVIDIA NVSwitch on supported platforms.
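
The two AMI validation steps listed above can be pictured with a minimal sketch. This is not the cookbook's actual implementation: the `AmiInfo` type, its field names, and the `validate_base_ami` helper are hypothetical and exist only to illustrate the two checks (version mismatch and OS mismatch).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AmiInfo:
    """Hypothetical description of a base AMI (illustrative field names)."""
    parallelcluster_version: Optional[str]  # None if not built by ParallelCluster
    os: str


def validate_base_ami(base: AmiInfo, target_version: str, target_os: str) -> List[str]:
    """Return a list of validation errors; an empty list means the AMI passes."""
    errors = []
    # Check 1: the base AMI was created by a different ParallelCluster version.
    if (base.parallelcluster_version is not None
            and base.parallelcluster_version != target_version):
        errors.append(
            f"base AMI was created by ParallelCluster "
            f"{base.parallelcluster_version}, expected {target_version}"
        )
    # Check 2: the selected OS and the base AMI OS are not consistent.
    if base.os != target_os:
        errors.append(f"base AMI OS is {base.os}, selected OS is {target_os}")
    return errors


# Illustrative values only.
mismatched = AmiInfo(parallelcluster_version="2.9.1", os="centos7")
for problem in validate_base_ami(mismatched, "2.10.0", "centos8"):
    print("validation failed:", problem)
```

Returning a list of errors rather than raising on the first one is a design choice of this sketch, so that both problems can be reported in a single run.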
CHANGES
- Upgrade EFA installer to version 1.10.1
  - EFA configuration: efa-config-1.5 (from efa-config-1.4)
  - EFA profile: efa-profile-1.1 (from efa-profile-1.0.0)
  - EFA kernel module: efa-1.10.2 (from efa-1.6.0)
  - RDMA core: rdma-core-31.amzn0 (from rdma-core-28.amzn0)
  - Libfabric: libfabric-1.11.1amzn1.1 (from libfabric-1.10.1amzn1.1)
  - Open MPI: openmpi40-aws-4.0.5 (from openmpi40-aws-4.0.3)
  - Unifies installer runtime options across x86 and aarch64
  - Introduces -g/--enable-gdr switch to install packages with GPUDirect RDMA support
  - Updates to OMPI collectives decision file packaging, migrated from efa-config to efa-profile
  - Introduces CentOS 8 support
- CentOS 6 is no longer supported.
- Upgrade NVIDIA driver to version 450.80.02.
- Upgrade Intel Parallel Studio XE Runtime to version 2020.2.
- Upgrade Munge to version 0.5.14.
- Retrieve FSx Lustre DNS name dynamically.
- Slurm: change SlurmctldPort to 6820-6829 to not overlap with default slurmdbd port (6819).
- Slurm: add compute_resource name and efa as node features.
- Improve Slurm and Munge installation process by cleaning up existing installations from OS repositories.
- Install Python 3 version of aws-cfn-bootstrap scripts.
- Do not force compute fleet into STOPPED state when performing a cluster update. This allows the queue size to be updated without forcing a termination of the existing instances.
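
To make the SlurmctldPort change concrete, the following sketch checks whether a port range in the `low-high` form collides with a single port. The helper names are hypothetical, and the second range is an invented example of an overlapping configuration, not the value used by earlier releases.

```python
def parse_port_range(value: str) -> range:
    """Parse a port value such as "6820-6829" or a single port like "6817"."""
    first, _, last = value.partition("-")
    low = int(first)
    high = int(last) if last else low
    if low > high:
        raise ValueError(f"invalid port range: {value!r}")
    return range(low, high + 1)


def overlaps(ports: range, port: int) -> bool:
    """Return True if the single port falls inside the range."""
    return port in ports


SLURMDBD_DEFAULT_PORT = 6819  # default slurmdbd port, per the note above

new_ports = parse_port_range("6820-6829")
print(overlaps(new_ports, SLURMDBD_DEFAULT_PORT))  # False

# Hypothetical range chosen only to show what a collision looks like.
print(overlaps(parse_port_range("6817-6826"), SLURMDBD_DEFAULT_PORT))  # True
```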
BUG FIXES
- Fix ephemeral drives setup to avoid failures when partition changes require a reboot.
- Fix Chrony service management.
- Retrieve the right number of compute instance slots when instance type is updated.
- Fix compute fleet status initialization to be configured before daemons are started by supervisord.