[AD] Increasing the attempt and sleep time when we create AD Domain #6617

Open
wants to merge 5 commits into
base: release-3.12
10 changes: 5 additions & 5 deletions cloudformation/ad/ad-integration.yaml
@@ -416,7 +416,7 @@ Resources:
IamInstanceProfile:
Ref: JoinProfile
ImageId: !Ref AdminNodeAmiId
-InstanceType: t3.micro
+InstanceType: t3.xlarge
KeyName: !Ref Keypair
LaunchTemplate:
LaunchTemplateId: !Ref 'DisableImdsv1LaunchTemplate'
@@ -440,25 +440,25 @@ Resources:
echo "Domain Certificate Secret: ${DomainCertificateSecretArn}"
echo "Domain Private Key Secret: ${DomainPrivateKeySecretArn}"

-mkdir -p /etc/systemd/resolved.conf.d
-cat << EOF > /etc/systemd/resolved.conf.d/pcluster-ad-domain-dns-server.conf
+cat << EOF > /etc/systemd/resolved.conf
[Resolve]
DNS=${DnsIp1} ${DnsIp2}
Domains=~.
EOF
+sudo rm /usr/lib/systemd/resolved.conf.d/resolved-disable-stub-listener.conf
service systemd-resolved restart

ADMIN_PW="${AdminPassword}"

attempt=0
-max_attempts=5
+max_attempts=8
until [ $attempt -ge $max_attempts ]; do
attempt=$((attempt+1))
echo "[DEBUG] Checking domain name resolution for ${DirectoryDomain} ..."
dig ${DirectoryDomain}
echo "Joining domain (attempt $attempt/$max_attempts) ..."
echo "$ADMIN_PW" | sudo realm join -U "${Admin}" "${DirectoryDomain}" --verbose && echo "Domain joined" && break
-sleep 10
+sleep 12
done

sleep 10
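
Review note: the loop above sleeps after every failed `realm join`, so the time spent sleeping when every attempt fails rises from 5 × 10 = 50 seconds to 8 × 12 = 96 seconds. A minimal sketch of the same retry pattern as a reusable function follows. The names `retry` and `max_sleep_seconds` are hypothetical and are not part of this PR.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration only; they are not part of this PR.

# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached, sleeping
# DELAY_SECONDS after every failed attempt (same behavior as the loop in the diff).
retry() {
  local max_attempts=$1 delay=$2
  shift 2
  local attempt=0
  until [ "$attempt" -ge "$max_attempts" ]; do
    attempt=$((attempt + 1))
    echo "Attempt $attempt/$max_attempts: $*" >&2
    if "$@"; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}

# max_sleep_seconds MAX_ATTEMPTS DELAY_SECONDS
# Prints the total time spent sleeping when every attempt fails.
max_sleep_seconds() {
  echo $(( $1 * $2 ))
}

max_sleep_seconds 5 10   # old values: prints 50
max_sleep_seconds 8 12   # new values: prints 96
```

Because `retry` runs a single command, the piped `echo "$ADMIN_PW" | sudo realm join ...` call would first need to be wrapped in a function of its own before being passed to it.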
2 changes: 1 addition & 1 deletion tests/integration-tests/tests/trainium/test_trainium.py
@@ -42,7 +42,7 @@ def test_trainium(


def _test_allreduce_single_node(test_datadir, remote_command_executor, scheduler_commands):
-result = scheduler_commands.submit_script(str(test_datadir / "neuron-allreduce.sh"), partition="queue-trn2")
+result = scheduler_commands.submit_script(str(test_datadir / "neuron-allreduce.sh"), partition="queue-trn32")
job_id = scheduler_commands.assert_job_submitted(result.stdout)
scheduler_commands.wait_job_completed(job_id)
scheduler_commands.assert_job_succeeded(job_id)
@@ -1,3 +1,6 @@
+DevSettings:
+Timeouts:
+HeadNodeBootstrapTimeout: 2400
Image:
Os: {{ os }}
HeadNode:
@@ -18,13 +21,16 @@ Scheduling:
Scheduler: slurm
SlurmQueues:
- Name: queue-trn32
+CapacityType: CAPACITY_BLOCK
ComputeResources:
- Name: compute-resource-trn32
-Instances:
-- InstanceType: {{instance}}
+InstanceType: {{instance}}
MinCount: 2
MaxCount: 2
Efa:
Enabled: true
+CapacityReservationTarget:
+CapacityReservationId: cr-05b0c099ce2534ce3
Networking:
SubnetIds:
- {{ private_subnet_id }}
@@ -42,24 +48,24 @@ Scheduling:
- BucketName: {{ bucket_name }}
# Needed to download neuronx packages and neff file --> FIXME to be removed once packages are public available
- BucketName: aws-parallelcluster-beta
-- Name: queue-trn2
-ComputeResources:
-- Name: compute-resource-trn2
-Instances:
-- InstanceType: trn1.2xlarge
-MinCount: 0 # TODO change to 1 once allreduce test is passing
-Networking:
-SubnetIds:
-- {{ private_subnet_id }}
-CustomActions:
-OnNodeConfigured:
-Script: s3://{{ bucket_name }}/neuron-installation.sh
-Iam:
-# Policy to access to Trainium beta repository info
-AdditionalIamPolicies:
-- Policy: arn:aws:iam::447714826191:policy/TrainiumPreviewPolicy
-S3Access:
-# Needed to download post install script
-- BucketName: {{ bucket_name }}
-# Needed to download neuronx packages and neff file --> FIXME to be removed once packages are public available
-- BucketName: aws-parallelcluster-beta
+# - Name: queue-trn2
+# ComputeResources:
+# - Name: compute-resource-trn2
+# Instances:
+# - InstanceType: trn1.2xlarge
+# MinCount: 0 # TODO change to 1 once allreduce test is passing
+# Networking:
+# SubnetIds:
+# - {{ private_subnet_id }}
+# CustomActions:
+# OnNodeConfigured:
+# Script: s3://{{ bucket_name }}/neuron-installation.sh
+# Iam:
+# # Policy to access to Trainium beta repository info
+# AdditionalIamPolicies:
+# - Policy: arn:aws:iam::447714826191:policy/TrainiumPreviewPolicy
+# S3Access:
+# # Needed to download post install script
+# - BucketName: {{ bucket_name }}
+# # Needed to download neuronx packages and neff file --> FIXME to be removed once packages are public available
+# - BucketName: aws-parallelcluster-beta