
Some Helix VMs are having trouble installing the Helix custom script extension due to authorization errors #3382

Closed
3 tasks
riarenas opened this issue Jul 11, 2024 · 2 comments
Labels: Ops - First Responder; RCA Requested (A Root Cause Analysis (RCA) should be completed once this issue has been resolved.)

Comments

riarenas commented Jul 11, 2024

We are seeing issues in some queues where the machines are being provisioned in Azure, but they are failing to install and run the custom script extension that makes them Helix machines, so we're not seeing any heartbeats for the affected queues.

See https://resources.azure.com/subscriptions/f8c1f536-2a9b-41ba-9868-811cc982bb25/resourceGroups/ubuntu.2204.amd64.android.29.open.rt-westus-v2-rg/providers/Microsoft.Compute/virtualMachineScaleSets/ubuntu.2204.amd64.android.29.open.rt-a-scaleset/virtualMachines/34806/instanceView for an example.

...
"extensions": [
    {
      "name": "ubuntu.2204.amd64.android.29.open.rt-a-extension",
      "type": "Microsoft.Azure.Extensions.CustomScript",
      "typeHandlerVersion": "2.1.10",
      "statuses": [
        {
          "code": "ProvisioningState/failed/0",
          "level": "Error",
          "displayStatus": "Provisioning failed",
          "message": "Enable failed: processing file downloads failed: failed to download file[0]: failed to download response and write to file: /var/lib/waagent/custom-script/download/2/ubuntu.2204.amd64.android.29.open.rt.zip: CustomScript failed to download the file from helixosconfig.blob.core.windows.net because the server returned a response code and message of \"403 Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.\" Please verify the machine has network connectivity. (Service request ID: 7d85751d-701e-0032-7397-d3960d000000)"

This was initially reported in the First Responders channel for the ubuntu.2204.amd64.android.29.open[.rt] queues. Both the regular and the runtime-specific queues are affected.

We need to:

  • Determine the impact. I suspect other queues are also affected.
  • Figure out what is causing the 403s and fix it.
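
As a rough sketch of how the impact check could be scripted: walk a VM's instance view and collect every extension status reported at the Error level. The helper name `failed_extension_statuses` is hypothetical, and the code assumes the `extensions` array sits at the top level of the instance view, as the excerpt above suggests. Fetching the instance views for each scale set is left out.

```python
import re


# Hypothetical helper. Field names mirror the instance-view excerpt in this
# issue; the top-level position of "extensions" is an assumption.
def failed_extension_statuses(instance_view):
    """Return one record per extension status whose level is "Error"."""
    failures = []
    for ext in instance_view.get("extensions", []):
        for status in ext.get("statuses", []):
            if status.get("level") != "Error":
                continue
            message = status.get("message", "")
            # Pull the HTTP status out of text like: message of "403 Server failed ..."
            match = re.search(r'message of "(\d{3}) ', message)
            failures.append({
                "extension": ext.get("name"),
                "code": status.get("code"),
                "http_status": int(match.group(1)) if match else None,
            })
    return failures


# Sample trimmed from the excerpt in this issue (message shortened).
sample = {
    "extensions": [
        {
            "name": "ubuntu.2204.amd64.android.29.open.rt-a-extension",
            "type": "Microsoft.Azure.Extensions.CustomScript",
            "statuses": [
                {
                    "code": "ProvisioningState/failed/0",
                    "level": "Error",
                    "message": 'the server returned a response code and '
                               'message of "403 Server failed to authenticate the request."',
                }
            ],
        }
    ]
}

for failure in failed_extension_statuses(sample):
    print(failure)
```

Running this over the instance views of each queue's scale set and grouping by `http_status` would show which queues are hitting the same 403.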

Release Note Category

  • Feature changes/additions
  • Bug fixes
  • Internal Infrastructure Improvements

Release Note Description

@arkalyanms

Could we post updates on this one, with ETAs, either here or in the FR channel? This is causing 3x delays in the ML pipelines.

riarenas added the RCA Requested label (A Root Cause Analysis (RCA) should be completed once this issue has been resolved.) on Jul 15, 2024

riarenas commented Jul 19, 2024

We have mitigated all the issues that were contributing to the reduced capacity in Helix, so we are now closing this tracking issue. An RCA is coming soon and will be available in #3522.
