| title | AzureML Validation Job Debugging |
|---|---|
| sidebar_label | Validation Job Debugging |
| sidebar_position | 2 |
| description | Troubleshooting guide for AzureML validation job failures and common issues. |
| author | Microsoft Robotics-AI Team |
| ms.date | 2026-03-12 |
| ms.topic | troubleshooting |
Date: December 3, 2025
Branch: feat/azureml-job-support
Status: 🔄 In Progress - Blocked on Storage Authentication
Objective: enable the policy validation workflow, in which osmorobo-validate.sh validates trained IsaacLab policies registered in Azure Machine Learning.
After successfully training an IsaacLab policy using osmorobo-submit.sh, the user attempted to validate the trained policy using osmorobo-validate.sh. The validation script submits an AzureML command job that:
- Takes a registered model as input (isaaclab-anymal-latest:2)
- Runs validation episodes using the IsaacLab container
- Reports success/failure metrics
When first submitting the validation job, AzureML rejected the job YAML with schema validation errors:
ValidationError: The value 'string' of input type is not valid for Command job.
Supported types are ['uri_file', 'uri_folder', 'mlflow_model', 'custom_model', 'mltable', 'triton_model']
Cause: The isaaclab-validate.yaml template was using typed literal inputs (type: string, type: integer) which are valid for AzureML Pipeline jobs but NOT for Command jobs. Command jobs expect literal inputs as simple key-value pairs without type declarations.
After fixing the input types, a second error occurred:
ValidationError: Empty string is not allowed for input value
Cause: The YAML had task: "" and framework: "" as default values. AzureML doesn't allow empty strings for input values.
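Both schema errors above can be caught before submission. The following is a minimal sketch, assuming the job YAML has already been parsed into a Python dict; the function name `find_input_problems` and the sample `path` value are illustrative and not part of the original scripts. The set of accepted types is copied from the ValidationError message quoted earlier.

```python
# Asset types accepted for typed inputs, copied from the ValidationError text.
ASSET_TYPES = {
    "uri_file", "uri_folder", "mlflow_model",
    "custom_model", "mltable", "triton_model",
}


def find_input_problems(inputs):
    """Return human-readable problems found in a command job `inputs` mapping."""
    problems = []
    for name, value in inputs.items():
        if isinstance(value, dict):
            # Typed input: only asset types are accepted for command jobs.
            input_type = value.get("type")
            if input_type not in ASSET_TYPES:
                problems.append(
                    f"{name}: type {input_type!r} is not valid for a command job"
                )
        elif value == "":
            # Literal input: empty strings are rejected at submission time.
            problems.append(f"{name}: empty string is not allowed")
    return problems


if __name__ == "__main__":
    # Hypothetical inputs resembling the template before the fix.
    before = {
        "task": {"type": "string", "default": ""},
        "framework": "",
        "model": {"type": "custom_model", "path": "azureml:isaaclab-anymal-latest:2"},
    }
    for problem in find_input_problems(before):
        print(problem)
```

A check like this could run in the submit script before calling AzureML, so schema mistakes fail locally rather than after a round trip to the service.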
After fixing the YAML schema, the job submitted successfully but failed at runtime:
Error Code: ScriptExecution.StreamAccess.Authentication
permission denied when access stream
Cause: The AzureML extension's data-capability sidecar couldn't authenticate to Azure Blob Storage to mount/download the registered model. This is due to the storage account having allowSharedKeyAccess = false and workload identity not being properly configured.
Problem: Command job doesn't support typed literal inputs.
Investigation: Researched AzureML command job YAML schema using Microsoft documentation. Discovered that command jobs use a different input format than pipeline jobs.
Fix: Changed from:
inputs:
  task:
    type: string
    default: ""
To:
inputs:
  task: auto # Simple key-value, no type declaration
Also changed model input from type: mlflow_model to type: custom_model since the trained checkpoint isn't MLflow format.
Problem: AzureML rejects empty strings as input values.
Fix: Changed empty strings to sentinel values:
- task: "" → task: auto
- framework: "" → framework: auto
The validation script handles auto as "detect from model metadata".
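The sentinel handling can be sketched as follows. This is only an illustration in Python: the actual validation script is a shell script, and the helper name `resolve_setting`, the metadata dict layout, and the sample task name are assumptions rather than details from the original.

```python
AUTO = "auto"  # sentinel used in the job YAML instead of an empty string


def resolve_setting(name, value, metadata):
    """Return `value`, or look `name` up in `metadata` when `value` is the sentinel."""
    if value != AUTO:
        # An explicit value always wins over detection.
        return value
    if name not in metadata:
        raise KeyError(f"cannot auto-detect {name!r}: not present in model metadata")
    return metadata[name]


if __name__ == "__main__":
    # Hypothetical metadata attached to the registered model.
    metadata = {"task": "example-task", "framework": "example-framework"}
    print(resolve_setting("task", "auto", metadata))
    print(resolve_setting("framework", "explicit-framework", metadata))
```

Failing loudly when the metadata lacks the key avoids silently running validation with an unintended task.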
Problem: Jobs fail with ScriptExecution.StreamAccess.Authentication when trying to access the model from blob storage.
Investigation Steps:
- Checked ML Identity Role Assignments:
  az role assignment list --assignee $ML_PRINCIPAL_ID --scope $STORAGE_ID
  Result: Storage Blob Data Contributor ✅ assigned
- Checked Federated Identity Credentials:
  az identity federated-credential list --identity-name id-ml-osmorobo-tst-001
  Result: Empty [] - No federated credentials existed!
- Checked Terraform State:
  terraform state list | grep federated
  Result: No matches - federated credentials weren't being created
- Root Cause Discovery: The azureml_config object in main.tf was missing:
  - should_install_extension (defaults to false)
  - should_federate_ml_identity (defaults to false)
  This caused the federated credential resources to have count = 0.
Action: Manually created federated credentials via Azure CLI:
AKS_OIDC_ISSUER=$(az aks show --name aks-osmorobo-tst-001 ... --query oidcIssuerProfile.issuerUrl -o tsv)
az identity federated-credential create \
--name "aml-default-fic" \
--identity-name "id-ml-osmorobo-tst-001" \
--issuer "$AKS_OIDC_ISSUER" \
--subject "system:serviceaccount:azureml:default" \
--audiences "api://AzureADTokenExchange"
Result: Credentials created, but jobs still failed.
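The subject of a federated credential must match the Kubernetes service account exactly, in the form system:serviceaccount:<namespace>:<name>. The sketch below builds the Azure CLI argument list for one service account, using only flags that appear in the commands in this document. The helper names and the `<AKS_OIDC_ISSUER>` placeholder are illustrative assumptions.

```python
import shlex


def federated_subject(namespace, service_account):
    """Build the subject string for a Kubernetes service account."""
    return f"system:serviceaccount:{namespace}:{service_account}"


def build_fic_command(name, identity_name, resource_group, issuer,
                      namespace, service_account):
    """Return the az CLI arguments that create one federated credential."""
    return [
        "az", "identity", "federated-credential", "create",
        "--name", name,
        "--identity-name", identity_name,
        "--resource-group", resource_group,
        "--issuer", issuer,
        "--subject", federated_subject(namespace, service_account),
        "--audiences", "api://AzureADTokenExchange",
    ]


if __name__ == "__main__":
    # One credential per service account that AzureML job pods may run as.
    for fic_name, account in [("aml-default-fic", "default"),
                              ("aml-training-fic", "training")]:
        command = build_fic_command(
            fic_name, "id-ml-osmorobo-tst-001", "rg-osmorobo-tst-001",
            "<AKS_OIDC_ISSUER>",  # placeholder: substitute the cluster's issuer URL
            "azureml", account,
        )
        print(shlex.join(command))
```

Generating the commands from one function keeps the subject format consistent across service accounts, which matters because a mismatch in namespace or account name causes the token exchange to be rejected.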
Investigation:
az k8s-extension show ... --query "configurationSettings.identityType"
Result: null - Extension wasn't configured with user-assigned identity!
Action: Updated extension:
az k8s-extension update \
  --configuration-settings \
  "identityType=UserAssigned" \
  "userAssignedIdentityResourceId=..."
Result: Extension updated, but jobs still failed.
Investigation: Checked if service accounts had workload identity annotations:
kubectl get serviceaccount -n azureml default -o yaml
Result: No azure.workload.identity annotations.
Action: Added annotations:
kubectl annotate serviceaccount -n azureml default \
azure.workload.identity/client-id="$ML_CLIENT_ID" --overwrite
kubectl label serviceaccount -n azureml default \
azure.workload.identity/use=true --overwrite
Result: Jobs still failed with same error.
Investigation: Examined data-capability container logs:
kubectl logs -n azureml $POD -c data-capability
Key Finding:
Failed to get symmetric key for getting AML token: 'failed to load certificate:
stat /tmp/azureml/cr/j/.../sha1-.pfx: no such file or directory'
This revealed that the data-capability container is trying to use certificate-based authentication rather than workload identity, despite all the correct configurations being in place.
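When repeating this check across several failed jobs, the log signature can be detected automatically. The sketch below is an assumption-laden helper, not part of the original tooling: it treats the combination of "failed to load certificate" and a .pfx path as evidence of the certificate-based code path, based solely on the log excerpt above.

```python
def uses_certificate_auth(log_text):
    """Return True if the log shows the sidecar failing to load a .pfx certificate."""
    # Match on the whole text because the message may be wrapped across lines.
    return "failed to load certificate" in log_text and ".pfx" in log_text


if __name__ == "__main__":
    sample = (
        "Failed to get symmetric key for getting AML token: "
        "'failed to load certificate:\n"
        "stat /tmp/azureml/cr/j/.../sha1-.pfx: no such file or directory'"
    )
    if uses_certificate_auth(sample):
        print("data-capability attempted certificate-based authentication")
```

The output of kubectl logs for the data-capability container can be passed to this function as a string, for example after capturing it with subprocess.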
Investigation: Tried to enable shared key access as a workaround:
az storage account update --name stosmorobotst001 --allow-shared-key-access true
Result: Setting remains false - controlled by Terraform configuration (should_enable_storage_shared_access_key = false by default).
Added missing fields to azureml_config to enable extension installation and workload identity federation:
azureml_config = {
  should_integrate_aks               = var.should_integrate_aks_cluster
  should_install_extension           = var.should_integrate_aks_cluster // NEW: Enable extension when integrating
  should_federate_ml_identity        = var.should_integrate_aks_cluster // NEW: Enable workload identity federation
  aks_cluster_purpose                = var.aks_cluster_purpose
  inference_router_service_type      = var.inference_router_service_type
  workload_tolerations               = var.workload_tolerations
  cluster_integration_instance_types = var.cluster_integration_instance_types
}
Impact: These fields were previously missing, causing federated identity credentials to not be created (count = 0).
Fixed input schema to comply with AzureML command job requirements:
| Change | Before | After |
|---|---|---|
| Model input type | type: mlflow_model | type: custom_model |
| Literal inputs | Had type: string, type: integer | Removed type declarations (simple key-value) |
| Empty strings | task: "" | task: auto (sentinel value) |
| Mount mode | mode: ro_mount | mode: download (attempted workaround) |
Rationale: AzureML command jobs don't support typed literal inputs like pipeline jobs do.
az identity federated-credential create \
--name "aml-default-fic" \
--identity-name "id-ml-osmorobo-tst-001" \
--resource-group "rg-osmorobo-tst-001" \
--issuer "https://westus3.oic.prod-aks.azure.com/..." \
--subject "system:serviceaccount:azureml:default" \
--audiences "api://AzureADTokenExchange"
az identity federated-credential create \
--name "aml-training-fic" \
--identity-name "id-ml-osmorobo-tst-001" \
--resource-group "rg-osmorobo-tst-001" \
--issuer "https://westus3.oic.prod-aks.azure.com/..." \
--subject "system:serviceaccount:azureml:training" \
--audiences "api://AzureADTokenExchange"

az k8s-extension update \
--cluster-name aks-osmorobo-tst-001 \
--cluster-type managedClusters \
--resource-group rg-osmorobo-tst-001 \
--name azureml-aks-osmorobo-tst-001 \
--configuration-settings \
"identityType=UserAssigned" \
"userAssignedIdentityResourceId=/subscriptions/.../id-ml-osmorobo-tst-001"

kubectl annotate serviceaccount -n azureml default \
azure.workload.identity/client-id="afbecdd1-1eb2-4fed-8043-b88a15f25154" --overwrite
kubectl label serviceaccount -n azureml default \
azure.workload.identity/use=true --overwrite

Error Code: ScriptExecution.StreamAccess.Authentication
permission denied when access stream. Reason: None
PermissionDenied(None)
Error Message: Authentication failed when trying to access the stream.
Root Cause Analysis:
| Factor | Status |
|---|---|
| Storage account allowSharedKeyAccess | false (security best practice) |
| ML Identity has Storage Blob Data Contributor | ✅ Verified |
| Federated Identity Credentials exist | ✅ Created |
| AzureML Extension has identityType=UserAssigned | ✅ Configured |
| Data-capability using workload identity | ❌ Not working |
The AzureML extension's data-capability sidecar container is not properly authenticating to Azure Blob Storage using workload identity federation. Despite all the correct configurations being in place, the token exchange isn't happening.
Failed to get symmetric key for getting AML token: 'failed to load certificate:
stat /tmp/azureml/cr/j/.../sha1-.pfx: no such file or directory'
This suggests the data-capability container is still trying to use certificate-based auth rather than workload identity.
| Job Name | Status | Error |
|---|---|---|
| sharp_pen_l66lnkmpfy | Failed | Permission denied (before FIC creation) |
| tough_brick_y9qw8npw1m | Failed | Permission denied (after FIC creation) |
| olden_table_61c5q9vx4k | Failed | Permission denied (after SA annotation) |
| frosty_arch_jtzh3fky15 | Failed | Permission denied (after extension update) |
| blue_cabbage_qjqy4kv8y9 | Failed | Permission denied (with download mode) |
Option 1 (enable shared key access): Add to terraform.tfvars:
should_enable_storage_shared_access_key = true
Then run:
cd infrastructure/terraform
terraform apply -var-file=terraform.tfvars
- Pros: Quick, will unblock validation workflow
- Cons: Reduces security posture (shared keys are less secure than managed identity)
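Option 2 (download the model with the SDK): Modify validate.sh to use the Python AzureML SDK to download the model within the container, which can leverage the pod's MSI endpoint:
import os

from azure.ai.ml import MLClient
from azure.identity import ManagedIdentityCredential

# Workspace coordinates; these environment variable names are placeholders (assumptions).
subscription_id = os.environ["AZURE_SUBSCRIPTION_ID"]
resource_group = os.environ["AZURE_RESOURCE_GROUP"]
workspace_name = os.environ["AZUREML_WORKSPACE_NAME"]

credential = ManagedIdentityCredential()
ml_client = MLClient(credential, subscription_id, resource_group, workspace_name)
ml_client.models.download(name="isaaclab-anymal-latest", version="2", download_path="/mnt/model")
- Pros: Maintains security posture
- Cons: Requires code changes, more complex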
Option 3 (file a Microsoft support ticket): The workload identity integration with the AzureML Kubernetes extension may have a bug or may require additional, undocumented configuration.
- Pros: Proper fix
- Cons: Time-consuming, may take days/weeks
Option 4 (use managed compute): Submit jobs to AzureML managed compute (e.g., gpu-cluster) instead of the attached Kubernetes compute.
- Pros: Simpler authentication model
- Cons: Loses GPU node pool flexibility, may need to provision separate GPU compute
- Immediate: Implement Option 1 (enable shared key access) to unblock validation testing
- Short-term: File Microsoft support ticket for workload identity + AzureML extension issue
- Medium-term: Implement Option 2 as a more secure long-term solution
- Documentation: Update deployment docs to note this limitation
| File | Change Type |
|---|---|
| infrastructure/terraform/main.tf | Added should_install_extension, should_federate_ml_identity |
| workflows/azureml/validate.yaml | Fixed input schema, changed mount to download |