
Leverage Deployment Stacks for idempotency #30

@arnaudlh

Description


Summary

The current destroy flow in .github/workflows/git-ape-destroy.exampleyml primarily deletes a single resource group (az group delete), plus a narrow sweep for subscription-scope Microsoft.Authorization/roleAssignments and Microsoft.Authorization/policyAssignments resources discovered via az deployment operation sub list.

This works for the single-RG Key Vault template we ship today, but it is not idempotent once a deployment spans more than one resource group, creates subscription/MG-scope resources via nested deployments, or creates soft-deletable services (Key Vault, APIM, Log Analytics, App Configuration, Cognitive Services, Recovery Services, ML workspace, …).

Observed concretely after running @git-ape destroy deployment deploy-20260423-092136 (single-RG Key Vault with purge protection): the RG is gone but the Key Vault remains soft-deleted at subscription scope for 90 days and cannot be purged (purge protection enabled). Re-running the exact same template will fail with VaultAlreadyExists until retention expires — destroy + redeploy is not idempotent.
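The soft-delete check described above can be sketched as a small helper. This is a sketch only; the function name and the `kv-foo` / `westeurope` placeholders are illustrative, and a real run would execute the `az keyvault purge` command rather than print it:

```shell
#!/usr/bin/env bash
# Sketch: decide whether a soft-deleted Key Vault can be purged on destroy.
# Soft-deleted vaults can be enumerated with:
#   az keyvault list-deleted --query "[].name" -o tsv
handle_soft_deleted_kv() {
  local kv_name="$1" location="$2" purge_protected="$3"
  if [ "$purge_protected" = "true" ]; then
    # Purge protection blocks az keyvault purge until retention expires,
    # so the best we can do is record the retained vault in the summary.
    echo "retained-soft-deleted: $kv_name (purge blocked by purge protection)"
  else
    # A real run would execute this instead of printing it.
    echo "az keyvault purge --name $kv_name --location $location"
  fi
}
```

With `purge_protected=true` (the case observed above) the vault is reported as retained; only unprotected vaults produce a purge command.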

Orphan categories a "delete the RG" strategy can leave behind

| # | Category | Example |
|---|----------|---------|
| 1 | Soft-deleted data services | Key Vault, APIM, Cognitive Services, App Configuration, Log Analytics workspace, Recovery Services vault, ML workspace |
| 2 | Purge-protected resources | Key Vault with enablePurgeProtection: true |
| 3 | Multiple resource groups | Template creates rg-app + rg-data; only one is tracked in state.resourceGroup |
| 4 | Subscription-scope role assignments created via nested deployments | Not always enumerable through az deployment operation sub list |
| 5 | Subscription-scope policy assignments / definitions / exemptions | Same as above |
| 6 | Management-group-scope resources | Custom policies, role assignments at MG scope |
| 7 | Cross-RG resources from nested deployments | VNet peering in a hub RG, DNS record in a shared DNS RG, secret in a shared KV |
| 8 | Cross-subscription nested deployments | Destroy runs against one subscription only |
| 9 | Tenant / Entra ID objects | App registrations, directory groups |
| 10 | Backup protected items / recovery points in cross-RG Recovery Services vaults | Survive source-RG delete |
| 11 | Subscription-level diagnostic settings | microsoft.insights/diagnosticSettings at sub scope |
| 12 | Subscription budgets & cost alerts | Microsoft.Consumption/budgets |
| 13 | Resource locks | Don't orphan, but block delete and leave partial state |
| 14 | Remote-side references | Approved Private Endpoint connections on a shared service, remote VNet peerings, DNS records in shared zones |
| 15 | Subscription deployment-history entries | Accumulate toward the 800/scope limit |

Proposed approach — two layers

Layer A — Azure Deployment Stacks (primary, for new deployments)

Deployment Stacks natively track every resource in a deployment regardless of scope.

  • Replace az deployment sub create with az stack sub create --action-on-unmanage deleteAll --deny-settings-mode denyDelete in git-ape-deploy.exampleyml.
  • Stack name = deployment id; store it in state.stackId.
  • Destroy becomes a single az stack sub delete --action-on-unmanage deleteAll, covering multi-RG, sub-scope, and MG-scope uniformly.
  • Remaining gaps to handle explicitly: soft-delete purge (1, 2) and remote-side references (14) — stacks don't handle either.
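The proposed command substitution could look like the sketch below. It only assembles the command strings so the shape is easy to review; the template filename `main.bicep` and the `westeurope` location are illustrative placeholders, not values from this repo:

```shell
#!/usr/bin/env bash
# Sketch of the stack-based deploy/destroy commands proposed for Layer A.
build_stack_cmd() {
  local action="$1" deployment_id="$2"
  case "$action" in
    create)
      # Replaces "az deployment sub create" in git-ape-deploy.exampleyml.
      echo "az stack sub create --name $deployment_id --location westeurope" \
           "--template-file main.bicep" \
           "--action-on-unmanage deleteAll --deny-settings-mode denyDelete"
      ;;
    delete)
      # Destroy collapses to a single command regardless of how many
      # RGs / scopes the deployment touched.
      echo "az stack sub delete --name $deployment_id" \
           "--action-on-unmanage deleteAll --yes"
      ;;
  esac
}
```

Because the stack name equals the deployment id stored in state.stackId, destroy needs no resource enumeration at all in the stack path.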

Layer B — State-driven fallback (retrofits existing + legacy deployments)

For pre-stack deployments and cases where stacks can't be used:

  1. Capture-at-deploy: walk the deployment-operation graph recursively (root + every nested op) and emit a flat list of every targetResource.id into state.managedResources[] with {id, type, scope, apiVersion, softDeletable, purgeProtected}. Also populate state.resourceGroups[], state.subscriptions[], state.externalReferences[], state.stackId (nullable).
  2. Destroy algorithm (idempotent):
    1. If stackId present → az stack sub delete; skip to step 7.
    2. Topologically sort managedResources[] (locks → role/policy assignments → children → parents → RGs).
    3. For each resource: az resource show → if 404 mark already-gone; else delete; retry transient.
    4. For each RG in resourceGroups[]: az group delete --yes.
    5. For each softDeletable[] entry: list soft-deleted → purge if purgeProtected=false, else record retained-soft-deleted with expiry date.
    6. Probe externalReferences[] for remote-side leftovers (stale PE connections, peerings, DNS records).
    7. Delete subscription deployment-history entry for the deployment.
    8. Write terminal status with per-resource outcome. Re-runs converge to the same end state.
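Step 3 of the algorithm is the core of the idempotency guarantee: a 404 on probe is recorded as success, so re-runs converge. A minimal sketch, with `az_show` / `az_delete` as thin wrappers so a dry run or test can stub them out (a real run would call `az resource show --ids` / `az resource delete --ids`):

```shell
#!/usr/bin/env bash
# Sketch of the per-resource delete step from the fallback algorithm.
az_show()   { az resource show --ids "$1" >/dev/null 2>&1; }
az_delete() { az resource delete --ids "$1"; }

destroy_resource() {
  local id="$1"
  if ! az_show "$id"; then
    echo "already-gone: $id"       # 404 counts as success (idempotent)
  elif az_delete "$id"; then
    echo "deleted: $id"
  else
    echo "failed: $id"             # caller retries transient failures
  fi
}
```

Iterating this over the topologically sorted managedResources[] gives the per-resource outcomes that step 8 writes into the terminal status.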

Proposed schema changes

Extend state.json:

{
  "stackId": "string | null",
  "managedResources": [
    { "id": "/subscriptions/.../Microsoft.KeyVault/vaults/foo",
      "type": "Microsoft.KeyVault/vaults",
      "scope": "resourceGroup",
      "apiVersion": "2024-11-01",
      "softDeletable": true,
      "purgeProtected": true }
  ],
  "resourceGroups": ["rg-app", "rg-data"],
  "subscriptions": ["<subId>"],
  "externalReferences": [
    { "kind": "privateEndpointConnection", "targetResourceId": "..." }
  ]
}

Extend metadata.json: resourceGroup (string) → resourceGroups (array). Add scope to allow subscription | managementGroup.
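A hypothetical metadata.json after this change might look like the following (field values are illustrative; `resourceGroup` would presumably remain readable for pre-migration deployments):

```json
{
  "resourceGroups": ["rg-app", "rg-data"],
  "scope": "subscription"
}
```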

Add new status values to docs/DEPLOYMENT_STATE.md: retained-soft-deleted, partially-destroyed.

Implementation phases

  • Phase 1 — Schema & state capture: extend state.json / metadata.json, update DEPLOYMENT_STATE.md, update azure-template-generator.agent.md / deploy agent to walk deployment operations after deploy and populate managedResources[].
  • Phase 2 — Deployment Stacks integration: add deployMethod toggle (default stack) in requirements gathering; stack create in git-ape-deploy.exampleyml; stack-delete branch in git-ape-destroy.exampleyml.
  • Phase 3 — Fallback hardening: extract destroy logic into .github/scripts/destroy.sh (or .ps1) implementing the idempotent algorithm above; add soft-delete purge loop + remote-reference probe.
  • Phase 4 — Validation: fixture deployment with 2 RGs + purge-protected KV + sub-scope role assignment + cross-RG reference; destroy → re-run destroy (must be already-destroyed); stack-vs-fallback parity; soft-delete replay (redeploy succeeds once retention allows).

Out of scope

  • Entra ID / app-registration cleanup (requires Graph permissions; separate issue).
  • Data-plane cleanup (KV secrets, blob contents — gone with control plane).
  • Management-group-scope deployments (noted but deferred).

Open questions for discussion

  1. Stacks opt-in or default? Recommend stack as the default for new deployments, keeping sub-deployment as an explicit fallback. Stacks are GA.
  2. Auto-purge non-protected soft-deleted resources? Recommend yes on destroy (never purge protected); surface both in the summary. Alternative: require an explicit --purge-soft-deleted flag.
  3. Clean up deployment-history entries after destroy? Recommend yes (to stay well below the 800/scope cap).
  4. Scope of this work: single issue or should each phase be split into its own issue once we align on direction?

Reproduction

  1. Deploy the included Key Vault + private endpoint template (.azure/deployments/deploy-20260423-092136).
  2. Run @git-ape destroy deployment deploy-20260423-092136.
  3. Observe: RG is deleted; Key Vault remains soft-deleted at subscription scope; purge protection prevents purge; redeploying with the same name fails until retention expires.

Happy to open a draft PR for Phase 1 (schema + capture) as the foundation once we align on the two-layer direction.

Labels: enhancement (New feature or request)