Fix CD pipeline: split stages + git tag-based deployment tracking #233

Merged
mohamedmansour merged 3 commits into main from fix/cd-pipeline-stage-split
Apr 14, 2026
@mohamedmansour mohamedmansour commented Apr 14, 2026

Problem

The hourly CD pipeline took ~30 minutes per run even when no deployment was needed. Two issues:

  1. 1ES SDL overhead on every run — The pipeline had a single Package stage. Even when the Deploy job was skipped (version already published), the 1ES Official Pipeline Template still injected and ran all SDL/compliance tasks (CredScan, PoliCheck, BinSkim, Windows source analysis agent) at the stage level.

  2. Registry checks unreachable from 1ES agents — The npm and crates.io HTTP checks returned HTTP 000 (connection failure) due to firewall/proxy restrictions, falsely triggering deployments against already-published versions.

Solution

1. Split into two stages

# Before: one stage — SDL overhead always runs (~30 min)
- stage: Package
  jobs:
  - job: NeedsDeployment   # ~1 min check
  - job: Deploy             # conditional, often skipped
  # 1ES SDL tasks still execute for the stage

# After: two stages — heavyweight stage skipped entirely when not needed
- stage: Check              # Lightweight version check (~1-2 min)
  jobs:
  - job: CheckVersion

- stage: Package            # Entirely skipped when needsDeployment=false
  dependsOn: Check
  condition: eq(dependencies.Check.outputs[...], 'true')
  jobs:
  - job: Deploy

When Package is skipped at the stage level, Azure Pipelines does not provision agents or run any 1ES-injected SDL tasks.
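
For the stage condition to see a value, the check job has to publish `needsDeployment` as an output variable. A minimal sketch of that step, assuming a bash script step and the Azure Pipelines `task.setvariable` logging command; the helper name `set_pipeline_output` is hypothetical and not taken from the pipeline:

```shell
# Hypothetical helper: publish a job output variable from a script step.
# isOutput=true is what lets a later stage's condition read the value.
set_pipeline_output() {
  printf '##vso[task.setvariable variable=%s;isOutput=true]%s\n' "$1" "$2"
}

# Example: report that the current version is already deployed.
set_pipeline_output needsDeployment false
```

The agent picks the command up from the step's stdout. The step also needs a `name:` in the YAML, because output variables are addressed through the name of the step that set them.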

2. Replace registry HTTP checks with git tag tracking

# Before: unreliable curl to external registries (HTTP 000 on 1ES agents)
curl ... "https://registry.npmjs.org/@microsoft%2Fwebui/${VERSION}"
curl ... "https://crates.io/api/v1/crates/microsoft-webui/${VERSION}"

# After: check for deployed/* git tag (zero external network needed)
git fetch --tags --quiet
if git tag -l "deployed/v${VERSION}" | grep -q .; then
  echo "Already deployed"    # needsDeployment=false
fi

After a successful deployment, the pipeline pushes a deployed/vX.Y.Z tag:

git tag "deployed/v${VERSION}"
git push origin "deployed/v${VERSION}"

How it works

  1. The Check stage downloads the latest GitHub release and extracts the version from the .crate filename (for example, 0.0.8)
  2. It checks whether the matching deployed/v0.0.8 tag exists in the repo
  3. Tag exists → needsDeployment=false → Package stage entirely skipped
  4. Tag missing → needsDeployment=true → Package stage runs deployment
  5. After a successful deploy, the Deploy job pushes the deployed/v0.0.8 tag
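
Steps 1 to 4 can be sketched as two small shell functions. This is an illustration under stated assumptions rather than the pipeline's actual script: the function names are hypothetical, and the archive is assumed to follow cargo's `<crate>-<version>.crate` naming, with the crate name `microsoft-webui` taken from the crates.io URL above.

```shell
# Hypothetical helper: recover the version from a cargo archive name,
# e.g. microsoft-webui-0.0.8.crate -> 0.0.8
version_from_crate() {
  crate_name=$1
  file=$(basename "$2")
  file=${file%.crate}                       # drop the extension
  printf '%s\n' "${file#"${crate_name}-"}"  # drop the "<crate>-" prefix
}

# Prints "false" when the deployed/v<version> tag is present, else "true".
# Assumes `git fetch --tags` has already run in this checkout.
needs_deployment() {
  if git tag -l "deployed/v$1" | grep -q .; then
    echo "false"
  else
    echo "true"
  fi
}
```

Stripping the known crate prefix, rather than splitting on the last hyphen, keeps pre-release versions such as `0.0.9-beta.1` intact.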

Impact

| Metric | Before | After |
| --- | --- | --- |
| Hourly no-op run time | ~30 min | ~1-2 min |
| External network dependency | npm + crates.io (broken) | None |
| False deployment triggers | Every run | None |
| Deploy history audit | Check pipeline logs | `git tag -l "deployed/*"` |

Bootstrap

After merging, push the tag for the current version to prevent a re-deploy:

git tag deployed/v0.0.8
git push origin deployed/v0.0.8

## Problem

The hourly CD pipeline took ~30 minutes even when no deployment was
needed. The 1ES Official Pipeline Template injects SDL/compliance tasks
(CredScan, PoliCheck, BinSkim, Windows source analysis agent) **at the
stage level**. With everything in a single `Package` stage, these tasks
ran every hour regardless of whether the `Deploy` job was skipped.

## Root cause

```yaml
# Before: one stage, two jobs
- stage: Package
  jobs:
  - job: NeedsDeployment   # ~1 min check
  - job: Deploy             # conditional, often skipped
  # But 1ES SDL tasks still run for the stage → ~30 min overhead
```

The `Deploy` job condition correctly evaluated to `false`, but the
stage-level SDL overhead (Windows agent provisioning, compliance scans)
still executed every run.

## Fix

Split into two stages so the heavyweight stage is **entirely skipped**
when no deployment is needed:

```yaml
# After: two stages
- stage: Check              # Lightweight version check (~1-2 min)
  jobs:
  - job: CheckVersion       # Downloads release, checks npm + crates.io

- stage: Package            # Only runs when needsDeployment == true
  dependsOn: Check
  condition: eq(dependencies.Check.outputs[...], 'true')
  jobs:
  - job: Deploy             # Downloads artifacts and publishes
```

When `Package` is skipped at the stage level, Azure Pipelines does not
provision agents or run any 1ES-injected SDL tasks for it.

## Impact

Hourly no-op runs drop from **~30 minutes → ~1-2 minutes**.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@mohamedmansour mohamedmansour requested a review from janechu April 14, 2026 05:14
mohamedmansour and others added 2 commits April 13, 2026 22:46
## Problem

The registry checks falsely reported versions as "not yet deployed"
when the 1ES agent couldn't reach the registries:

- **npm**: Used `npm view ... 2>/dev/null` which silently swallows all
  errors. If npm isn't installed or the registry is unreachable, the
  empty output is treated as "not deployed".
- **crates.io**: `curl` returned `HTTP 000` (connection failure —
  DNS/proxy/firewall), and any non-200 was treated as "not deployed".

Both caused unnecessary deployment runs against already-published
versions.
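
The npm failure mode is easy to reproduce without a registry. In the sketch below, `unreachable_lookup` is a hypothetical stand-in for `npm view` on an agent with no route to the registry; the pipeline shape matches the original check.

```shell
# Stand-in for `npm view` on an agent that cannot reach the registry
# (hypothetical: it prints an error and fails, as a blocked CLI would).
unreachable_lookup() {
  echo "request failed: registry unreachable" >&2
  return 1
}

# Same shape as the original check: stderr discarded, result decided by grep.
looks_published() {
  unreachable_lookup 2>/dev/null | grep -qxF "$1"
}

if looks_published "0.0.8"; then
  echo "published"
else
  echo "not published"   # also reached when the lookup itself failed
fi
```

Because `grep` sees empty input either way, a network failure and a genuinely unpublished version produce the same exit status.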

## Fix

```bash
# Before (npm) — npm CLI dependency, errors silently swallowed
npm view "@microsoft/webui@${VERSION}" version 2>/dev/null | grep -qxF "${VERSION}"

# After (npm) — curl to registry HTTP API, no npm CLI needed
NPM_STATUS=$(check_registry "npm" "https://registry.npmjs.org/@microsoft%2Fwebui/${VERSION}")
```

```bash
# Before (crates.io) — no timeout, no retry, HTTP 000 = "not deployed"
curl -s -o /dev/null -w "%{http_code}" ...

# After (both) — shared helper with timeout, retry, and network error detection
check_registry() {
  curl ... --connect-timeout 10 --max-time 30 --retry 3 --retry-delay 5 ...
  if [ "$status" = "000" ]; then exit 1; fi  # fail, don't falsely trigger
}
```

Key changes:
- Replaced `npm view` with direct curl to `registry.npmjs.org` HTTP API
- Added `--connect-timeout 10 --max-time 30 --retry 3 --retry-delay 5`
- HTTP 000 now **fails the step** with an actionable error instead of
  silently triggering a deployment
- Added `set -euo pipefail` for strict error handling
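
Put together, a helper along these lines might look like the following. It is a sketch, not the PR's code: the `published`/`missing` strings and the choice to treat every status other than 200 and 404 as an error are assumptions, and it uses `return 1` rather than `exit 1` so the caller decides how to fail.

```shell
# Sketch of a registry probe in the spirit of check_registry.
# Prints "published" or "missing"; returns non-zero when the answer is unknown.
check_registry() {
  name=$1
  url=$2
  # -w prints only the status code; curl reports 000 when no HTTP response arrived.
  status=$(curl -s -o /dev/null -w '%{http_code}' \
    --connect-timeout 10 --max-time 30 --retry 3 --retry-delay 5 "$url") || true
  case "$status" in
    200) echo "published" ;;
    404) echo "missing" ;;
    *)
      echo "ERROR: ${name} check returned HTTP ${status:-000}; cannot tell whether the version is published" >&2
      return 1
      ;;
  esac
}
```

A caller would only trigger a deployment on an explicit `missing`, and fail the step on a non-zero return.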

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Problem

The 1ES agents cannot reach external registries (npm, crates.io) —
curl returns `HTTP 000` (connection failure) due to firewall/proxy
restrictions. This caused every hourly run to falsely detect versions
as "not yet deployed" and trigger unnecessary deployment attempts.

## Solution

Replace external registry checks with git tag-based tracking:

### Check stage — look for existing tag

```bash
# Before: unreliable curl to external registries
check_registry "npm" "https://registry.npmjs.org/..."
check_registry "crates.io" "https://crates.io/api/v1/..."

# After: check for deployed/* git tag (no external network needed)
git fetch --tags --quiet
if git tag -l "deployed/v${VERSION}" | grep -q .; then
  echo "Already deployed"    # needsDeployment=false
fi
```

### Deploy job — tag after successful deployment

```bash
# New step at end of Deploy job
DEPLOY_TAG="deployed/$(releaseTag)"   # e.g. deployed/v0.0.8
git tag "${DEPLOY_TAG}"
git push origin "${DEPLOY_TAG}"
```
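
If the tagging step is ever rerun, `git tag` fails when the tag already exists locally, and the push is rejected when the remote tag points at a different commit. A possible guard, sketched with a hypothetical `mark_deployed` helper and assuming the remote is named `origin` as in the commands above:

```shell
# Hypothetical helper: record a successful deployment exactly once.
# $1 is the release tag, e.g. v0.0.8
mark_deployed() {
  tag="deployed/$1"
  # --exit-code makes ls-remote return non-zero when no matching ref exists.
  if git ls-remote --exit-code --tags origin "refs/tags/${tag}" >/dev/null 2>&1; then
    echo "${tag} already exists on origin, nothing to do"
    return 0
  fi
  # -f replaces a stale local tag left over from an earlier, unpushed attempt.
  git tag -f "${tag}" && git push origin "refs/tags/${tag}"
}
```

Whether a rerun should silently succeed or fail loudly is a policy choice; this sketch opts for succeeding so that retries stay cheap.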

### How it works

1. Check stage extracts version from .crate filename (e.g. `0.0.8`)
2. Checks if `deployed/v0.0.8` tag exists in the repo
3. If tag exists → `needsDeployment=false` → Package stage skipped
4. If tag missing → `needsDeployment=true` → Package stage runs
5. After successful deploy, pushes `deployed/v0.0.8` tag

### Benefits

- Zero external network dependencies — works behind any firewall
- Atomic: tag only pushed after the release template succeeds
- Auditable: `git tag -l 'deployed/*'` shows full deploy history
- Partial deploy recovery: if deploy fails mid-way, no tag is pushed,
  so the next hourly run retries automatically

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@mohamedmansour mohamedmansour changed the title from "chore: Split CD pipeline into Check + Package stages to avoid 1ES overhead" to "Fix CD pipeline: split stages + git tag-based deployment tracking" Apr 14, 2026
@mohamedmansour mohamedmansour merged commit 29863fc into main Apr 14, 2026
26 checks passed
@mohamedmansour mohamedmansour deleted the fix/cd-pipeline-stage-split branch April 14, 2026 08:24