
Spec: phpboyscout/cicd v0.2 — tofu-plan / tofu-apply

  • Repository: gitlab.com/phpboyscout/cicd
  • Released as: v0.2.0 (minor — two new components, no change to the four v0.1 components' input shape).
  • First consumer: phpboyscout/infra — the src/security-baseline/ stack, applied via GitLab CI (GitLab migration spec Phase E).

Summary

v0.1 of phpboyscout/cicd shipped four gate components (lint, security, validate, pages) — none of which touch AWS. v0.2 adds the two components that actually drive infrastructure:

  • tofu-plan — runs tofu plan against a real AWS account and a real (GitLab-managed) state backend. Produces a reviewable plan artifact + a GitLab MR plan-widget report. Runs on branches / MRs.
  • tofu-apply — consumes the plan artifact and runs tofu apply. Manual-gated by default; runs on the default branch.

Both authenticate to AWS with no static credentials: GitLab CI mints an OIDC ID token, AWS's AssumeRoleWithWebIdentity exchanges it for short-lived credentials. This is the GitLab-side mirror of what GitHub Actions OIDC did before the migration, and the reason Phase D provisioned the gitlab.com OIDC IDP + phpboyscout-automation role in the AWS account.

Motivation

Phase D of the GitLab migration cut infra/bootstrap over to GitLab CI OIDC and moved its state to GitLab. But the gate components can't verify the OIDC chain: they initialise with tofu init -backend=false and then run tofu validate, which needs neither AWS nor the real state. The migration's whole point, applying infrastructure (security-baseline, future workload stacks) from GitLab CI, needs components that:

  1. Obtain AWS credentials via OIDC (no long-lived secrets in CI).
  2. Talk to the GitLab-managed HTTP state backend.
  3. Run plan / apply with the safety rails infrastructure demands (reviewable plan, manual apply gate, plan/apply consistency).

infra will accrue more stacks (src/<workload>/, modules/); each needs the same plan/apply flow. A reusable pair of components is the same call we made for the gate components — author once, version, Renovate-bump consumers.

Decisions

D1 — Two components, not one mode-switched component

tofu-plan and tofu-apply are separate templates. Rejected a single tofu component with a mode: plan|apply input because:

  • They need different default rules: plan runs on branches and MRs; apply runs on the default branch, manually.
  • tofu-apply needs: the tofu-plan job's artifact — a dependency a single component can't cleanly express on itself.
  • Separate templates make the consumer's .gitlab-ci.yml read as what it is: a plan stage and an apply stage.

D2 — AWS auth: OIDC ID token → AssumeRoleWithWebIdentity, via env vars

GitLab CI mints an ID token through the id_tokens: keyword:

id_tokens:
  AWS_OIDC_TOKEN:
    aud: sts.amazonaws.com

The component writes that token to a file and exports the two variables the AWS SDK's web-identity credential provider looks for (AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE), plus AWS_REGION:

echo "$AWS_OIDC_TOKEN" > "${CI_PROJECT_DIR}/.aws-oidc-token"
export AWS_ROLE_ARN="$[[ inputs.role_arn ]]"
export AWS_WEB_IDENTITY_TOKEN_FILE="${CI_PROJECT_DIR}/.aws-oidc-token"
export AWS_REGION="$[[ inputs.aws_region ]]"

OpenTofu's aws provider uses the standard AWS SDK credential chain; with those three set it performs the AssumeRoleWithWebIdentity exchange itself. No explicit aws sts call in the component — the provider does it, refreshes it, and the credentials are never written to disk beyond the short-lived ID token.
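As a hedged sketch, the same wiring can be hardened a little: fail fast if the ID token was not minted, write the token file owner-only, and delete it once the job ends. The hidden job name .aws-oidc-auth and the extends-based reuse are illustrative assumptions, not part of the v0.2 templates; AWS_ROLE_ARN and AWS_REGION would still be exported from the component's inputs as above.

```yaml
# Hypothetical hidden job; a plan or apply job would pick it up with extends: .aws-oidc-auth.
.aws-oidc-auth:
  id_tokens:
    AWS_OIDC_TOKEN:
      aud: sts.amazonaws.com
  before_script:
    - |
      # Stop with a clear message if GitLab did not mint the ID token.
      : "${AWS_OIDC_TOKEN:?AWS_OIDC_TOKEN is empty, check the id_tokens block}"
      # Subshell so the restrictive umask (file mode 600) only applies to the token file.
      (umask 077 && printf '%s' "$AWS_OIDC_TOKEN" > "$CI_PROJECT_DIR/.aws-oidc-token")
      export AWS_WEB_IDENTITY_TOKEN_FILE="$CI_PROJECT_DIR/.aws-oidc-token"
  after_script:
    # Runs in a fresh shell, so refer to the file by path rather than by exported variable.
    - rm -f "$CI_PROJECT_DIR/.aws-oidc-token"
```

GitLab concatenates before_script with script and runs them in one shell, so the export stays visible to tofu in script; after_script runs in a separate shell, hence the explicit path.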

Why audience sts.amazonaws.com: that's what the terraform-aws-bootstrap v0.2 GitLab path bakes into the IAM OIDC provider's client_id_list and the role trust policy's aud condition. The component's aud input defaults to it; an override exists for accounts that pinned a different audience.
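When an audience mismatch is suspected, a throwaway job can print the aud claim GitLab actually put in the token without echoing the token itself. This is a debugging sketch only: the job name is hypothetical, and it assumes the infra-tools image ships cut, tr, base64 and grep.

```yaml
# Hypothetical manual debug job: shows the aud claim GitLab minted, never the token.
debug-oidc-aud:
  stage: .pre
  when: manual
  image: registry.gitlab.com/phpboyscout/images/infra-tools:v0.2.0
  id_tokens:
    AWS_OIDC_TOKEN:
      aud: sts.amazonaws.com
  script:
    - |
      # The JWT payload is the second dot-separated segment, base64url-encoded.
      payload=$(printf '%s' "$AWS_OIDC_TOKEN" | cut -d. -f2 | tr '_-' '/+')
      # base64url drops the trailing '=' padding; restore it before decoding.
      while [ $(( ${#payload} % 4 )) -ne 0 ]; do payload="${payload}="; done
      printf '%s' "$payload" | base64 -d | grep -o '"aud":"[^"]*"'
```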

D3 — Consumer requirement: provider must NOT hardcode an AWS profile

The web-identity credential chain only kicks in if nothing higher-priority short-circuits it. A stack whose providers.tf says profile = "tofu-bootstrap" will ignore the env vars and fail in CI (there is no ~/.aws/credentials on the runner).

Consumers of tofu-plan / tofu-apply must declare the aws provider with no static profile — let the credential chain resolve. The recommended pattern:

variable "aws_profile" {
  description = "Local AWS CLI profile. Leave null in CI — the OIDC web-identity credential chain is used instead."
  type        = string
  default     = null
}

provider "aws" {
  region              = var.region
  profile             = var.aws_profile   # null in CI
  allowed_account_ids = [var.account_id]
}

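This is a consumer-side concern: it is documented here rather than enforced by the component. A wrongly pinned profile does at least fail loudly, because the AWS provider cannot resolve that profile on the runner and errors out at plan time with a credentials error (e.g. "no valid credential sources").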
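Since the component doesn't enforce D3, a consumer could add a cheap pre-flight check. The sketch below (hypothetical job name and stack path) fails if any .tf file in the stack assigns a string literal to profile, while profile = var.aws_profile passes. It is deliberately crude: it would also flag a literal profile in, say, an s3 backend block, and it assumes grep is available in the image.

```yaml
# Hypothetical pre-flight job for a consumer stack; the stack path is an example.
check-no-pinned-profile:
  stage: .pre
  image: registry.gitlab.com/phpboyscout/images/infra-tools:v0.2.0
  variables:
    STACK_DIR: src/security-baseline
  script:
    - |
      # profile = "<literal>" bypasses the web-identity chain (D3);
      # profile = var.aws_profile does not match this pattern.
      if grep -nE '^[[:space:]]*profile[[:space:]]*=[[:space:]]*"' "$STACK_DIR"/*.tf; then
        echo "ERROR: a literal AWS profile is pinned; CI relies on the OIDC credential chain" >&2
        exit 1
      fi
```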

D4 — State backend auth: gitlab-ci-token + $CI_JOB_TOKEN

Consumer stacks store state in the GitLab-managed HTTP backend (one state object per stack, per the migration spec). tofu init against that backend needs HTTP basic-auth. In CI the component exports:

export TF_HTTP_USERNAME="gitlab-ci-token"
export TF_HTTP_PASSWORD="$CI_JOB_TOKEN"

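CI_JOB_TOKEN can authenticate to the Terraform state API of the job's own project without extra setup: no PAT or project access token is needed for same-project state. The job token acts with the permissions of the user who triggered the pipeline, and GitLab's managed-state documentation asks for at least the Maintainer role to lock and write state (Developer only covers lock-free reads); since tofu plan locks by default, whoever triggers plan or apply needs Maintainer on the stack's project. Cross-project state would need a different credential; out of scope for v0.2 (every phpboyscout stack stores state in its own project).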
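The spec leaves the backend address to each stack. One option, sketched below under the assumption that the stack declares an empty backend "http" {} block, is to supply the rest through the TF_HTTP_* environment variables that OpenTofu's http backend reads, pointing at GitLab's Terraform state API. The state name security-baseline is an example; TF_HTTP_USERNAME and TF_HTTP_PASSWORD still come from the component as above.

```yaml
# Hypothetical consumer-level variables; "security-baseline" is an example state name.
variables:
  TF_HTTP_ADDRESS: ${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/terraform/state/security-baseline
  # GitLab's state API locks with POST and unlocks with DELETE on the /lock sub-path.
  TF_HTTP_LOCK_ADDRESS: ${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/terraform/state/security-baseline/lock
  TF_HTTP_LOCK_METHOD: POST
  TF_HTTP_UNLOCK_ADDRESS: ${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/terraform/state/security-baseline/lock
  TF_HTTP_UNLOCK_METHOD: DELETE
  # Minimum wait, in seconds, between retried HTTP requests to the backend.
  TF_HTTP_RETRY_WAIT_MIN: "5"
```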

D5 — Plan artifact hand-off + MR plan widget

tofu-plan runs tofu plan -out=tfplan.cache and saves tfplan.cache as a job artifact. tofu-apply declares needs: [tofu-plan-job] with artifacts: true and runs tofu apply tfplan.cache, applying exactly the plan that the same pipeline's plan job produced and that was available for review. If state moved between plan and apply, tofu apply rejects the stale plan (correct, fail-safe).

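tofu-plan additionally turns tofu show -json tfplan.cache into the create/update/delete summary that GitLab's reports: terraform: artifact expects, writes it to tfplan.json and publishes it, so GitLab renders an add/change/destroy summary in the MR widget. GitLab's documented pattern does this conversion with jq; publishing the raw plan JSON instead would not give the widget the counts it reads and would expose any sensitive values the plan contains. The infra-tools image therefore needs jq.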

Sensitive values: a binary plan can embed sensitive attribute values. Artifacts on a private project are acceptable exposure; the artifact expire_in is short (1 day). Consumers on public projects should not use tofu-apply's artifact hand-off with sensitive state — documented as a caveat.
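For consumers that do run this on a public project, recent GitLab versions (16.11 onward, to our understanding) offer artifacts:access, which limits who can download a job's artifacts from the UI and API. The sketch below is a consumer-side override of the included tofu-plan job, assuming GitLab's include merge keeps the component's other artifacts keys (paths, reports, expire_in). tofu-apply should still receive the artifact through needs:, since the restriction targets user downloads, but that is worth confirming before relying on it.

```yaml
# Hypothetical consumer override, merged into the included tofu-plan job.
tofu-plan:
  artifacts:
    # Restrict UI/API downloads of the job's artifacts to members with at least Developer.
    access: developer
```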

D6 — tofu-apply is manual-gated by default

tofu-apply's default rules: run the job only on the default branch and as when: manual — a human clicks "apply" in the GitLab UI after reviewing the plan. Infrastructure apply is not something to trigger automatically on merge.

A manual boolean input (default true) lets a consumer opt into auto-apply-on-merge if they have a reason; we don't, and the default protects against accidental applies.

D7 — Inputs surface

tofu-plan:

| Input | Type | Default | Notes |
| --- | --- | --- | --- |
| image_version | string | "v0.2.0" | infra-tools tag |
| stage | string | "plan" | consumer's stage layout |
| working_directory | string | "." | the stack directory |
| role_arn | string | none (required) | AWS role to assume |
| aws_region | string | "eu-west-2" | |
| aud | string | "sts.amazonaws.com" | OIDC token audience |
| var_file | string | "" | optional -var-file path, relative to working_directory |

tofu-apply: all of the above except var_file (a saved plan already carries its variable values), plus:

| Input | Type | Default | Notes |
| --- | --- | --- | --- |
| manual | boolean | true | whether apply is a manual-gated job |
| plan_job | string | "tofu-plan" | the job name to pull the plan artifact from |
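A hedged sketch of how a consumer might wire both components for the security-baseline stack. The component addresses follow GitLab's <fqdn>/<project>/<component>@<version> form; the account ID in role_arn and the ci.tfvars file name are placeholders. Because plan and apply are not among GitLab's default stage names, the consumer has to declare them.

```yaml
# Hypothetical consumer .gitlab-ci.yml in phpboyscout/infra; account ID and tfvars name are placeholders.
stages:
  - plan
  - apply

include:
  - component: gitlab.com/phpboyscout/cicd/tofu-plan@v0.2.0
    inputs:
      working_directory: src/security-baseline
      role_arn: arn:aws:iam::111111111111:role/phpboyscout-automation
      var_file: ci.tfvars  # relative to working_directory
  - component: gitlab.com/phpboyscout/cicd/tofu-apply@v0.2.0
    inputs:
      working_directory: src/security-baseline
      role_arn: arn:aws:iam::111111111111:role/phpboyscout-automation
      # manual defaults to true, so the apply job waits for a click on the default branch
```

var_file is passed only to tofu-plan: the saved plan already carries the variable values into apply.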

D8 — Versioning

Adding components is a minor bump → cicd v0.2.0. The four v0.1 components are unchanged; consumers on @v0.1.x are unaffected until they bump. Pre-1.0 caveat from v0.1 still holds.

Open questions

  • OQ1 — Single combined role vs plan/apply split. Phase D provisioned one role (phpboyscout-automation) with AdministratorAccess. The components take a role_arn input, so a future plan/apply role split (read-only plan role, write apply role) is a consumer-side change (call terraform-aws-bootstrap's automation-iam twice), not a component change. Tentative: ship v0.2 single-role; revisit the role split as a separate piece of work.
  • OQ2 — tofu-apply re-plan vs artifact apply. v0.2 uses the artifact hand-off (apply the reviewed plan). The alternative is for apply to re-plan from scratch. Artifact apply is safer (no drift window between review and apply) and is the GitLab-documented pattern. Tentative: artifact hand-off, as in D5.
  • OQ3 — GitLab environment: integration. GitLab can track deployments per environment via the environment: keyword. v0.2 doesn't wire this; a v0.2.x follow-on could add an environment input so applies show up in the GitLab environments/deployments UI. Tentative: defer.
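To illustrate OQ1's point that a role split needs no component change, the same two includes would simply take different role_arn values. Both role names below are hypothetical stand-ins for what a second automation-iam call might create, and the account ID is a placeholder.

```yaml
# Hypothetical OQ1 role split; role names and account ID are placeholders.
include:
  - component: gitlab.com/phpboyscout/cicd/tofu-plan@v0.2.0
    inputs:
      working_directory: src/security-baseline
      role_arn: arn:aws:iam::111111111111:role/phpboyscout-automation-plan   # read-only
  - component: gitlab.com/phpboyscout/cicd/tofu-apply@v0.2.0
    inputs:
      working_directory: src/security-baseline
      role_arn: arn:aws:iam::111111111111:role/phpboyscout-automation-apply  # write
```

A split would also let the apply role's trust policy pin the token's sub claim to the default branch (GitLab ID tokens use the form project_path:phpboyscout/infra:ref_type:branch:ref:main), so branch pipelines could plan but not assume the write role.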

Component catalogue

tofu-plan

spec:
  component: [version]
  inputs:
    image_version: { type: string, default: "v0.2.0" }
    stage:         { type: string, default: plan }
    working_directory: { type: string, default: "." }
    role_arn:      { type: string }
    aws_region:    { type: string, default: "eu-west-2" }
    aud:           { type: string, default: "sts.amazonaws.com" }
    var_file:      { type: string, default: "" }
---
tofu-plan:
  stage: $[[ inputs.stage ]]
  image: registry.gitlab.com/phpboyscout/images/infra-tools:$[[ inputs.image_version ]]
  id_tokens:
    AWS_OIDC_TOKEN:
      aud: $[[ inputs.aud ]]
  variables:
    TF_HTTP_USERNAME: gitlab-ci-token
    TF_HTTP_PASSWORD: $CI_JOB_TOKEN
  script:
    - |
      echo "tofu-plan $[[ component.version ]] (image $[[ inputs.image_version ]])"
      echo "$AWS_OIDC_TOKEN" > "$CI_PROJECT_DIR/.aws-oidc-token"
      export AWS_ROLE_ARN="$[[ inputs.role_arn ]]"
      export AWS_WEB_IDENTITY_TOKEN_FILE="$CI_PROJECT_DIR/.aws-oidc-token"
      export AWS_REGION="$[[ inputs.aws_region ]]"
      cd "$[[ inputs.working_directory ]]"
      tofu init -input=false
      VARFILE_ARG=""
      [ -n "$[[ inputs.var_file ]]" ] && VARFILE_ARG="-var-file=$[[ inputs.var_file ]]"
      tofu plan -input=false -out=tfplan.cache $VARFILE_ARG
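      # GitLab's terraform report wants create/update/delete counts, not the raw plan JSON (needs jq in the image).
      tofu show -json tfplan.cache \
        | jq -r '([.resource_changes[]?.change.actions?]|flatten)|{"create":(map(select(.=="create"))|length),"update":(map(select(.=="update"))|length),"delete":(map(select(.=="delete"))|length)}' \
        > tfplan.json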
  artifacts:
    paths:
      - $[[ inputs.working_directory ]]/tfplan.cache
    reports:
      terraform: $[[ inputs.working_directory ]]/tfplan.json
    expire_in: 1 day

tofu-apply

spec:
  component: [version]
  inputs:
    image_version: { type: string, default: "v0.2.0" }
    stage:         { type: string, default: apply }
    working_directory: { type: string, default: "." }
    role_arn:      { type: string }
    aws_region:    { type: string, default: "eu-west-2" }
    aud:           { type: string, default: "sts.amazonaws.com" }
    manual:        { type: boolean, default: true }
    plan_job:      { type: string, default: "tofu-plan" }
---
tofu-apply:
  stage: $[[ inputs.stage ]]
  image: registry.gitlab.com/phpboyscout/images/infra-tools:$[[ inputs.image_version ]]
  id_tokens:
    AWS_OIDC_TOKEN:
      aud: $[[ inputs.aud ]]
  variables:
    TF_HTTP_USERNAME: gitlab-ci-token
    TF_HTTP_PASSWORD: $CI_JOB_TOKEN
  needs:
    - job: $[[ inputs.plan_job ]]
      artifacts: true
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
      when: $[[ inputs.manual ]] && "manual" || "on_success"
  script:
    - |
      echo "tofu-apply $[[ component.version ]] (image $[[ inputs.image_version ]])"
      echo "$AWS_OIDC_TOKEN" > "$CI_PROJECT_DIR/.aws-oidc-token"
      export AWS_ROLE_ARN="$[[ inputs.role_arn ]]"
      export AWS_WEB_IDENTITY_TOKEN_FILE="$CI_PROJECT_DIR/.aws-oidc-token"
      export AWS_REGION="$[[ inputs.aws_region ]]"
      cd "$[[ inputs.working_directory ]]"
      tofu init -input=false
      tofu apply -input=false tfplan.cache

The when: ternary in tofu-apply's rule is illustrative and must be verified during implementation. GitLab's $[[ ]] input interpolation substitutes text rather than evaluating expressions, so the inline ternary is unlikely to work as written; the expected fix is to split the rule into two entries gated on the boolean.
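One candidate shape for that two-entry fallback, assuming job-level variables are visible to rules:if (to be confirmed during implementation): carry the boolean into a job variable and test it in the first rule. Only the keys that change are shown; TOFU_APPLY_MANUAL is a hypothetical name and would sit alongside the existing TF_HTTP_* variables.

```yaml
# Sketch of the two-rule fallback: only the changed keys of the catalogue's tofu-apply job.
tofu-apply:
  variables:
    # Hypothetical variable: interpolation turns the boolean input into the text true/false.
    TOFU_APPLY_MANUAL: "$[[ inputs.manual ]]"
  rules:
    # Default branch with manual: true: wait for a human to run the apply.
    - if: '$CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH && $TOFU_APPLY_MANUAL == "true"'
      when: manual
    # Default branch with manual: false: apply once the plan job succeeds.
    - if: '$CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH'
      when: on_success
    # No rule matches on other branches, so the job is not added to those pipelines.
```

One GitLab behaviour worth knowing: a when: manual set through rules defaults to allow_failure: false, so the pipeline shows as blocked until someone runs the apply. That arguably suits an infrastructure gate; adding allow_failure: true to the first rule would let the pipeline finish with the apply still pending.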

Risk register

| Risk | Mitigation |
| --- | --- |
| OIDC token audience mismatch (role assumption fails) | The aud input defaults to the value terraform-aws-bootstrap v0.2 bakes into the IAM provider. The self-test fixture makes no AWS call, so the full chain is first proven by Phase E's first tofu plan of infra/src/security-baseline/. |
| Plan artifact embeds sensitive values | Private-project artifact, short expire_in. Documented caveat for public-project consumers. |
| Stale plan applied after state drift | tofu apply tfplan.cache rejects a plan whose state serial no longer matches: fail-safe by design. |
| CI_JOB_TOKEN lacks terraform_state permission | Default GitLab project settings grant it; if a project disabled it, tofu init fails loudly at the backend step. Documented prerequisite. |
| Consumer pins an AWS profile and the job ignores OIDC creds | D3 documents the requirement; the AWS provider errors clearly. The self-test fixture has no aws provider at all, so it cannot pin a profile. |
| when: input interpolation unsupported | Flagged inline above; implementation verifies and falls back to two rules: entries if needed. |

Implementation plan

  1. Spec lands — this file, status approved.
  2. templates/tofu-plan.yml + templates/tofu-apply.yml per the catalogue.
  3. Self-test: tests/tofu-plan/ + tests/tofu-apply/ fixtures. Resolved (option b): the fixture stack uses no aws provider, just a terraform_data resource (built-in tofu provider) with a variable + output. The fixture declares a backend "http" pointing at the phpboyscout/cicd project's own GitLab-managed state (a dedicated state name per component: selftest-plan / selftest-apply), so the self-test exercises:
      • the id_tokens: token mint,
      • the component's token-file + env-var wiring,
      • TF_HTTP_* auth against the GitLab state backend (CI_JOB_TOKEN),
      • tofu init / plan / apply and the plan→apply artifact handoff.
     It deliberately does not prove the AWS AssumeRoleWithWebIdentity exchange: no AWS API call is made because the fixture has no aws provider. role_arn is passed a dummy value (arn:aws:iam::000000000000:role/selftest-noop); it's exported as AWS_ROLE_ARN but never used. The real end-to-end AWS-auth proof is Phase E's first tofu plan of infra/src/security-baseline/. The tofu-apply self-test sets manual: false so the apply runs automatically; applying the fixture is a no-op terraform_data write, no cloud resources.
  4. Root .gitlab-ci.yml gains the two new self-test triggers.
  5. CHANGELOG [0.2.0], merge develop → main, tag v0.2.0.
  6. Phase E proper: infra/src/security-baseline/ consumes tofu-plan + tofu-apply (separate task).
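A sketch of what the tofu-apply self-test's child pipeline might look like, assuming the root .gitlab-ci.yml triggers it with trigger: include: tests/tofu-apply/.gitlab-ci.yml (that file location is an assumption). It pins both components to $CI_COMMIT_SHA of the cicd project itself, GitLab's documented way to exercise a component before release.

```yaml
# Hypothetical tests/tofu-apply/.gitlab-ci.yml (child pipeline of the cicd project).
stages:
  - plan
  - apply

include:
  # Use the templates from the commit under test, not a released tag.
  - component: $CI_SERVER_FQDN/$CI_PROJECT_PATH/tofu-plan@$CI_COMMIT_SHA
    inputs:
      working_directory: tests/tofu-apply
      role_arn: arn:aws:iam::000000000000:role/selftest-noop  # exported, never used
  - component: $CI_SERVER_FQDN/$CI_PROJECT_PATH/tofu-apply@$CI_COMMIT_SHA
    inputs:
      working_directory: tests/tofu-apply
      role_arn: arn:aws:iam::000000000000:role/selftest-noop
      manual: false  # apply runs without a click
```

Note that manual: false only removes the manual gate: the component's rule still limits tofu-apply to default-branch pipelines, so on other branches this self-test would run tofu-plan alone.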

Follow-ups

  • environment: integration (OQ3) — surface applies in GitLab's deployment UI.
  • plan/apply role split (OQ1) — separate read-only plan role, consumer-side.
  • tofu-destroy component — eventually, for tearing down ephemeral stacks; deliberately omitted from v0.2 (destroy is rare and high-blast-radius; manual tofu destroy is fine until a real need appears).