Spec: phpboyscout/cicd v0.10.5 — goreleaser auto-retries transient release failures

  • Repository: gitlab.com/phpboyscout/cicd
  • Released as: v0.10.5 (patch: a new input would normally be a purely additive change, but here its default value changes release-job behaviour on failure, so it ships as a fix).
  • Driver: go-tool-base's v0.17.0 tag pipeline. The goreleaser job fired automatically on the tag, but failed during macOS notarization:
sign & notarize macOS binaries
release failed: unable to add timestamps (RFC3161):
  Post "http://timestamp.apple.com/ts01": dial tcp 17.32.213.161:80: i/o timeout

goreleaser fails the entire run on that error, so a single transient network blip reaching Apple's timestamp server published zero release assets for every platform. A manual re-run ~6.5h later (same commit, same config) succeeded — confirming a pure transient. Self-updating consumers then saw "unable to find asset for ".

Summary

Release jobs are long, expensive, and tag-triggered once per release; a transient failure (a timestamp-server dial-timeout, a runner dropout, an image pull blip) should not require a human to notice and click "retry". GitLab's job-level retry is the right tool, but it is not currently wired into the goreleaser component.

Add a retry_max input (default 2) and wire retry on the transient failure classes:

goreleaser:
  retry:
    max: $[[ inputs.retry_max ]]      # default 2 (GitLab caps at 2)
    when:
      - script_failure
      - runner_system_failure
      - stuck_or_timeout_failure
  ...

script_failure covers the notarization/timestamp timeout (goreleaser exits 1); runner_system_failure and stuck_or_timeout_failure cover runner dropouts and hung jobs. The retry is safe: the v0.17.0 failure occurred during signing, before any release upload, and goreleaser's release.mode: keep-existing (the documented mode for this component) makes a re-run idempotent — it attaches to the existing Release and replaces artefacts rather than duplicating them.

Design

New input

| Input | Type | Default | Description |
|---|---|---|---|
| retry_max | number | 2 | Automatic retries on a transient release failure (network / runner / timeout). GitLab caps this at 2; set 0 to disable. |
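A minimal sketch of how this input could be declared in the component template and consumed by the job. The file path is hypothetical (the spec does not show the template layout), and the rest of the job definition is omitted; only the input name, type, default, and retry wiring come from this spec.

```yaml
# templates/goreleaser.yml (hypothetical path, for illustration only)
spec:
  inputs:
    retry_max:
      type: number
      default: 2
      description: "Automatic retries on a transient release failure (network / runner / timeout). GitLab caps this at 2; set 0 to disable."
---
goreleaser:
  retry:
    max: $[[ inputs.retry_max ]]   # interpolated as a number: 0, 1, or 2
    when:
      - script_failure
      - runner_system_failure
      - stuck_or_timeout_failure
  # remaining job keys (image, script, rules, ...) are unchanged and omitted here
```

Declaring the input as a number lets GitLab reject non-numeric values at include time, before the pipeline is created.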

Behaviour

  • Default (retry_max: 2): a transient release failure auto-retries up to twice before the job is marked failed. A genuine, deterministic failure (bad .goreleaser.yaml, a real build error) will still ultimately fail — it just takes the extra attempts to surface. The cost (a few extra minutes on a real failure) is far smaller than a missed release.
  • retry_max: 0: restores pre-v0.10.5 behaviour (no retry). The failure-path self-test sets this so that the deliberately failing job runs once rather than three times.
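From a consumer's side, the two settings could look like the sketch below. The component reference follows the include path named in this spec; any other inputs the component may require are left out, so treat this as illustrative rather than a complete include.

```yaml
# .gitlab-ci.yml in a consuming project (illustrative)
include:
  - component: gitlab.com/phpboyscout/cicd/goreleaser@v0.10.5
    inputs:
      retry_max: 0   # no automatic retry (pre-v0.10.5 behaviour); omit this line to keep the default of 2
```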

retry.when is fixed rather than exposed as an input. The three transient classes above are the only ones worth auto-retrying: retrying on every failure class (when: always) would mask deterministic failures, and restricting retries to exactly these classes is the point.

Tests

tests/goreleaser/ deliberately runs goreleaser with no .goreleaser.yaml so the job exits non-zero (tolerated via allow_failure.exit_codes). With the new default that failure would retry twice (three runs) before being tolerated, so the fixture passes retry_max: 0 to keep the self-test single-shot. This also exercises the new input plumbing.
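A sketch of what such a fixture could look like. The template path, the job override, and the tolerated exit code are assumptions for illustration; the spec only states that the fixture passes retry_max: 0 and that the failure is tolerated via allow_failure.exit_codes.

```yaml
# Sketch of a tests/goreleaser/ fixture (file names and exit code are assumptions)
include:
  - local: templates/goreleaser.yml   # hypothetical path to the component template
    inputs:
      retry_max: 0                    # keep the deliberately failing job single-shot

goreleaser:
  allow_failure:
    exit_codes:
      - 1                             # assumed: the exit code goreleaser returns when no config is found
```

With retry_max left at its default, the same fixture would run the failing job three times before allow_failure is applied.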

Consumer follow-up

go-tool-base carries an interim in-repo retry override on its goreleaser job (the immediate stopgap raised the same day). Once it bumps its gitlab.com/phpboyscout/cicd/goreleaser include to @v0.10.5, that override is removed in favour of this component default.