This page is for managers who need Ansible to be predictable, auditable, and low-risk, not “cool”.
What Ansible gives you:
- A repeatable way to configure systems and deploy changes.
- A single source of truth for “how servers should look” (versioned, reviewed).
- Faster recovery and rebuilds (when playbooks are actually used and tested).
¶ Where Ansible Fits (And Where It Does Not)
Use it for:
- OS and baseline configuration (users, SSH, packages, time, logging).
- Service configuration and deployments (idempotent changes).
- Compliance baselines (when you can prove intended state and drift).
Do not use it as:
- A replacement for image building (Packer) or orchestration (Kubernetes) when those are the actual needs.
- A giant “run shell scripts everywhere” system without review and testing.
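The difference between an idempotent change and "run a script everywhere" is easier to see side by side. The following is a minimal sketch, not a recommended production play: the package (`chrony`), the RHEL-style paths and service name, the template file `chrony.conf.j2`, and the script path in the commented-out task are all illustrative assumptions.

```yaml
# Sketch only: package, paths, and template name are illustrative assumptions
# (RHEL-family layout; Debian-family hosts use different paths and service names).
- name: Configure time sync (idempotent example)
  hosts: all
  become: true
  tasks:
    # Idempotent: reports "ok" when the package is already installed.
    - name: Ensure chrony is installed
      ansible.builtin.package:
        name: chrony
        state: present

    # Idempotent: rewrites the file and notifies the handler only when
    # the rendered content differs from what is on disk.
    - name: Deploy chrony configuration
      ansible.builtin.template:
        src: chrony.conf.j2
        dest: /etc/chrony.conf
        owner: root
        group: root
        mode: "0644"
      notify: Restart chronyd

    # Anti-pattern, shown for contrast: without changed_when/creates this
    # reports "changed" on every run and gives reviewers nothing to diff.
    # - name: Configure time sync
    #   ansible.builtin.shell: /opt/scripts/setup-time.sh

  handlers:
    - name: Restart chronyd
      ansible.builtin.service:
        name: chronyd
        state: restarted
```

A second run of a play like this against an unchanged host should report zero changed tasks, which is the property that makes drift detection and audit evidence possible.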
Minimum controls:
- Version control: playbooks and roles live in git, not copied around.
- Code review: merge requests required for changes to production automation.
- Environment separation: dev/stage/prod inventories and variables are clearly separated.
- Secrets: no plaintext credentials in repos; use Ansible Vault or an external secrets backend.
- Least privilege: limit `become` usage, restrict SSH keys, and log execution.
- Testing gate: at minimum `ansible-lint` plus a dry-run or smoke run in a lab.
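As one possible shape for the testing gate, here is a sketch of a CI pipeline that lints, syntax-checks, and then dry-runs against staging. It assumes GitLab CI (the text mentions merge requests, but the CI system is not stated), and the paths `playbooks/site.yml` and `inventories/stage/hosts.yml` are hypothetical. It also assumes the runner already has SSH access to the staging hosts, which has to be provisioned separately.

```yaml
# .gitlab-ci.yml (sketch; file paths and the staging inventory are assumptions)
stages:
  - lint
  - dry_run

lint:
  stage: lint
  image: python:3.12
  script:
    - pip install ansible-core ansible-lint
    # Static checks: style and common mistakes, then playbook syntax.
    - ansible-lint playbooks/site.yml
    - ansible-playbook -i inventories/stage/hosts.yml playbooks/site.yml --syntax-check

staging_dry_run:
  stage: dry_run
  image: python:3.12
  script:
    - pip install ansible-core
    # Check mode predicts changes without applying them; --diff shows
    # what would change in templated files.
    - ansible-playbook -i inventories/stage/hosts.yml playbooks/site.yml --check --diff
```

Check mode is only as accurate as the modules involved: tasks that shell out to scripts generally cannot predict their own effect, which is one more reason to keep them rare.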
Operating model:
- Standard repo layout (inventories, roles, group_vars/host_vars, collections).
- Clear ownership per role/playbook (who maintains it, who approves changes).
- Runbooks for: onboarding new hosts, emergency changes, rollback, and break-glass access.
- A cadence for updates (Ansible core, collections, OS packages) and for playbook maintenance.
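To make the repo layout and environment separation concrete, here is a sketch of a per-environment inventory file. The path, group names, and hostnames are hypothetical; the only fixed behavior relied on is that Ansible loads `group_vars/` and `host_vars/` directories located next to the inventory file.

```yaml
# inventories/prod/hosts.yml (hypothetical path, groups, and hostnames)
# Variables for this environment would live alongside it, for example
# inventories/prod/group_vars/all.yml, so prod values never mix with
# dev or stage values.
all:
  children:
    web:
      hosts:
        web01.prod.example.com:
        web02.prod.example.com:
    db:
      hosts:
        db01.prod.example.com:
```

Running `ansible-inventory -i inventories/prod/hosts.yml --graph` prints the resulting group tree, which is a quick way for a reviewer to confirm which hosts a change can reach.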
Metrics to track:
- Change failure rate (did automation reduce incidents or create new ones?).
- Mean time to recover (MTTR) when a node needs a rebuild.
- Drift rate: how often “intended state” differs from reality.
- Lead time for a safe change (from request to applied change with evidence).
Questions to ask the team:
- What is in scope for Ansible (baseline only, or apps too)?
- How do we prevent ad-hoc changes from causing drift from the intended state?
- What is the rollback plan when a playbook causes an outage?
- How are secrets stored and rotated?
- What is the promotion path from dev to prod?
- What does “done” look like (tests, evidence, logs, approvals)?
Rollout order:
- Start with a baseline role for all hosts (SSH, time sync, users, logging).
- Add one service end-to-end (install, config, start, health check).
- Add CI checks and a staging run.
- Expand to more services and more environments once the pipeline is boring.
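The "one service end-to-end" step could look roughly like the play below. This is a sketch under assumptions: nginx as the service, a `web` inventory group, a template named `site.conf.j2`, and a health check that expects HTTP 200 from `/` on the managed host are all illustrative choices, not requirements.

```yaml
# Sketch only: service, group name, template, and health URL are assumptions.
- name: Deploy nginx end-to-end (install, config, start, health check)
  hosts: web
  become: false   # least privilege: escalate per task, not for the whole play
  tasks:
    - name: Install nginx
      ansible.builtin.package:
        name: nginx
        state: present
      become: true

    - name: Deploy site configuration
      ansible.builtin.template:
        src: site.conf.j2
        dest: /etc/nginx/conf.d/site.conf
        owner: root
        group: root
        mode: "0644"
      become: true
      notify: Reload nginx

    - name: Ensure nginx is running and enabled at boot
      ansible.builtin.service:
        name: nginx
        state: started
        enabled: true
      become: true

    # Run any pending reload now so the health check sees the new config.
    - name: Apply pending handlers before the health check
      ansible.builtin.meta: flush_handlers

    # The uri module runs on the managed host, so "localhost" is that host.
    - name: Health check
      ansible.builtin.uri:
        url: http://localhost/
        status_code: 200
      register: health
      retries: 5
      delay: 3
      until: health.status == 200

  handlers:
    - name: Reload nginx
      ansible.builtin.service:
        name: nginx
        state: reloaded
      become: true
```

Ending the play with a health check means a failed deployment fails the run itself, which gives the pipeline and the change record a clear pass or fail signal.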