Modernizing CI build servers: How to migrate from Chef to Ansible
- Learn why we migrated from Chef to Ansible configuration management for our continuous integration (CI) build servers.
- Discover practical strategies for gradual migration without disrupting existing pipelines.
- Understand cost comparisons and hosting considerations for build servers.
- Compare Chef and Ansible to make informed decisions for your infrastructure.
What happens when your infrastructure configuration management system becomes a bottleneck? When deploying a single continuous integration (CI) server takes more than an hour and only one team member (barely) understands how it works, it’s time for a change.
This blog post explores our journey from Chef to Ansible for managing CI build servers — critical infrastructure supporting daily developer operations. It covers how we improved team maintainability, cut our configuration codebase by more than 99 percent, greatly improved scaling capabilities, and transformed our deployment process.
Why now?
Ever since I joined Nutrient in late 2023, I had wanted to modernize our CI build servers (running as self-hosted Buildkite agents) and the Chef cookbooks that configure them. Working with them daily was challenging: dead code, minimal documentation, and my own limited Ruby and Chef expertise made maintenance and new development difficult.
To be frank, we let our Chef ecosystem rot. Documentation was minimal or non-existent, key team members who understood the system had left, and previous upgrade attempts had failed. The infrastructure worked, but it had become a black box that we could barely maintain and could not extend with new features.
Our fleet of bare-metal Linux servers from Hetzner Dedicated Server, running Debian 10 Buster, was functioning adequately. However, with Debian 10 reaching end-of-life (EOL) on 30 June 2024, we faced growing security and compliance risks, which made the impending deadline a clear catalyst for change.
This deadline created the perfect opportunity to reassess our configuration management strategy and transition away from Chef, which had become a maintenance bottleneck.
Core issues identified
We identified several critical problems that made migration necessary:
- Security risk — Running an unsupported OS created compliance issues and security vulnerabilities.
- Knowledge gap — Chef became a black box for our Platform team, with failed previous upgrade attempts.
- Complexity — Critical environments and variables were intertwined in unknown ways, with nested cookbooks referencing decommissioned components written more than a decade ago.
- Outdated technology — Chef lagged behind modern DevOps tools, while Ansible offered greater flexibility and ease of use.
- Scaling difficulties — Our Chef setup required significant time to deploy individual servers or make configuration changes.
Chef vs. Ansible: A practical comparison
When evaluating configuration management tools, we compared Chef and Ansible across several dimensions relevant to our infrastructure needs.
| Aspect | Chef | Ansible |
|---|---|---|
| Language | Ruby-based DSL — full programming language enables complex logic and sophisticated abstractions | YAML-based playbooks — declarative and human-readable |
| Learning curve | Steep — involves learning Ruby and Chef’s architecture | Gentle — human-readable YAML, easier for teams to learn |
| Infrastructure requirements | Requires Chef server, database, and associated tooling — provides centralized management, reporting, and compliance tracking | Agentless — runs over SSH, no dedicated server, simpler initial setup but no native reporting or auditing |
| Deployment model | Pull-based — agents check in with Chef server and pull configurations, similar to GitOps workflow | Push-based — configurations are pushed from control node to target servers on demand |
| Code complexity | Complex nested cookbooks with dependencies — powerful abstractions and reusable code patterns | Simple, modular playbooks with minimal dependencies — easier to understand at a glance |
| Online documentation | Extensive, mature documentation with deep technical coverage | Straightforward, task-oriented documentation |
| Community and ecosystem | Mature ecosystem, albeit declining relative to modern tools | Active, growing community with strong support and modern integrations |
| Idempotency | Robust idempotency when properly designed, with comprehensive resource management | Native idempotency by default with clear task execution |
| Debugging | Centralized logging and reporting through Chef server, but can be challenging with nested dependencies | Easier with clear task execution and verbose output, but requires manual log aggregation |
| Multi-server deployment | Built-in orchestration through Chef server with centralized control and reporting | Native parallel execution capabilities with flexible ad-hoc execution |
Why move away from Chef?
Chef was an industry standard for configuration management, but our experience revealed significant limitations beyond the core issues we faced:
- High learning curve — Requires Ruby knowledge and understanding of Chef’s architecture, which the current Platform team does not have
- Infrastructure overhead — Maintaining the Chef server, database, and associated toolset created ongoing costs and complexity
- Declining ecosystem — Community and industry support declined relative to modern tools like Ansible
- Technical debt — The time required to refactor and document our legacy cookbooks would be enormous
These factors, combined with our immediate challenges, made migrating to a more modern, approachable configuration management tool a faster and safer long-term strategy than maintaining our Chef infrastructure.
Requirements for CI server migration
We established clear criteria for our migration, outlined below, to ensure success.
Must-have requirements
- Minimal disruption — Existing pipelines remain functional, and developer teams should not notice the transition
- Current OS — Stay up to date to meet security practices and compliance requirements
- Team maintainability — New setup must be understood and managed by current team members
- Clear documentation — Eliminate undocumented knowledge with easy-to-follow runbooks
- Bare-metal servers — Support specialized workloads like Android emulation requiring low-level hardware access
- Rapid, automated deployment — Reduce time to deploy new servers and eliminate as many manual steps as possible
Nice-to-have features
- Single infrastructure provider — Simplified management with consolidated hosting
- Vendor-neutral tooling — Avoid lock-in to provider-specific tools such as CloudFormation or ARM templates, preserving provider flexibility
- Cost optimization — Maximize value and minimize infrastructure expenses
- Elastic scaling — Support easy scaling up or down as requirements change
Solution options explored
We evaluated multiple approaches to find the best fit for our requirements.
Configuration management tools
Packer — Excellent for image-based scaling in cloud environments like AWS EC2, but not suitable for our bare-metal requirements.
Ansible — Simple, readable playbooks with no dedicated server or database requirements. Offered better flexibility for our specific infrastructure needs.
Hosting considerations
We then compared three hosting options for our specialized requirements.
Hetzner Dedicated Server
- Bare-metal servers perfect for Android builds
- Proven reliability with existing infrastructure
- Hardware-level access for emulation workloads
- Manual scaling and no possibility of infrastructure as code (IaC)
Hetzner Cloud
- Virtualized environment with good flexibility
- Integrated support for Packer, Ansible, and Terraform
- Limited by vertical scalability constraints
AWS EC2
- Comprehensive autoscaling capabilities
- Provider consolidation benefits
- Significantly higher costs and complexity
Cost comparison analysis
| Provider | Server/instance type | Type | Specifications | Storage | Monthly cost* | Autoscaling |
|---|---|---|---|---|---|---|
| Hetzner Dedicated Server | AX52 | Bare-metal | AMD Ryzen 7 7700, 64GB DDR5, 8 cores | 2×1TB Gen4 NVMe SSD | $65 + $43 one-off setup fee | No autoscaling |
| AWS EC2 | c5.metal | Bare-metal | 96 vCPU, 192GB RAM | EBS gp3: $8/100GB | $2,980 + storage | Available with AWS autoscaling groups (ASG) |
| Hetzner Cloud | CCX33 | Virtualized | 8 vCPU, 32GB RAM | 240GB SSD | $63 | Supported |
| AWS EC2 | m5.2xlarge | Virtualized | 8 vCPU, 32GB RAM | EBS gp3: $8/100GB | $280 + storage | Full autoscaling capabilities |
*Pricing at time of evaluation in USD, hosted in European data centers. AWS costs exclude EBS storage, data transfer, and other fees.
Note: For bare-metal comparisons, Hetzner Dedicated Server (AX52) and AWS EC2 (c5.metal) represent the closest comparable server specifications available from each provider.
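To make the gap concrete, the list prices in the table can be turned into first-year totals. The following is a rough sketch that uses only the figures quoted above. Like the table, it ignores EBS storage, data transfer, and other fees, and the dictionary keys and function names are purely illustrative.

```python
# Monthly prices (USD) and one-off fees taken from the table above.
# Storage, data transfer, and other fees are excluded, as in the table.
OFFERS = {
    "Hetzner Dedicated AX52": {"monthly": 65, "setup": 43},
    "AWS EC2 c5.metal": {"monthly": 2980, "setup": 0},
    "Hetzner Cloud CCX33": {"monthly": 63, "setup": 0},
    "AWS EC2 m5.2xlarge": {"monthly": 280, "setup": 0},
}


def first_year_cost(offer: dict) -> int:
    """Twelve monthly payments plus any one-off setup fee."""
    return 12 * offer["monthly"] + offer["setup"]


def cost_ratio(name_a: str, name_b: str) -> float:
    """How many times more expensive offer A is than offer B over one year."""
    return first_year_cost(OFFERS[name_a]) / first_year_cost(OFFERS[name_b])


if __name__ == "__main__":
    for name, offer in OFFERS.items():
        print(f"{name}: ${first_year_cost(offer):,} in the first year")
    ratio = cost_ratio("AWS EC2 c5.metal", "Hetzner Dedicated AX52")
    print(f"c5.metal costs about {ratio:.0f}x the AX52 over one year")
```

On these numbers, a single c5.metal costs roughly 43 times as much as an AX52 over the first year, although the c5.metal is a considerably larger machine, so the ratio is not a like-for-like comparison of capacity.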
Each hosting option has tradeoffs:
- AWS offers comprehensive autoscaling and managed services, albeit at significantly higher cost. Scaling and replacing individual servers is automatic and painless.
- Hetzner Cloud provides good flexibility and IaC support but has vertical scalability constraints.
- Hetzner Dedicated Server offers great value for bare-metal servers, but it lacks autoscaling and requires fairly meticulous scripts to configure. Debugging hardware issues often requires contacting Hetzner support.
Decision: Hetzner Dedicated Server offered the best value for our bare-metal requirements, aligning with our existing infrastructure expertise. The cost savings and proven reliability outweighed the lack of autoscaling capabilities.
The path forward: Ansible on Hetzner Dedicated Server
We chose to modernize our configuration management with Ansible while maintaining our proven Hetzner Dedicated Server infrastructure. Our strategy focused on three key principles: replacing Chef cookbooks with Ansible playbooks, executing a gradual transition to minimize risk, and documenting processes for team maintainability.
Implementation plan
Once we decided what to do, we structured our migration as a four-phase process to minimize risk and ensure success.
Phase 1: Proof-of-concept development
Develop initial Ansible playbooks to configure Linux servers with essential components:
- Buildkite Agent — Complete installation and configuration setup
- Android CI support — Testing capabilities for specialized build requirements
- Essential tooling — Docker, Git, Vim, curl, TLS certificates, and development dependencies
- Authentication systems — Repository access and HyperDX Agent integration for observability
- Hetzner-specific features — Rescue mode access and disk encryption configuration
Phase 2: Validation and testing
Comprehensive testing to ensure reliability before production deployment:
- Pipeline validation — Execute major pipelines, including monorepo and website builds, on test nodes
- Health monitoring — Verify server status through Buildkite, SSH access, and essential mount points for disk drives
- Performance benchmarking — Compare deployment times and resource utilization against Chef baseline
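Part of the health monitoring above can be approximated with a small script. This is a minimal sketch, not our actual tooling: the host and mount points are placeholders, the port check only confirms that a TCP connection succeeds (not that an SSH login works), and the mount check applies to the machine the script runs on.

```python
import os
import socket


def tcp_port_open(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def missing_mounts(paths: list[str]) -> list[str]:
    """Return the paths that are NOT mount points on the local machine."""
    return [p for p in paths if not os.path.ismount(p)]


if __name__ == "__main__":
    # Placeholder values; replace with a real agent host and its data disks.
    host = "localhost"
    expected_mounts = ["/", "/var/lib/docker"]

    print("SSH port reachable:", tcp_port_open(host))
    print("Missing mount points:", missing_mounts(expected_mounts))
```

A check like this only covers reachability and disks. Whether the agent shows up as connected in Buildkite still has to be verified separately.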
Phase 3: Gradual production rollout
Risk-minimized migration strategy executed over a two-week period:
- Sequential migration — Take agents offline individually, remove from Chef state, add to Ansible inventory
- Continuous monitoring — Apply configurations, monitor stability, and iterate based on findings
- Rollback preparation — Maintain Chef configurations as backup during transition period
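The one-at-a-time rollout can be pictured as a simple loop. The sketch below is illustrative only: the four callables (`drain`, `apply_ansible`, `is_healthy`, `rollback_to_chef`) are hypothetical hooks standing in for whatever your own tooling does, and halting at the first unhealthy agent is an assumption about how cautious the rollout should be.

```python
from typing import Callable, Iterable, Optional


def migrate_sequentially(
    agents: Iterable[str],
    drain: Callable[[str], None],
    apply_ansible: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback_to_chef: Callable[[str], None],
) -> tuple[list[str], Optional[str]]:
    """Migrate agents one at a time and stop at the first unhealthy agent.

    Returns (migrated_agents, failed_agent_or_None).
    """
    migrated: list[str] = []
    for agent in agents:
        drain(agent)          # take the agent offline before touching it
        apply_ansible(agent)  # configure it with the new playbooks
        if not is_healthy(agent):
            rollback_to_chef(agent)  # restore the previous configuration
            return migrated, agent   # halt so the failure can be investigated
        migrated.append(agent)
    return migrated, None


if __name__ == "__main__":
    # Stand-in hooks that only print; real hooks would call your own tooling.
    done, failed = migrate_sequentially(
        ["agent-1", "agent-2"],
        drain=lambda a: print("draining", a),
        apply_ansible=lambda a: print("applying playbooks to", a),
        is_healthy=lambda a: True,
        rollback_to_chef=lambda a: print("rolling back", a),
    )
    print("migrated:", done, "failed:", failed)
```

Because only one agent is out of rotation at any time, a bad playbook change affects a single machine and the rest of the fleet keeps serving builds.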
Phase 4: Infrastructure cleanup
Finalize migration and establish sustainable practices:
- Legacy removal — Eliminate obsolete Chef artifacts, deprecated automation, and outdated runbooks
- Documentation creation — Develop comprehensive playbook documentation and maintainable runbooks
- Knowledge transfer — Train team members on new Ansible workflows and troubleshooting procedures
Migration results and lessons learned
Our Chef-to-Ansible migration delivered significant improvements across multiple dimensions.
Quantifiable improvements
The migration delivered measurable results that transformed our infrastructure operations:
Deployment time
- Before — Manual, error-prone process requiring significant time to deploy each server.
- After — Automated, consistent process, with Ansible able to configure multiple servers simultaneously.
- Improvement — Dramatic reduction in deployment time, operational stress, and human error.
Documentation quality
- Before — Minimal documentation, heavy reliance on undocumented knowledge and outdated runbooks.
- After — Comprehensive, clear documentation for Ansible playbooks and server configurations.
- Improvement — Vastly improved documentation quality, enabling easier onboarding and maintenance.
Team productivity
- Before — Only a single team member could understand Chef cookbooks, leading to bottlenecks.
- After — The entire Platform team can maintain Ansible configurations, and the entire engineering team can contribute, since the configuration is simply human-readable YAML.
- Improvement — Higher speed of development and reduced reliance on specific individuals.
Configuration management complexity
- Before — Complex, nested Chef cookbooks with multiple dependencies, forgotten servers, more than five repositories, and dead code.
- After — Simple, modular Ansible playbooks with minimal dependencies and no external servers, all hosted in a single code repository.
- Improvement — Substantial reduction (more than 99 percent) in lines of code (LoC), from several hundred thousand lines to fewer than 2,000.
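As a quick sanity check on that percentage: exact line counts are not given here, so the sketch below assumes a conservative 200,000 lines before (the low end of "several hundred thousand") and 2,000 lines after. A larger starting figure or a smaller ending figure only increases the reduction.

```python
def reduction_percent(before: int, after: int) -> float:
    """Percentage of lines removed when going from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    # Assumed bounds, not measured values from the migration.
    print(f"{reduction_percent(200_000, 2_000):.1f}% fewer lines")
```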
Key takeaways for infrastructure modernization
Our migration taught us several valuable lessons:
- Tool selection isn’t everything — Building systems your team understands and can maintain matters more than choosing the “best” tool
- Documentation is critical — Clear documentation and maintainability matter as much as raw technical capabilities
- Alignment matters — Choose configuration management and hosting solutions that match your team’s skills, operational needs, and budget constraints
- Iterative approach reduces risk — Well-documented, phased migrations provide better foundations for scalability and resilience
- Deadlines can be catalysts — The Debian 10 EOL deadline forced us to address technical debt we might have otherwise deferred
What’s next?
Our migration to Ansible represents more than a tool change — it’s a shift toward simplicity and maintainability over legacy complexity. This foundation enables:
- Improved deployment speed — Deploy new CI agents in minutes, not hours
- Improved reliability — Reduced single points of failure in our infrastructure
- Enhanced collaboration — Multiple team members can contribute to infrastructure management
- Future flexibility — Vendor-neutral approach enables easier provider migrations if needed
By prioritizing team understanding and operational simplicity, we’ve established a sustainable platform for our growing development needs.
If you want to learn more about our historical approach to CI, be sure to check out these posts: