Modernizing CI build servers: How to migrate from Chef to Ansible

Hugh Woodfall

November 26, 2025

TL;DR

Learn why we migrated from Chef to Ansible configuration management for our continuous integration (CI) build servers.
Discover practical strategies for gradual migration without disrupting existing pipelines.
Understand cost comparisons and hosting considerations for build servers.
Compare Chef and Ansible to make informed decisions for your infrastructure.

What happens when your infrastructure configuration management system becomes a bottleneck? When deploying a single continuous integration (CI) server takes more than an hour and only one team member (barely) understands how it works, it’s time for a change.

This blog post explores our journey from Chef to Ansible for managing CI build servers — critical infrastructure supporting daily developer operations. It covers how we improved team maintainability, cut our configuration codebase by more than 99 percent, greatly improved scaling capabilities, and transformed our deployment process.

Why now?

Ever since I joined Nutrient in late 2023, I have wanted to modernize our CI build servers (running as self-hosted Buildkite agents) and the Chef cookbooks that configure them. Working with them daily was challenging: dead code, minimal documentation, and my own limited Ruby and Chef expertise made maintenance and new development difficult.

To be frank, we let our Chef ecosystem rot. Documentation was minimal or non-existent, key team members who understood the system had left, and previous upgrade attempts had failed. The infrastructure worked, but it had become a black box that we could barely maintain and found impossible to extend with new features.

Our fleet of bare-metal Linux servers from Hetzner Dedicated Server, running Debian 10 Buster, was functioning adequately. But with Debian 10 reaching end-of-life (EOL) on 30 June 2024, we faced growing security and compliance risks, which made the impending deadline a clear catalyst for change.

This deadline created the perfect opportunity to reassess our configuration management strategy and transition away from Chef, which had become a maintenance bottleneck.

Core issues identified

We identified several critical problems that made migration necessary:

Security risk — Running an unsupported OS created compliance issues and security vulnerabilities.
Knowledge gap — Chef became a black box for our Platform team, with failed previous upgrade attempts.
Complexity — Critical environments and variables were intertwined in unknown ways, with nested cookbooks referencing decommissioned components written more than a decade ago.
Outdated technology — Chef lagged behind modern DevOps tools, while Ansible offered greater flexibility and ease of use.
Scaling difficulties — Our Chef setup required significant time to deploy individual servers or make configuration changes.

Chef vs. Ansible: A practical comparison

When evaluating configuration management tools, we compared Chef and Ansible across several dimensions relevant to our infrastructure needs.

Aspect	Chef	Ansible
Language	Ruby-based DSL — full programming language enables complex logic and sophisticated abstractions	YAML-based playbooks — declarative and human-readable
Learning curve	Steep — involves learning Ruby and Chef’s architecture	Gentle — human-readable YAML, easier for teams to learn
Infrastructure requirements	Requires Chef server, database, and associated tooling — provides centralized management, reporting, and compliance tracking	Agentless — runs over SSH, no dedicated server, simpler initial setup but no native reporting or auditing
Deployment model	Pull-based — agents check in with Chef server and pull configurations, similar to GitOps workflow	Push-based — configurations are pushed from control node to target servers on demand
Code complexity	Complex nested cookbooks with dependencies — powerful abstractions and reusable code patterns	Simple, modular playbooks with minimal dependencies — easier to understand at a glance
Online documentation	Extensive, mature documentation with deep technical coverage	Straightforward, task-oriented documentation
Community and ecosystem	Mature ecosystem, albeit declining relative to modern tools	Active, growing community with strong support and modern integrations
Idempotency	Robust idempotency when properly designed, with comprehensive resource management	Native idempotency by default with clear task execution
Debugging	Centralized logging and reporting through Chef server, but can be challenging with nested dependencies	Easier with clear task execution and verbose output, but requires manual log aggregation
Multi-server deployment	Built-in orchestration through Chef server with centralized control and reporting	Native parallel execution capabilities with flexible ad-hoc execution
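
To make the "YAML-based playbooks" and "idempotency" rows more concrete, here is a minimal sketch of what a declarative Ansible playbook can look like. It is an illustration rather than the playbook we run in production: the file name, the `ci_agents` group, and the cache directory path are all assumptions.

```yaml
# site.yml: illustrative sketch only; the group name and path are assumptions
- name: Baseline configuration for CI build servers
  hosts: ci_agents                   # hypothetical inventory group
  become: true
  tasks:
    - name: Ensure Git is installed
      ansible.builtin.apt:
        name: git
        state: present
        update_cache: true
        cache_valid_time: 3600       # refresh the apt cache at most once an hour

    - name: Ensure the build cache directory exists
      ansible.builtin.file:
        path: /var/cache/ci-builds   # hypothetical path
        state: directory
        mode: "0755"
```

Running something like `ansible-playbook -i inventory.yml site.yml` from a control node pushes this state to the hosts over SSH. Each task describes a desired end state, so a second run against an unchanged host should report the tasks as `ok` rather than `changed`.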

Why move away from Chef?

Chef was an industry standard for configuration management, but our experience revealed significant limitations beyond the core issues we faced:

High learning curve — Requires Ruby knowledge and understanding of Chef’s architecture, which the current Platform team does not have
Infrastructure overhead — Maintaining the Chef server, database, and associated toolset created ongoing costs and complexity
Declining ecosystem — Community and industry support declined relative to modern tools like Ansible
Technical debt — The time required to refactor and document our legacy cookbooks would be enormous

These factors, combined with our immediate challenges, made migrating to a more modern, approachable configuration management tool a faster and safer long-term strategy than maintaining our Chef infrastructure.

Requirements for CI server migration

We established clear criteria for our migration, outlined below, to ensure success.

Must-have requirements

Minimal disruption — Existing pipelines remain functional, and developer teams should not notice the transition
Current OS — Stay up to date to meet security practices and compliance requirements
Team maintainability — New setup must be understood and managed by current team members
Clear documentation — Eliminate undocumented knowledge with easy-to-follow runbooks
Bare-metal servers — Support specialized workloads like Android emulation requiring low-level hardware access
Rapid, automated deployment — Reduce time to deploy new servers and eliminate as many manual steps as possible

Nice-to-have features

Single infrastructure provider — Simplified management with consolidated hosting
Vendor-neutral tooling — Avoid lock-in to provider-specific tools such as AWS CloudFormation or Azure ARM templates, keeping the option to change providers
Cost optimization — Maximize value and minimize infrastructure expenses
Elastic scaling — Support easy scaling up or down as requirements change

Solution options explored

We evaluated multiple approaches to find the best fit for our requirements.

Configuration management tools

Packer — Excellent for image-based scaling in cloud environments like AWS EC2, but not suitable for our bare-metal requirements.

Ansible — Simple, readable playbooks with no dedicated server or database requirements. Offered better flexibility for our specific infrastructure needs.

Hosting considerations

We then compared three hosting options for our specialized requirements.

Hetzner Dedicated Server

Bare-metal servers perfect for Android builds
Proven reliability with existing infrastructure
Hardware-level access for emulation workloads
Manual scaling and no possibility of infrastructure as code (IaC)

Hetzner Cloud

Virtualized environment with good flexibility
Integrated support for Packer, Ansible, and Terraform
Limited by vertical scalability constraints

AWS EC2

Comprehensive autoscaling capabilities
Provider consolidation benefits
Significantly higher costs and complexity

Cost comparison analysis

Provider	Server/instance type	Hosting type	Specifications	Storage	Monthly cost*	Autoscaling
Hetzner Dedicated Server	AX52	Bare-metal	AMD Ryzen 7 7700, 64GB DDR5, 8 cores	2×1TB Gen4 NVMe SSD	$65 + $43 one-off setup fee	No autoscaling
AWS EC2	c5.metal	Bare-metal	96 vCPU, 192GB RAM	EBS gp3: $8/100GB	$2,980 + storage	Available with AWS autoscaling groups (ASG)
Hetzner Cloud	CCX33	Virtualized	8 vCPU, 32GB RAM	240GB SSD	$63	Supported
AWS EC2	m5.2xlarge	Virtualized	8 vCPU, 32GB RAM	EBS gp3: $8/100GB	$280 + storage	Full autoscaling capabilities

*Pricing at time of evaluation in USD, hosted in European data centers. AWS costs exclude EBS storage, data transfer, and other fees.

Note: For bare-metal comparisons, Hetzner Dedicated Server (AX52) and AWS EC2 (c5.metal) represent the closest comparable server specifications available from each provider.
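
Because the two bare-metal rows describe machines of very different sizes, a rough per-core normalization helps put the headline prices in context. The arithmetic below uses only the figures from the table and ignores the one-off setup fee, storage, and data transfer. Treating one dedicated core as equivalent to one vCPU is a simplifying assumption, not a benchmark.

```latex
% Bare-metal rows: monthly cost per core (AX52) or per vCPU (c5.metal)
\[
\frac{\$65}{8\ \text{cores}} \approx \$8.13,
\qquad
\frac{\$2{,}980}{96\ \text{vCPU}} \approx \$31.04,
\qquad
\frac{31.04}{8.13} \approx 3.8
\]

% Virtualized rows: both are 8 vCPU with 32GB RAM, so prices compare directly
\[
\frac{\$280}{\$63} \approx 4.4
\]
```

Under these assumptions, the AWS options cost roughly four times as much per unit of compute, before storage and transfer fees are added.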

Each hosting option has tradeoffs:

AWS offers comprehensive autoscaling and managed services, albeit at significantly higher costs. Scaling and replacing individual servers is automatic and painless.
Hetzner Cloud provides good flexibility and IaC support but has vertical scalability constraints.
Hetzner Dedicated Server offers great value for bare-metal servers, but lacks any autoscaling capabilities and requires fairly meticulous scripts to configure. Contacting Hetzner support is often required to debug any hardware issues.

Decision: Hetzner Dedicated Server offered the best value for our bare-metal requirements, aligning with our existing infrastructure expertise. The cost savings and proven reliability outweighed the lack of autoscaling capabilities.

The path forward: Ansible on Hetzner Dedicated Server

We chose to modernize our configuration management with Ansible while maintaining our proven Hetzner Dedicated Server infrastructure. Our strategy focused on three key principles: replacing Chef cookbooks with Ansible playbooks, executing a gradual transition to minimize risk, and documenting processes for team maintainability.

Implementation plan

Once we decided what to do, we structured our migration as a four-phase process to minimize risk and ensure success.

Phase 1: Proof-of-concept development

Develop initial Ansible playbooks to configure Linux servers with essential components:

Buildkite Agent — Complete installation and configuration setup
Android CI support — Testing capabilities for specialized build requirements
Essential tooling — Docker, Git, Vim, curl, TLS certificates, and development dependencies
Authentication systems — Repository access and HyperDX Agent integration for observability
Hetzner-specific features — Rescue mode access and disk encryption configuration
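
As an illustration of the first few items in this list, a playbook along the following lines could install the essential tooling and the Buildkite agent. This is a minimal sketch, not our production playbook. It assumes that Buildkite's apt repository is already configured on the host, that the package name and configuration file path match Buildkite's Debian packaging (worth verifying against the current Buildkite documentation), and that `buildkite_agent_token` is a hypothetical variable supplied from a secret store such as Ansible Vault.

```yaml
# Illustrative Phase 1 sketch; group name and variable name are assumptions
- name: Configure a CI build server
  hosts: ci_agents                   # hypothetical inventory group
  become: true
  tasks:
    - name: Install essential tooling
      ansible.builtin.apt:
        name:
          - docker.io
          - git
          - vim
          - curl
          - ca-certificates
        state: present
        update_cache: true

    - name: Install the Buildkite agent (repository assumed to be configured)
      ansible.builtin.apt:
        name: buildkite-agent
        state: present

    - name: Set the agent token in the agent configuration file
      ansible.builtin.lineinfile:
        path: /etc/buildkite-agent/buildkite-agent.cfg   # assumed default path
        regexp: '^token='
        line: 'token="{{ buildkite_agent_token }}"'
      no_log: true                   # keep the secret out of task output
      notify: Restart buildkite-agent

    - name: Ensure the agent starts on boot and is running
      ansible.builtin.service:
        name: buildkite-agent
        state: started
        enabled: true

  handlers:
    - name: Restart buildkite-agent
      ansible.builtin.service:
        name: buildkite-agent
        state: restarted
```

The handler restarts the agent only when the configuration file actually changes, so repeated runs leave a correctly configured server untouched.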

Phase 2: Validation and testing

Comprehensive testing to ensure reliability before production deployment:

Pipeline validation — Execute major pipelines, including monorepo and website builds, on test nodes
Health monitoring — Verify server status through Buildkite, SSH access, and essential mount points for disk drives
Performance benchmarking — Compare deployment times and resource utilization against Chef baseline
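
The health monitoring checks can themselves be expressed as a playbook. The sketch below is one possible approach, not necessarily how we ran our checks: the mount points in `required_mounts` and the `ci_agents` group are hypothetical, and the service key assumes a systemd unit named `buildkite-agent.service`. SSH access is verified implicitly, because Ansible has to connect over SSH to gather facts.

```yaml
# Illustrative health check; mount points and group name are assumptions
- name: Verify CI build server health
  hosts: ci_agents                   # hypothetical inventory group
  gather_facts: true                 # fails early if the host is unreachable over SSH
  vars:
    required_mounts:                 # hypothetical mount points
      - /
      - /var/lib/docker
  tasks:
    - name: Check that every required mount point is mounted
      ansible.builtin.assert:
        that:
          - item in (ansible_facts.mounts | map(attribute='mount') | list)
        fail_msg: "{{ item }} is not mounted"
      loop: "{{ required_mounts }}"

    - name: Collect service states
      ansible.builtin.service_facts:

    - name: Check that the Buildkite agent service is running
      ansible.builtin.assert:
        that:
          - ansible_facts.services['buildkite-agent.service'].state == 'running'
        fail_msg: "buildkite-agent is not running"
```

A check like this only confirms that the server looks healthy from the operating system's point of view. Whether the agent shows up as connected still has to be confirmed in Buildkite itself.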

Phase 3: Gradual production rollout

Risk-minimized migration strategy executed over a two-week period:

Sequential migration — Take agents offline individually, remove from Chef state, add to Ansible inventory
Continuous monitoring — Apply configurations, monitor stability, and iterate based on findings
Rollback preparation — Maintain Chef configurations as backup during transition period
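
One way to express a sequential rollout in Ansible is to grow the inventory one host at a time and apply the play with `serial: 1`. The sketch below shows two separate files in one block. The host names, file names, and group name are hypothetical, and the use of `serial` is an assumption about how such a rollout could be implemented rather than a record of what we ran.

```yaml
# inventory.yml: a host is added here only after it has been removed from Chef
all:
  children:
    ci_agents:
      hosts:
        ci-agent-01.example.com:     # hypothetical host names
        ci-agent-02.example.com:
---
# migrate.yml: apply the configuration to one server at a time
- name: Apply the CI agent configuration sequentially
  hosts: ci_agents
  become: true
  serial: 1                          # one host per batch
  tasks:
    - name: Ensure the Buildkite agent is enabled and running
      ansible.builtin.service:
        name: buildkite-agent
        state: started
        enabled: true
```

To touch only the server that was just migrated, the run can be restricted with `--limit`, for example `ansible-playbook -i inventory.yml migrate.yml --limit ci-agent-02.example.com`.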

Phase 4: Infrastructure cleanup

Finalize migration and establish sustainable practices:

Legacy removal — Eliminate obsolete Chef artifacts, deprecated automation, and outdated runbooks
Documentation creation — Develop comprehensive playbook documentation and maintainable runbooks
Knowledge transfer — Train team members on new Ansible workflows and troubleshooting procedures

Migration results and lessons learned

Our Chef-to-Ansible migration delivered significant improvements across multiple dimensions.

Quantifiable improvements

The migration delivered measurable results that transformed our infrastructure operations:

Deployment time

Before — Manual, error-prone process requiring significant time to deploy each server.
After — Automated, consistent process, with Ansible able to configure multiple servers simultaneously.
Improvement — Dramatic reduction in deployment time, operational stress, and human error.

Documentation quality

Before — Minimal documentation, heavy reliance on undocumented knowledge and outdated runbooks.
After — Comprehensive, clear documentation for Ansible playbooks and server configurations.
Improvement — Vastly improved documentation quality, enabling easier onboarding and maintenance.

Team productivity

Before — Only a single team member could understand Chef cookbooks, leading to bottlenecks.
After — The entire Platform team can maintain the Ansible configurations, and the wider engineering team can contribute, because the playbooks are human-readable YAML.
Improvement — Higher speed of development and reduced reliance on specific individuals.

Configuration management complexity

Before — Complex, nested Chef cookbooks with multiple dependencies, forgotten servers, more than five repositories, and dead code.
After — Simple, modular Ansible playbooks with minimal dependencies and no external servers, all hosted in a single code repository.
Improvement — Substantial reduction (more than 99 percent) in lines of code (LoC), from several hundred thousand lines to fewer than 2,000.
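
As a quick sanity check on the 99 percent figure: the exact line counts are not given here, so the calculation below takes 200,000 as a deliberately conservative stand-in for "several hundred thousand" lines. That number is an assumption for illustration only.

```latex
% N_before >= 200,000 (assumed lower bound), N_after < 2,000
\[
\text{reduction}
= 1 - \frac{N_{\text{after}}}{N_{\text{before}}}
> 1 - \frac{2{,}000}{200{,}000}
= 0.99
\]
```

Any larger starting codebase only pushes the reduction further above 99 percent.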

Key takeaways for infrastructure modernization

Our migration taught us several valuable lessons:

Tool selection isn’t everything — Building systems your team understands and can maintain matters more than choosing the “best” tool
Documentation is critical — Clear documentation and maintainability are as important as raw technical capabilities
Alignment matters — Choose configuration management and hosting solutions that match your team’s skills, operational needs, and budget constraints
Iterative approach reduces risk — Well-documented, phased migrations provide better foundations for scalability and resilience
Deadlines can be catalysts — The Debian 10 EOL deadline forced us to address technical debt we might have otherwise deferred

What’s next?

Our migration to Ansible represents more than a tool change — it’s a shift toward simplicity and maintainability over legacy complexity. This foundation enables:

Improved deployment speed — Deploy new CI agents in minutes, not hours
Improved reliability — Reduced single points of failure in our infrastructure
Enhanced collaboration — Multiple team members can contribute to infrastructure management
Future flexibility — Vendor-neutral approach enables easier provider migrations if needed

By prioritizing team understanding and operational simplicity, we’ve established a sustainable platform for our growing development needs.
