System Provisioning on Amazon AWS

The AWS platform has a bewildering number of components. I find it useful to think of the different components in terms of levels of abstraction for platform deployment. This is reflected somewhat in the AWS documentation when you select Products in the drop-down menu at the top left.

If you’re deploying just back-end computing tasks, then EC2, Lambda (still in preview), and Auto Scaling are a good fit. For Analytics deployments, EMR, Kinesis, and Data Pipeline. And so on.

For server provisioning, the canonical solution is OpsWorks. For web-application or worker deployments, Elastic Beanstalk. And, at a certain scale (multi-region deployments), there's CloudFormation.

At Alight, we use the AWS platform mostly for ETL work and data warehousing. That is, we don’t host user-facing web applications. So Beanstalk isn’t necessarily needed. One curious aspect of Beanstalk is that it automatically creates SQS queues for each Beanstalk deployment — our current needs don’t include SQS, so this was a minor mismatch.

(Why We Decided Against) Using Chef

The best fit for our needs is stand-alone server provisioning in a repeatable fashion, for which OpsWorks is a logical choice. Except…

OpsWorks can be used with stock machine images (AMIs), but for any custom configuration (assuming it's not baked into your AMI), your options are Chef and… Chef.

Unfortunately, setting up a developer environment for Chef requires not only the Chef development kit, but also Berkshelf and ChefSpec, all of which depend on Ruby. Since we’re a Python shop, we don’t, by default, have Ruby set up on our computers, and maintaining the Ruby ecosystem of tools is a fair bit of overhead.

Beyond just setup, using Chef involves some level of knowledge of Ruby, Berkshelf, Knife, and the chef-client utility.

Using Ansible

In the configuration-management world, Ansible is another popular option, and it works in the same declarative, idempotent manner as Chef. Ansible, though, uses YAML, which is a very straightforward format to write.

Since OpsWorks doesn't support Ansible, we have to use another means to get Ansible working on EC2. Luckily, EC2 includes a User Data field, where you can provide a shell script that will be run when the EC2 instance is first created. That script still needs to fetch the provisioning code from somewhere: CodeCommit seems promising, but since it was only recently announced and is still in preview, we host our own Gitlab server. Our provisioning playbooks (and cookbooks) are hosted there; new EC2 instances perform a git clone to get their provisioning instructions and then run the Ansible playbook in local connection mode.

The basic user data script we use looks something like

#!/bin/bash

sudo apt-get update
sudo apt-get install -y git

# Install latest ansible
sudo apt-get install -y software-properties-common
sudo apt-add-repository -y ppa:ansible/ansible
sudo apt-get update
sudo apt-get install -y ansible

git clone alight/ansible-playbooks.git
cd ansible-playbooks/
ansible-playbook -c local -i hosts -s aws-playbook.yml
logger "Finished provisioning"

Given an Ansible playbook repository hosted elsewhere, that's all you need. There are ample examples of Ansible playbooks available publicly; briefly, we've used this approach to provision Jenkins, Gitlab, and an SFTP server.
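
To give a flavor of what such a playbook contains, here is a deliberately minimal sketch, not our actual playbook; the package name, user name, and task details are placeholders for illustration only:

---
# A minimal illustration; run on the instance itself in the same way as
# aws-playbook.yml above (ansible-playbook -c local -i hosts -s ...)
- hosts: all
  sudo: yes
  tasks:
    - name: Install the OpenSSH server used for SFTP
      apt: name=openssh-server state=present update_cache=yes

    - name: Create a dedicated SFTP user
      user: name=transfer shell=/usr/sbin/nologin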

Ansible within Auto Scaling

This same user data field is available in EC2 Launch Configurations, which are used for Auto Scaling Groups. Since Auto Scaling is the fundamental AWS unit for growing a deployment horizontally or dynamically based on load, this dovetails nicely.
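
Wiring that up with the AWS CLI looks roughly like the following; the names, AMI ID, instance type, and availability zone are placeholders, and userdata.sh stands for the bootstrap script shown above:

# Launch configuration whose user data runs the Ansible bootstrap script
aws autoscaling create-launch-configuration \
    --launch-configuration-name etl-worker-lc \
    --image-id ami-xxxxxxxx \
    --instance-type t2.medium \
    --user-data file://userdata.sh

# Auto Scaling group that keeps at least one provisioned instance running
aws autoscaling create-auto-scaling-group \
    --auto-scaling-group-name etl-worker-asg \
    --launch-configuration-name etl-worker-lc \
    --min-size 1 --max-size 2 \
    --availability-zones us-east-1a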

Auto Scaling is also quite useful while developing and testing the provisioning, since an Auto Scaling group ensures that a minimum number of servers is always running. The basic workflow goes something like this:

  • Push code changes to the playbook repo
  • Manually terminate the running instance in your Auto Scaling group
  • Auto Scaling starts a new instance, which pulls the latest changes

Using Vagrant for Development

Vagrant is a wrapper around VirtualBox (and other virtual machine providers), and it supports Ansible as a provisioner. Thus it's fairly straightforward to use Vagrant to test Ansible scripts on your local computer. The first steps are to install VirtualBox and Vagrant (left as an exercise to the reader). Then, the basic outline looks like the following:

vagrant box add ubuntu/trusty64 --provider virtualbox    
vagrant up

Within the Vagrantfile, you specify the playbook that should be run and can additionally provide extra_vars which are passed on to the Ansible configuration environment.

...
config.vm.provision "ansible" do |ansible|
    ansible.playbook = "playbooks/playbook.yml"
    ansible.host_key_checking = false

    ansible.extra_vars = {
      default_user: "vagrant",
      mount_point: "/dev/sdb",
    }
end
...

The extra_vars can be used to override variables which might be set in the playbook's group_vars but need different values for Vagrant (e.g. mount_point, which for Ubuntu on EC2 would be /dev/xvdf but for VirtualBox/Vagrant needs to be /dev/sdb).
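
As a simplified example, the EC2 defaults might live in the playbook's group_vars, with Vagrant overriding them through extra_vars as shown above; the file name and default user here are illustrative rather than our exact layout:

# group_vars/all.yml -- defaults that apply when provisioning on EC2
default_user: ubuntu
mount_point: /dev/xvdf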

Testing and Verification

To verify that the Ansible provisioning is running, or did run, as expected on a new EC2 instance, connect to the instance via SSH and monitor the cloud-init-output.log and syslog log files:

tail -f /var/log/cloud-init-output.log
tail -f /var/log/syslog

The user data script's output goes to the cloud-init log, while any system output resulting from the commands in that script goes to syslog. Using a testing framework such as BATS, we can create what are essentially unit tests for the provisioning.

One example of why testing is necessary is the configuration of logrotate. Since logrotate won't generally run for at least a day and, in some cases, a week after server creation, we need to ensure that any changes we make to files under /etc/logrotate.d/ are valid, without running logrotate directly. One way to do this is to check the return value of the command logrotate -d /etc/logrotate.d/syslog.
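
A minimal BATS test file along those lines might look like the following; the second test is just an example of the kind of per-package assertion we might add, and the file name is arbitrary:

#!/usr/bin/env bats

# provision.bats -- run on the instance with: bats provision.bats

@test "logrotate config for syslog is valid" {
  run logrotate -d /etc/logrotate.d/syslog
  [ "$status" -eq 0 ]
}

@test "git was installed by the user data script" {
  run which git
  [ "$status" -eq 0 ]
}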

CloudWatch and the Autoscaling trifecta

The final piece of the platform deployment is logging. Using a combination of provisioning automation, log aggregation, and alarms, AWS provides the mechanism by which you can ensure that your platform is always on, and grows (and shrinks) in proportion to load.

The basic information flow here is roughly as follows (a CLI sketch of steps 2 and 3 follows the list):

  1. Let EC2 instances forward log messages to CloudWatch
  2. Define metric filters in CloudWatch based on those logs
  3. Create alarms from those metric filters
  4. Trigger EC2 Auto Scaling actions based on the alarms
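
For steps 2 and 3, the AWS CLI calls look roughly like the following; the log group name, filter pattern, namespace, threshold, and alarm action ARN are all placeholders, not our actual values:

# Step 2: turn ERROR lines in the forwarded syslog into a custom metric
aws logs put-metric-filter \
    --log-group-name /var/log/syslog \
    --filter-name application-errors \
    --filter-pattern "ERROR" \
    --metric-transformations metricName=AppErrors,metricNamespace=ETL,metricValue=1

# Step 3: alarm when that metric spikes; the alarm action can be an SNS
# topic or an Auto Scaling policy ARN (which covers step 4)
aws cloudwatch put-metric-alarm \
    --alarm-name app-error-spike \
    --metric-name AppErrors \
    --namespace ETL \
    --statistic Sum \
    --period 300 \
    --evaluation-periods 1 \
    --threshold 10 \
    --comparison-operator GreaterThanOrEqualToThreshold \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts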

The creation of metrics, alarms, and autoscaling triggers is all well documented in the AWS documentation. I will only briefly mention the approach we use for the first step, log forwarding. Basically, within our Ansible playbook, we configure the CloudWatch Logs agent to forward all of syslog. Within our code base, we use Python's SysLogHandler to forward all application logging to syslog. And that's basically it. Whenever a new server is provisioned via an autoscaling trigger, log messages automatically flow into CloudWatch, where existing filters and alarms, built on a Log Group, will catch errors or other application-layer messages.
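
On the application side, the SysLogHandler setup is only a few lines of standard-library Python. A minimal sketch, with the logger name and log format chosen purely for illustration:

import logging
from logging.handlers import SysLogHandler

# Send application log records to the local syslog socket; the CloudWatch
# Logs agent picks them up from /var/log/syslog and forwards them.
handler = SysLogHandler(address="/dev/log")
handler.setFormatter(logging.Formatter("etl-app: %(levelname)s %(message)s"))

logger = logging.getLogger("etl-app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("pipeline run started")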

It Takes an Ecosystem …

This post summarizes a large part of my work over the past few months at Alight. When I started, we were using Chef and OpsWorks, but the cookbooks we had were fairly brittle and needed updating. And, initially, I thought we had to keep OpsWorks in order to use Auto Scaling (and the related Elastic Load Balancing). But among the myriad components that AWS gives you, there are really several different, independent approaches you can take to growing and hardening a deployment.

I've gone into some detail here explaining how we handle configuration management and scaling without OpsWorks, but the broader point is that you don't have to use any particular set of tools. If your platform is geared towards streaming and analytics, you might run exclusively Kinesis, EMR, and DynamoDB (and, at some point, Lambda), and you possibly wouldn't need a single EC2 server. Or, if you run multiple independent user-facing web apps, Elastic Beanstalk, SQS, SNS, and Elastic Transcoder might be all you need.

Granted, you’ll probably always make use of IAM (user accounts and authentication), S3 (file storage), and a few other pieces (CloudTrail can provide notifications of AWS actions, say). But the particular flavor of an AWS deployment really comes down to specific needs and service offerings.
