3 CloudFormation Pitfalls and Ways Out

Save yourself time and head-scratching

Peter Njihia
Oct 8, 2019

If you have any solutions in the AWS cloud, you’ve likely crossed paths with AWS CloudFormation, AWS’s infrastructure provisioning service. You describe your resources in a template, a text file written in either JSON or YAML. It’s very effective and reliable, but there are edge cases that can leave you spinning in place without progress. I’ll highlight some of my experiences with these edge cases and how I got past them. There are others, but these stand out because they are either recent or had me hitting walls for quite some time.


1. Tag-Based RBAC Policies

Many organizations use tags and name prefixes for Role-Based Access Control (RBAC). For instance, you could have a policy where the Web team always tags its resources with Group = Web, and that tag is then bound to permissions that allow members of the Web team to provision and manage resources tagged “Web”. The same can be applied to other groups. Let’s take a look at this IAM policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:Describe*"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ec2:Create*"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:RequestTag/Group": "Web"
        }
      }
    },
    {
      "Effect": "Allow",
      "Action": [
        "ec2:Delete*",
        "ec2:Modify*",
        "ec2:CreateTags"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "ec2:ResourceTag/Group": "Web"
        }
      }
    },
    {
      "Effect": "Deny",
      "Action": [
        "ec2:DeleteTags"
      ],
      "Resource": "*",
      "Condition": {
        "StringLike": {
          "aws:RequestTag/Group": "*"
        }
      }
    }
  ]
}

In summary, the policy allows the Web team to create EC2 resources as long as they are tagged “Web”. It also allows them to tag existing resources that are missing the tag. Lastly, it blocks the Web team from ever deleting the “Group” tag (being able to do so would open up the opportunity to change another team’s ownership of a resource).

Consider the situation where the Web team has an untagged resource in an existing CloudFormation stack, so they update the template to add the tag: all within their permissions, and the right thing to do. During the update, something goes wrong somewhere else and CloudFormation needs to roll back. It will try to delete the newly added tag, which is of course denied, leaving your stack in the infamous UPDATE_ROLLBACK_FAILED state. Continuing the rollback will keep failing for the same reason.
Proposed Fix: The trite one first: contact your administrator! Someone with a more permissive role that allows deleting tags can complete the rollback for you. DO NOT delete the resource yourself; CloudFormation will still attempt to delete the tag (even if it’s no longer there). Deleting the stack is also an option, perhaps as a last resort.
How to avoid: Enforce tags at creation time, and have someone with admin-level permissions resolve any existing untagged resources afterwards.
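A minimal sketch of the first half of that advice, assuming the tag key Group and value Web from the policy above; the parameter, resource, and description values are hypothetical. The tag is declared on the resource from the very first stack create, so no later update has to add it and no rollback has to remove it. Whether CloudFormation sends the tags inside the create call itself (which the aws:RequestTag/Group condition needs) or tags the resource afterwards may depend on the resource type, so test it against your own policy.

```json
{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Description": "Sketch with hypothetical names: tag EC2 resources with Group=Web from the first create, so no later update has to add the tag.",
  "Parameters": {
    "VpcId": {
      "Type": "AWS::EC2::VPC::Id",
      "Description": "VPC for the example security group"
    }
  },
  "Resources": {
    "WebSecurityGroup": {
      "Type": "AWS::EC2::SecurityGroup",
      "Properties": {
        "GroupDescription": "Example Web team security group",
        "VpcId": { "Ref": "VpcId" },
        "Tags": [
          { "Key": "Group", "Value": "Web" }
        ]
      }
    }
  }
}
```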

2. Renaming Application Load Balancers

You have a well-configured Application Load Balancer (ALB), already in use, with listeners and target groups in place and mapped to a URL in Route 53; everyone is happy. Then comes a compliance call, bringing to your attention that the naming convention is wrong (it could be breaking account-level scripts, etc.). No big deal, let’s just fix it in CloudFormation. Caution: renaming the ALB requires a replacement, so put everyone on alert for a brief downtime. The replacement forces the listener to be replaced as well. Here’s the sequence of a replacement event:

  1. Create new resource
  2. Move dependencies to new resource
  3. Delete old resource

CloudFormation will create a new ALB and a new listener, and, as expected, it will want to attach the existing target groups to the new listener. Here comes the trap: a target group can be associated with only one load balancer at a time, and the existing target groups are still attached to the old ALB through its listener. So the stack update fails outright.
Proposed Fix: Detach the target groups from the existing ALB by deleting the listener rules that reference them and/or pointing the listener at a dummy target group. Deleting the stack might be a bit extreme, but it is a solution nonetheless.
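One way to do the detach inside the template (rather than in the console, which leaves the stack drifted) is a two-step update. The sketch below is a hypothetical step one: it adds an empty placeholder target group and points the existing listener’s default action at it, so AppTargetGroup is no longer attached to the old ALB. All names are assumptions standing in for your real stack, and any listener rules that forward to the real target group would need the same treatment. Expect the ALB to return HTTP 503 while it is parked on a target group with no targets. In step two, you rename the ALB and point the listener back at AppTargetGroup; the placeholder can be removed in a later update.

```json
{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Description": "Sketch, step 1 of 2, hypothetical names: park the listener on an empty placeholder target group before renaming the ALB.",
  "Parameters": {
    "VpcId": { "Type": "AWS::EC2::VPC::Id" },
    "SubnetIds": { "Type": "List<AWS::EC2::Subnet::Id>" }
  },
  "Resources": {
    "AppLoadBalancer": {
      "Type": "AWS::ElasticLoadBalancingV2::LoadBalancer",
      "Properties": {
        "Name": "my-current-alb",
        "Subnets": { "Ref": "SubnetIds" }
      }
    },
    "AppTargetGroup": {
      "Type": "AWS::ElasticLoadBalancingV2::TargetGroup",
      "Properties": {
        "Port": 80,
        "Protocol": "HTTP",
        "VpcId": { "Ref": "VpcId" }
      }
    },
    "PlaceholderTargetGroup": {
      "Type": "AWS::ElasticLoadBalancingV2::TargetGroup",
      "Properties": {
        "Port": 80,
        "Protocol": "HTTP",
        "VpcId": { "Ref": "VpcId" }
      }
    },
    "HttpListener": {
      "Type": "AWS::ElasticLoadBalancingV2::Listener",
      "Properties": {
        "LoadBalancerArn": { "Ref": "AppLoadBalancer" },
        "Port": 80,
        "Protocol": "HTTP",
        "DefaultActions": [
          { "Type": "forward", "TargetGroupArn": { "Ref": "PlaceholderTargetGroup" } }
        ]
      }
    }
  }
}
```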
How to avoid: Name your target groups based on the ALB name, so that if it changes, new target groups are created as well. If that’s not pretty, create a “Suffix” parameter and append it to the target group name; if you ever need to force a rename, tweak the suffix, keeping the template crisp.
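A minimal sketch combining both ideas, assuming an ALB named through a hypothetical AlbName parameter plus the Suffix parameter described above. Because the target group name is derived from both, changing either one changes the name, and a name change replaces the target group, so the new listener gets a fresh target group instead of fighting over the old one. The MaxLength values keep the generated name within the 32-character limit for target group names; the other resource names are illustrative only.

```json
{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Description": "Sketch with hypothetical names: derive the target group name from the ALB name plus a suffix, so renaming the ALB or bumping the suffix also replaces the target group.",
  "Parameters": {
    "AlbName": {
      "Type": "String",
      "MaxLength": "20",
      "AllowedPattern": "^[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?$"
    },
    "Suffix": {
      "Type": "String",
      "Default": "v1",
      "MaxLength": "4",
      "AllowedPattern": "^[a-zA-Z0-9]+$",
      "Description": "Change this value to force new target group names"
    },
    "VpcId": { "Type": "AWS::EC2::VPC::Id" },
    "SubnetIds": { "Type": "List<AWS::EC2::Subnet::Id>" }
  },
  "Resources": {
    "AppLoadBalancer": {
      "Type": "AWS::ElasticLoadBalancingV2::LoadBalancer",
      "Properties": {
        "Name": { "Ref": "AlbName" },
        "Subnets": { "Ref": "SubnetIds" }
      }
    },
    "AppTargetGroup": {
      "Type": "AWS::ElasticLoadBalancingV2::TargetGroup",
      "Properties": {
        "Name": { "Fn::Sub": "${AlbName}-tg-${Suffix}" },
        "Port": 80,
        "Protocol": "HTTP",
        "VpcId": { "Ref": "VpcId" }
      }
    },
    "HttpListener": {
      "Type": "AWS::ElasticLoadBalancingV2::Listener",
      "Properties": {
        "LoadBalancerArn": { "Ref": "AppLoadBalancer" },
        "Port": 80,
        "Protocol": "HTTP",
        "DefaultActions": [
          { "Type": "forward", "TargetGroupArn": { "Ref": "AppTargetGroup" } }
        ]
      }
    }
  }
}
```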

3. Unstable ASG (Can’t Acquire Desired State: min/desired/max)

I once ran into a weird error where an Auto Scaling group (ASG) would not stabilize, so instances kept being cycled out. I had trouble reproducing this at first, but realized it had everything to do with having multiple CodeDeploy deployment groups on the same ASG (a design to reconsider, and in fact not recommended; more in this AWS blog). Instances launched by an ASG with multiple CodeDeploy hooks will receive multiple notifications, but the CodeDeploy agent handles one command at a time. There is no ordering of the deployment steps across groups, and if one fails, it might cascade to the others, or they may get starved. What sucks is that your CloudFormation stack might end up stuck in a rollback-failed state where you can’t push updates: not a pleasant moment when you are chasing productivity.
Proposed Fix: Delete the lifecycle hooks on the ASG. You can do this in the console, from the Auto Scaling group’s settings, or with the AWS CLI:

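# List the hooks on the group to find their names
aws autoscaling describe-lifecycle-hooks --auto-scaling-group-name my-auto-scaling-group
# Delete a hook by name (repeat for each hook you need to remove)
aws autoscaling delete-lifecycle-hook --lifecycle-hook-name my-lifecycle-hook --auto-scaling-group-name my-auto-scaling-group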

How to avoid: Consolidate the CodeDeploy automation for an ASG into one deployment group. If that is not practical, consider targeting instances by tags instead, but remember that new instances created by scale-out events will not automatically receive the deployment. So I recommend the first option.
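A minimal sketch of the consolidated setup, with hypothetical parameter and resource names: one CodeDeploy application and exactly one deployment group bound to the ASG. As I understand it, CodeDeploy registers a lifecycle hook on each Auto Scaling group attached to a deployment group, so keeping it to one group means one hook and one deployment per new instance. The steps that used to be split across groups would then go into a single revision and its AppSpec file.

```json
{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Description": "Sketch with hypothetical names: a single CodeDeploy deployment group per Auto Scaling group.",
  "Parameters": {
    "AsgName": {
      "Type": "String",
      "Description": "Name of the existing Auto Scaling group"
    },
    "CodeDeployServiceRoleArn": {
      "Type": "String",
      "Description": "ARN of a role that CodeDeploy can assume"
    }
  },
  "Resources": {
    "WebApplication": {
      "Type": "AWS::CodeDeploy::Application",
      "Properties": {
        "ComputePlatform": "Server"
      }
    },
    "WebDeploymentGroup": {
      "Type": "AWS::CodeDeploy::DeploymentGroup",
      "Properties": {
        "ApplicationName": { "Ref": "WebApplication" },
        "ServiceRoleArn": { "Ref": "CodeDeployServiceRoleArn" },
        "AutoScalingGroups": [ { "Ref": "AsgName" } ],
        "DeploymentConfigName": "CodeDeployDefault.OneAtATime"
      }
    }
  }
}
```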

I hope this saves you some head-scratching and, most importantly, time, as every distraction and nuance takes a hit on productivity and focus. Just sharing my 2 cents.

“It ain’t no fun if the homies can’t have none.” (Snoop Dogg)

Peter Njihia

I'm a Cloud Architect/SRA/DevSecOps Engineer helping folks build and run in the cloud efficiently.