Clean 'em! Getting rid of unused AMIs using Python Lambda and Terraform


We are all aware that in today's AWS cloud world, immutable infrastructure and deployments are preferable. Immutable deployments also mean that we often create many Amazon Machine Images (AMIs). To reduce storage costs, we might want to delete (or deregister, in AWS speak) the unused AMIs and their associated EBS snapshots.

In this blog post I will describe how to set up an AMI cleaner for unused images.

The main part is a Lambda function. It finds unused images and deletes them together with their EBS snapshots. The function is written in Python and uses Boto3, the AWS SDK for Python. It also relies on JMESPath, the JSON query language that the AWS CLI uses for its --query option. The function takes the following in the “event” argument:

  • regions (list of strings): the regions in which to run the cleaner
  • max_ami_age_to_prevent_deletion (number): the age threshold in days; an AMI older than this value can safely be deleted
  • ami_tags_to_check (a map of strings, where each entry has a tag key and a tag value): an image that has the specified tags is a candidate for deletion
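
Before the function touches any resources, it can help to check that the event has the expected shape. The following is only a minimal sketch: validate_event is a hypothetical helper that is not part of the original function, the tag map key is assumed to be ami_tags_to_check (the name used in the sample event later in this post), and rejecting an empty tag map is my own assumption, made because an empty map would put no tag restriction on the search.

```python
def validate_event(event):
    """Return a list of problems found in the cleaner's event payload.

    Hypothetical helper: the key names follow the sample event in this post.
    """
    problems = []

    regions = event.get("regions")
    if not isinstance(regions, list) or not all(isinstance(r, str) for r in regions):
        problems.append("regions must be a list of strings")

    max_age = event.get("max_ami_age_to_prevent_deletion")
    # bool is a subclass of int, so it is ruled out explicitly.
    if isinstance(max_age, bool) or not isinstance(max_age, (int, float)) or max_age < 0:
        problems.append("max_ami_age_to_prevent_deletion must be a non-negative number")

    tags = event.get("ami_tags_to_check")
    # Assumption: an empty tag map is treated as a mistake.
    if not isinstance(tags, dict) or not tags:
        problems.append("ami_tags_to_check must be a non-empty map of tag keys to values")

    return problems


sample_event = {
    "regions": ["us-east-2", "eu-west-1"],
    "max_ami_age_to_prevent_deletion": 7,
    "ami_tags_to_check": {"Environment": "UAT", "Application": "MyApp"},
}
print(validate_event(sample_event))  # an empty list means the event looks fine
```

A handler could call such a check first and send an alert instead of proceeding when the list is not empty.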

Let’s have a look at the helper methods that are used in the Lambda:

1) A method to find AMIs used in autoscaling groups:

import re
import boto3

def imagesInASGs(region):
  amis = []
  autoscaling = boto3.client('autoscaling', region_name=region)
  ec2 = boto3.client('ec2', region_name=region)
  print(f'Checking autoscaling groups in region {region}...')
  paginator = autoscaling.get_paginator('describe_auto_scaling_groups')

  page_iterator = paginator.paginate(
    PaginationConfig={'PageSize': 10}
  )
  # One result per autoscaling group: its in-service instances and their launch templates
  filtered_asgs = page_iterator.search("AutoScalingGroups[*].[Instances[?LifecycleState == 'InService'].[InstanceId, LaunchTemplate.LaunchTemplateId,LaunchTemplate.Version]]")

  for key_data in filtered_asgs:
    # Pull every quoted value out of the stringified result
    matches = re.findall(r"'(.+?)'", str(key_data))
    if not matches:
      continue  # this group has no in-service instances
    if len(matches) < 3:
      # A missing launch template shows up as None, which the regex does not match
      send_alert("AMI cleaner failure", f"Failed to find launch template that was used for instance {matches[0]}")
      # Abort rather than return an incomplete list of AMIs in use
      raise RuntimeError(f"No launch template found for instance {matches[0]}")
    instance_id, template, version = matches[:3]
    print(f"Template found: {template} version {version}")

    launch_template_versions = ec2.describe_launch_template_versions(
      LaunchTemplateId=template,
      Versions=[version]
    )
    used_ami_id = launch_template_versions["LaunchTemplateVersions"][0]["LaunchTemplateData"].get("ImageId")
    if not used_ami_id:
      send_alert("AMI cleaner failure", f"Failed to find AMI for launch template {template} version {version}")
      raise RuntimeError(f"No AMI found for launch template {template} version {version}")
    amis.append(used_ami_id)
  return amis

Here, using boto3, we paginate through the autoscaling groups in a region. We then use the equivalent of an AWS CLI query to get the details of the autoscaling groups that interest us most:

filtered_asgs = page_iterator.search("AutoScalingGroups[*].[Instances[?LifecycleState == 'InService'].[InstanceId, LaunchTemplate.LaunchTemplateId,LaunchTemplate.Version]]")

Each result we get is a nested list. We convert it to a string, and by using this regex: "'(.+?)'" we break the string down into separate variables.
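
To make the regex step more concrete, here is a small sketch that runs it on a made-up value. The instance ID, launch template ID and version below are invented for illustration, and extract_quoted_values is a hypothetical name; only the regex itself comes from the function above.

```python
import re


def extract_quoted_values(key_data):
    """Return every single-quoted value found in the stringified input."""
    return re.findall(r"'(.+?)'", str(key_data))


# Made-up value shaped like one search result: a single autoscaling group
# with one in-service instance, its launch template ID and template version.
key_data = [[["i-0123456789abcdef0", "lt-0abc123def4567890", "3"]]]

matches = extract_quoted_values(key_data)
instance_id, template, version = matches[:3]
print(instance_id, template, version)

# An instance without a launch template yields None values, which are not
# quoted in the string form, so only the instance ID is captured.
print(extract_quoted_values([[["i-0123456789abcdef0", None, None]]]))
```

The second call shows why it is worth checking how many matches came back before indexing into the list.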

After that we use the boto3 EC2 client to look up the AMI ID that the launch template version refers to, and save this value into a list.

2) The next function gets the AMI IDs that are used by running EC2 instances, including those that were not launched by autoscaling:

def imagesUsedInEC2s(region):
  print(f'Checking instances that are not in ASGs in region {region}...')
  amis = []
  ec2_resource = boto3.resource('ec2', region_name=region)
  # Only running instances are taken into account
  instances = ec2_resource.instances.filter(
    Filters=[
      {
        'Name': 'instance-state-name',
        'Values': ['running']
      }
    ])
  for instance in instances:
    amis.append(instance.image_id)

  return amis

3) A method that creates AMI filters in the correct format. We pass in the values as a map(string) in Terraform, and we need to convert them into the filter format that boto3 expects, which is the following:


{
   'Name': 'tag:CatName',
   'Values': [ 'Boris' ]
}

The method itself looks like this:

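def makeAmiFilters(ami_tags):
  # Always restrict the search to images in the 'available' state
  filters = [
    {
      'Name': 'state',
      'Values': ['available']
    }
  ]
  # Add one tag filter per key/value pair of the map
  for key, value in ami_tags.items():
    filters.append({'Name': f'tag:{key}', 'Values': [f'{value}']})
  return filters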
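
As a worked example of this conversion, the sketch below feeds the sample tags from this post through a standalone copy of the helper. The name tags_to_filters is hypothetical and is used only so the snippet runs on its own.

```python
def tags_to_filters(ami_tags):
    """Standalone copy of the filter-building logic, for illustration only."""
    filters = [{"Name": "state", "Values": ["available"]}]
    for key, value in ami_tags.items():
        filters.append({"Name": f"tag:{key}", "Values": [str(value)]})
    return filters


filters = tags_to_filters({"Environment": "UAT", "Application": "MyApp"})
for f in filters:
    print(f)
# {'Name': 'state', 'Values': ['available']}
# {'Name': 'tag:Environment', 'Values': ['UAT']}
# {'Name': 'tag:Application', 'Values': ['MyApp']}
```

When several filters are passed to describe_images, EC2 joins them with AND, so an image has to carry all of the listed tags to be returned.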

4) A function that sends a message to an SNS topic:

import os

# boto3 is imported at the top of the module
sns = boto3.client('sns')

def send_alert(subject, message):
  sns.publish(
    TargetArn=os.environ['sns_topic_arn'],
    Subject=subject,
    Message=message)

5) The main function, or the handler:

from datetime import datetime, timezone

def lambda_handler(event, context):
  amis_in_use = []
  total_amis_deleted = 0
  total_snapshots_deleted = 0
  try:
    regions = event['regions']
    max_ami_age_to_prevent_deletion = event['max_ami_age_to_prevent_deletion']

    # The key matches the event input that is defined in Terraform
    filters = makeAmiFilters(event['ami_tags_to_check'])

    for region in regions:
      # De-duplicate the AMIs used by autoscaling groups and by running instances
      amis_in_use = list(set(imagesInASGs(region) + imagesUsedInEC2s(region)))
      ec2 = boto3.client('ec2', region_name=region)
      amis = ec2.describe_images(
        Owners=['self'],
        Filters=filters
      ).get('Images')
      for ami in amis:
        # CreationDate is a UTC timestamp, so compare it with the current UTC time
        now = datetime.now(timezone.utc)
        ami_id = ami['ImageId']
        img_creation_datetime = datetime.strptime(ami['CreationDate'], '%Y-%m-%dT%H:%M:%S.%fZ').replace(tzinfo=timezone.utc)
        days_since_creation = (now - img_creation_datetime).days

        if ami_id not in amis_in_use and days_since_creation > max_ami_age_to_prevent_deletion:
          ec2.deregister_image(ImageId=ami_id)
          total_amis_deleted += 1

          # Delete the EBS snapshots that backed the deregistered image
          for ebs in ami['BlockDeviceMappings']:
            if 'Ebs' in ebs and 'SnapshotId' in ebs['Ebs']:
              snapshot_id = ebs['Ebs']['SnapshotId']
              ec2.delete_snapshot(SnapshotId=snapshot_id)
              total_snapshots_deleted += 1

    print(f"Deleted {total_amis_deleted} AMIs and {total_snapshots_deleted} EBS snapshots")

  except Exception as e:
    # SNS expects the message to be a string
    send_alert("AMI cleaner failure", str(e))
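
Since the handler deletes things for real, I find it useful to try the selection rule on sample data first. The sketch below is a dry run that makes no AWS calls: ami_age_in_days and select_for_deletion are hypothetical helpers that mirror the condition in the handler, and the image IDs, snapshot IDs and dates are made up.

```python
from datetime import datetime, timezone


def ami_age_in_days(creation_date, now):
    """Whole days between an AMI's CreationDate string and `now` (aware, UTC)."""
    created = datetime.strptime(creation_date, "%Y-%m-%dT%H:%M:%S.%fZ")
    return (now - created.replace(tzinfo=timezone.utc)).days


def select_for_deletion(images, amis_in_use, max_age_days, now):
    """Return (ami_id, [snapshot_ids]) pairs that the cleaner would remove."""
    selected = []
    for image in images:
        if image["ImageId"] in amis_in_use:
            continue  # still used by an instance or an autoscaling group
        if ami_age_in_days(image["CreationDate"], now) <= max_age_days:
            continue  # not old enough to be deleted
        snapshots = [
            mapping["Ebs"]["SnapshotId"]
            for mapping in image.get("BlockDeviceMappings", [])
            if "SnapshotId" in mapping.get("Ebs", {})
        ]
        selected.append((image["ImageId"], snapshots))
    return selected


# Made-up data in the shape of a describe_images response.
now = datetime(2024, 3, 1, tzinfo=timezone.utc)
images = [
    {"ImageId": "ami-old", "CreationDate": "2024-01-01T00:00:00.000Z",
     "BlockDeviceMappings": [
         {"DeviceName": "/dev/xvda", "Ebs": {"SnapshotId": "snap-1"}},
         {"DeviceName": "/dev/sdb", "VirtualName": "ephemeral0"},
     ]},
    {"ImageId": "ami-new", "CreationDate": "2024-02-27T00:00:00.000Z",
     "BlockDeviceMappings": []},
    {"ImageId": "ami-used", "CreationDate": "2023-01-01T00:00:00.000Z",
     "BlockDeviceMappings": []},
]
plan = select_for_deletion(images, {"ami-used"}, 7, now)
print(plan)  # [('ami-old', ['snap-1'])]
```

Only the old, unused image is selected; the recent image and the one that is in use are left alone.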

Infrastructure

A CloudWatch Events rule that triggers on a schedule has the above Lambda function as its target. In this example, the function runs on the first day of every month:

resource "aws_cloudwatch_event_rule" "trigger" {
  name = "${var.name_prefix}-ami-cleaner-lambda-trigger"
  description = "Trigger that fires the Lambda function"
  schedule_expression = "cron(0 0 1 * ? *)"
  tags = var.tags
}
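
The schedule expression has six fields, which can be easy to misread. As a rough illustration, this sketch labels each field of the expression used above. The function name describe_schedule is hypothetical, and the snippet only splits the string; it does not validate the values.

```python
# Field order of an AWS cron schedule expression.
FIELD_NAMES = ("minutes", "hours", "day_of_month", "month", "day_of_week", "year")


def describe_schedule(expression):
    """Map each field of a cron(...) schedule expression to its name."""
    if not (expression.startswith("cron(") and expression.endswith(")")):
        raise ValueError("expected an expression of the form cron(...)")
    fields = expression[len("cron("):-1].split()
    if len(fields) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} fields, got {len(fields)}")
    return dict(zip(FIELD_NAMES, fields))


schedule = describe_schedule("cron(0 0 1 * ? *)")
for name, value in schedule.items():
    print(f"{name}: {value}")
```

Read this way, the rule fires at 00:00 UTC on day 1 of every month; the ? in the day-of-week field means that no specific weekday is required.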

The event target specifies, among other parameters, an input to pass into the Lambda function (the values here are purely examples):

resource "aws_cloudwatch_event_target" "clean_amis" {
  rule = aws_cloudwatch_event_rule.trigger.name
  arn = aws_lambda_function.ami_cleaner.arn
  input = jsonencode({
    ami_tags_to_check = {
      "Environment" = "UAT"
      "Application" = "MyApp"
    }
    regions = ["us-east-2", "eu-west-1"]
    max_ami_age_to_prevent_deletion = 7
  })
}

If you’d like to create a test event for this Lambda function, enter the following into the test event field:

{
  "regions": ["us-east-2", "eu-west-1"],
  "max_ami_age_to_prevent_deletion": 7,
  "ami_tags_to_check": {
    "Environment": "UAT",
    "Application": "MyApp"
  }
}

The function itself needs to have the following Terraform resources defined:

resource "aws_lambda_function" "ami_cleaner" {
  filename = "${path.module}/lambda.zip"
  function_name = "ami-cleaner-lambda"
  role = aws_iam_role.iam_for_lambda.arn
  # The module part of the handler must match the Python file name inside the zip
  handler = "lambda.lambda_handler"
  runtime = "python3.8"
  source_code_hash = data.archive_file.lambda_zip.output_base64sha256
  tags = var.tags

  environment {
    variables = {
      sns_topic_arn = var.sns_topic_arn
    }
  }
}

resource "aws_lambda_permission" "allow_cloudwatch_to_call_ami_cleaner" {
  statement_id  = "AllowExecutionFromCloudWatch"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.ami_cleaner.function_name
  principal     = "events.amazonaws.com"
  # Referencing the rule keeps the ARN in sync with the prefixed rule name
  source_arn    = aws_cloudwatch_event_rule.trigger.arn
}

data "archive_file" "lambda_zip" {
  type        = "zip"
  source_file = "${path.module}/lambda.py"
  output_path = "${path.module}/lambda.zip"
}

Using the archive_file data source in Terraform is convenient because you don’t need to zip the function manually every time you update it.

Lambda IAM Policy

For the Lambda function to perform the described operations on resources, the following IAM actions need to be allowed in the policy:

"ec2:DescribeImages", 
"ec2:DescribeInstances",
"ec2:DescribeLaunchTemplates",
"ec2:DescribeLaunchTemplateVersions",

"ec2:DeregisterImage",
"ec2:DeleteSnapshot",
"autoscaling:DescribeAutoScalingGroups",
"sns:Publish"   

To prevent the function from deleting arbitrary AMIs and snapshots, we can generate Terraform policy statements dynamically so that the policy allows the removal of a resource only if it has a certain tag key and value:

data "aws_iam_policy_document" "ami_cleaner_policy_doc" {
...
  dynamic "statement" {
    for_each = var.ami_tags_to_check
    content {
      actions = [
        "ec2:DeregisterImage",
        "ec2:DeleteSnapshot"
      ]
      resources = ["*"]
      condition {
        test     = "StringLike"
        variable = "aws:ResourceTag/${statement.key}"
        values   = [statement.value]
      }
      effect = "Allow"
    }
  }
}
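
To see what the dynamic block expands to, here is a rough Python sketch that builds one statement per tag, in the shape of an IAM policy document. The function name expand_tag_statements is hypothetical, and the JSON that Terraform actually renders may differ in details such as the statement order or how single-element lists are written.

```python
import json


def expand_tag_statements(ami_tags_to_check):
    """Build one Allow statement per tag, like the dynamic block does."""
    statements = []
    for key, value in ami_tags_to_check.items():
        statements.append({
            "Effect": "Allow",
            "Action": ["ec2:DeregisterImage", "ec2:DeleteSnapshot"],
            "Resource": "*",
            "Condition": {"StringLike": {f"aws:ResourceTag/{key}": [value]}},
        })
    return statements


statements = expand_tag_statements({"Environment": "UAT", "Application": "MyApp"})
print(json.dumps(statements, indent=2))
```

Note that each tag gets its own Allow statement, so a resource that matches any one of them is permitted by the policy, whereas the filters used by the function require all of the tags to match.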

Of course, a lot of the values in Terraform can be set as variables. In this case, we can pass the following values as variables to the AMI cleaner module:

  • tags
  • regions
  • sns_topic_arn
  • ami_tags_to_check
  • max_ami_age_to_prevent_deletion
  • schedule_expression

SUMMARY

Hopefully, this post shows how to do AMI cleanup based on tags in multiple AWS regions. I have learnt a lot from this piece of work, and I hope someone will learn something new about AWS or Terraform too.
