Find out this one cool tip for a 70% savings off your infrastructure costs!

I love clickbait titles – at the very least they seem to work for getting code reviews picked up right away, and they apparently work for all that spam people get.


But seriously, at Spectrum we do everything we can to keep costs low while delivering a fantastic product to our clients. One of the ways we do this is by using Infrastructure as Code (IaC) and spot instances in AWS. IaC is an important philosophy and tool because we can build reliable and repeatable systems across multiple environments. It also lets us push changes across our fleet quickly without needing a bunch of people to build out additional resources for our systems.


And, while that is indeed an amusingly clickbaity title, we really are saving 70% on some of our infrastructure.


The way we're achieving this is by using Terraform to build out Kubernetes (AWS EKS) and running worker clusters built on Auto Scaling Groups with spot instances. How to build and deploy all of this via Terraform will come in a later post, but that isn't required to realize the savings.


Around November of 2018, Amazon released the ability to run mixed Auto Scaling Groups with both spot instances and on-demand instances. This change makes it amazingly easy to reduce the cost of your development environments and, depending on what your production environment looks like, capitalize on fantastic gains there as well.


It really helps if you have designed a horizontally scalable application and system – that's not strictly a requirement for a development environment, but it effectively is for a production environment. This is what one of our development environment worker groups running in Kubernetes looks like:

resource "aws_autoscaling_group" "eks_main_asg" {
  desired_capacity    = 10
  max_size            = 12
  min_size            = 8
  name                = "asg-$"
  vpc_zone_identifier = ["$"]

  mixed_instances_policy {
    instances_distribution {
      on_demand_percentage_above_base_capacity = 0
      spot_instance_pools                      = 1
      on_demand_base_capacity                  = 0
    }

    launch_template {
      launch_template_specification {
        launch_template_id = "$"
        version            = "$$Latest"
      }

      override {
        instance_type = "c4.xlarge"
      }

      override {
        instance_type = "c4.2xlarge"
      }

      override {
        instance_type = "m5.xlarge"
      }

      override {
        instance_type = "m5a.xlarge"
      }

      override {
        instance_type = "m4.2xlarge"
      }
    }
  }

  tag {
    key                 = "Name"
    value               = "asg-$"
    propagate_at_launch = true
  }

  tag {
    key                 = "kubernetes.io/cluster/$"
    value               = "owned"
    propagate_at_launch = true
  }
}

There are a couple of cool things going on here.


You can see above that we asked for 10 EC2 instances for our cluster, and we also gave the group room to launch a new instance before tearing down an existing one. The mixed instances policy is the new and fun part. The "on_demand_base_capacity" setting lets you define how many stable on-demand instances you need running at all times; the right value will depend greatly on your use cases. The other really cool bit is "on_demand_percentage_above_base_capacity". We set it to 0 here to keep this example simple (making the whole group spot), but this setting lets your ASG scale out while maintaining a set percentage of your fleet as on-demand instances. This helps ensure a certain stability to your environment if your spot instances get ripped away.
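For a more production-minded distribution, you might keep a guaranteed on-demand base and split the rest. The values below are purely illustrative, not our actual settings:

```hcl
instances_distribution {
  # Always keep 2 on-demand instances as a stable base.
  on_demand_base_capacity                  = 2

  # Of any capacity above that base, run 25% on-demand
  # and the remaining 75% on spot.
  on_demand_percentage_above_base_capacity = 25

  # Spread spot requests across 2 instance-type pools.
  spot_instance_pools                      = 2
}
```

With a desired capacity of 10, that works out to 2 base on-demand instances plus 25% of the remaining 8 (2 more on-demand), for a total of 4 on-demand and 6 spot instances.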


So, the whole point here is to reduce costs, and one of the tools available to us is overrides. If our main instance type (c4.xlarge) isn't available at the price we want on the spot market, the ASG can try other instance types with the appropriate resources. Maybe a different instance type will be available in the spot market at the price we want. (Side note: if we do not set the "spot_max_price" variable, it defaults to the on-demand price of the instance type being launched.)
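If you do want a hard ceiling, "spot_max_price" goes in the same "instances_distribution" block. The dollar amount below is a made-up example, not a recommendation:

```hcl
instances_distribution {
  on_demand_percentage_above_base_capacity = 0
  spot_instance_pools                      = 1

  # Hypothetical cap: never pay more than $0.10/hour for spot
  # capacity. Instance types whose spot price exceeds this simply
  # won't launch. Omit this to default to the on-demand price.
  spot_max_price                           = "0.10"
}
```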


That is all great in theory – but how stable is this in the real world? We certainly don’t want our engineers sitting around waiting for their instances to spin up, or having their work fail because an instance was ripped away.


The stability and duration of spot instances depend on many variables: What region are you in? What instance types are you trying to request? Is AWS going through an upgrade or patch cycle? All that being said, some of our spot instances have persisted for MONTHS! That's one heck of a stable spot instance! Even if some of our spot instances were ripped away in the middle of a workday, it is unlikely our engineers would notice. We run load balancers in front of our applications, and it takes less than 90 seconds for a new EC2 instance to launch, join the Kubernetes cluster, and start serving pods.


Hopefully, this helped demonstrate a great way to reduce some of your infrastructure costs – and if you aren't using mixed instance Auto Scaling Groups today, give them a try.

If you have any questions or comments, feel free to reach out.