Airbnb's AWS costs were getting out of hand. Here's how it tackled them

Kubernetes Cluster Autoscaler needed some tweaking...

Ensuring that cloud spending scales automatically with demand, both up and down, is a core priority for anyone running meaningful workloads in the cloud, and a lesson that too many organisations learn the hard and expensive way. There is no shortage of detailed guidance from those who have already learned it, and Airbnb is among those keen to share.

In a detailed new report, Airbnb software engineers Evan Sheng and David Morrison illustrate how they dynamically flex their cloud clusters using the Kubernetes Cluster Autoscaler. The report comes after the company has, in recent years, made a shift that many early cloud adopters are making: moving almost all online services from manually orchestrated AWS EC2 instances to the open source container orchestrator Kubernetes.

In the process, the team has fed a number of improvements upstream into the Kubernetes Cluster Autoscaler that improve how it manages AWS Auto Scaling groups (ASGs).

Like many organisations, Airbnb until recently provisioned each AWS service manually to ensure the necessary compute capacity was available. By aggressively identifying and tackling its highest areas of spend, the company cut $63.5 million in hosting costs in just nine months in 2020, through a combination of robust cultural changes, a pivot to Kubernetes, a rethink of its storage strategy and work with AWS.

(Kubernetes often gets criticised for its complexity. But a lead DevOps engineer on the dating application Hornet, which until recently ran what they described as "very much a 2012-2014-era setup of AWS auto scaling groups and load balancers and launch configuration" -- similar-sounding to Airbnb's previous environment -- told The Stack recently: “I’m a big fan of Kubernetes. If I’m building a whole bunch of custom AWS orchestration stuff, with monkeys pulling levers, when the next guy comes in, I’ve got to teach them how to do all that. If I do everything on Kubernetes when the next guy comes in, I just need to say ‘hey, do you know Kubernetes?’”)

How Airbnb reduced AWS costs with Kubernetes

Today Airbnb runs thousands of nodes (virtual or physical machines) across nearly a hundred clusters to accommodate its demanding workloads. Before the move to Kubernetes, managing cloud capacity was challenging, as its engineers noted in a blog on May 23: "[Previously] each instance of a service was run on its own machine, and manually scaled to have the proper capacity to handle traffic increases. Capacity management varied per team and capacity would rarely be un-provisioned once load dropped..."

Airbnb, which this May reported over 100 million bookings for Q1 alone, was born in the cloud -- starting the platform with what it describes as a "few clicks on the console" back in 2007, when AWS itself was in its infancy. Yet for all its rapid growth, it reported in 2021 that a few years earlier it had "noticed AWS monthly cost growth was outpacing revenue growth" and set a goal to hold infrastructure “costs per night booked” steady.

The company convened cross-functional teams, pulled reams of data and soon had some successes. Among them was a huge effort to tackle storage: "Amazon S3 Storage costs have historically been one of our top areas of spend, and by implementing data retention policies, leveraging more cost effective storage tiers, and cleaning up unused warehouse storage, we have brought our monthly S3 costs down considerably," Airbnb noted last year.

"You need to consider the access pattern for the data along with the file size and number of objects in the S3 bucket, as there can be unexpected costs," its team noted. "Take Glacier, as an example. For each object stored in Glacier, S3 stores an additional 32KB data in “Standard” storage class. So if you store an object to Glacier, with 1 KB in size, S3 will put an extra 32KB in Standard, both charged at corresponding prices. So while Glacier is only 10% the cost of Standard storage class, the total cost can be higher than simply storing the data in Standard."

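To make the arithmetic in that quote concrete, here is a minimal sketch in Python. It uses only the figures Airbnb cites (a 32KB Standard-class overhead per Glacier object, and Glacier at roughly 10% of the Standard price); the prices are relative units chosen for illustration, not real S3 pricing, and the function names are hypothetical.

```python
# Relative prices only: Standard = 1.0 per KB, Glacier = 10% of that,
# mirroring the ratio in Airbnb's quote. Real S3 pricing differs by region and tier.
STANDARD_PRICE_PER_KB = 1.0
GLACIER_PRICE_PER_KB = 0.1
GLACIER_OVERHEAD_KB = 32  # extra Standard-class data stored per Glacier object, per the quote


def standard_cost(object_kb: float) -> float:
    """Cost of keeping the whole object in Standard."""
    return object_kb * STANDARD_PRICE_PER_KB


def glacier_cost(object_kb: float) -> float:
    """Cost of the object in Glacier plus the per-object overhead billed at Standard rates."""
    return object_kb * GLACIER_PRICE_PER_KB + GLACIER_OVERHEAD_KB * STANDARD_PRICE_PER_KB


def break_even_kb() -> float:
    """Object size at which both options cost the same.

    Solves s * P = s * G + O * P for s, giving s = O * P / (P - G).
    """
    return GLACIER_OVERHEAD_KB * STANDARD_PRICE_PER_KB / (
        STANDARD_PRICE_PER_KB - GLACIER_PRICE_PER_KB
    )


if __name__ == "__main__":
    for size_kb in (1, 32, 1024):
        print(size_kb, standard_cost(size_kb), glacier_cost(size_kb))
    print("break-even size in KB:", break_even_kb())
```

Under these assumed ratios, a 1KB object costs about 32 times more in Glacier than in Standard, and Glacier only starts to pay off for objects larger than roughly 35.6KB. That is why the number and size of objects in a bucket matter as much as the headline price of the tier.
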
The company also made significant use of AWS's 2019 "savings plan" -- a pricing model that lets customers save up to 72% on Amazon EC2 and AWS Fargate if they commit to a consistent amount of compute usage over a set term. Airbnb's cloud efficiency journey was a multifaceted one, and the company has described it in some detail. Yet it is the shift to Kubernetes that provides a particularly compelling case study, given how many organisations are making the same journey, and it was a significant factor in how Airbnb reduced AWS costs, the company says.

Airbnb's Kubernetes migration

Airbnb has shifted almost all online services from manually orchestrated EC2 instances to Kubernetes.

Its engineers noted in a detailed May 23, 2022 blog that this evolution -- a major part of how Airbnb reduced AWS costs -- can be split into three stages.

1: Homogenous Clusters, Manual Scaling;

2: Multiple Cluster Types, Independently Autoscaled;

3: Heterogeneous Clusters, Autoscaled.

Its initial deployment of Kubernetes, engineers Evan Sheng and David Morrison note, was relatively simple: "A handful of clusters, each with a single underlying node type and configuration, which ran only stateless online services... [we then] started running containerized services in a multi-tenant environment (many pods on a node). This aggregation led to fewer wasted resources, and consolidated capacity management for these services to a single control point at the Kubernetes control plane. At this stage, we scaled our clusters manually, but this was still a marked improvement over the previous situation."

As the team tried to move more diverse workload types with a wide range of requirements onto Kubernetes, "we created a cluster type abstraction [that defines the] underlying configuration for a cluster, meaning that all clusters of a cluster type are identical, from node type to different cluster component settings... our initial strategy of manually managing capacity of each cluster quickly fell apart.

"To remedy this, we added the Kubernetes Cluster Autoscaler to each of our clusters. This component automatically adjusts cluster size based on pod [a group of containers] requests — if a cluster’s capacity is exhausted, and a pending pod’s request could be filled by adding a new node, Cluster Autoscaler launches one. Similarly, if there are nodes in a cluster that have been underutilized for an extended period of time, Cluster Autoscaler will remove these from the cluster. Adding this component worked beautifully for our setup, saved us roughly 5% of our total cloud spend, and the operational overhead of manually scaling clusters."

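The behaviour the engineers describe can be sketched as two small decisions: how many nodes to add for pods that do not fit, and which nodes have been underutilised long enough to remove. The Python below is a toy model under stated assumptions, not the Cluster Autoscaler's actual code: it tracks a single resource (CPU), and the `Node` class, the function names, the 50% utilisation threshold and the 600-second window are all illustrative choices.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    capacity: float                              # schedulable CPU on the node (toy unit)
    requested: float = 0.0                       # sum of pod requests already placed on it
    underutilized_since: Optional[float] = None  # when utilisation first dropped below threshold


def plan_scale_up(nodes: List[Node], pending_requests: List[float],
                  new_node_capacity: float) -> int:
    """Return how many new nodes are needed to place the pending pod requests."""
    free = [n.capacity - n.requested for n in nodes]
    new_nodes: List[float] = []  # free capacity left on each simulated new node
    for req in pending_requests:
        # Try existing nodes first, then nodes we have already decided to add.
        for pool in (free, new_nodes):
            idx = next((i for i, f in enumerate(pool) if f >= req), None)
            if idx is not None:
                pool[idx] -= req
                break
        else:
            # Only add a node if the pod would actually fit on a fresh one.
            if req <= new_node_capacity:
                new_nodes.append(new_node_capacity - req)
    return len(new_nodes)


def plan_scale_down(nodes: List[Node], now: float, util_threshold: float = 0.5,
                    scale_down_after: float = 600.0) -> List[Node]:
    """Return nodes that have stayed below the utilisation threshold long enough."""
    removable = []
    for node in nodes:
        if node.requested / node.capacity < util_threshold:
            if node.underutilized_since is None:
                node.underutilized_since = now   # start the clock
            elif now - node.underutilized_since >= scale_down_after:
                removable.append(node)
        else:
            node.underutilized_since = None      # busy again, reset the clock
    return removable
```

Calling `plan_scale_up([Node(4, 3.5)], [1, 2, 2, 8], 4)` returns 2 in this model: the first three pods are packed onto two new nodes, and the 8-CPU pod is ignored because no new node of that size could hold it.
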
So far, so good. But the company was running the risk of swapping manual AWS wrangling for Kubernetes wrangling, and freely admits as much: "Our [Kubernetes] cluster types had grown to over 30, and the number of clusters to 100+. This expansion made Kubernetes cluster management tedious..." the two engineers said. They tried to tackle this by creating “heterogeneous” clusters that could accommodate diverse workloads with a single Kubernetes control plane. This meant less configuration testing and, they added, "with Airbnb now running on our Kubernetes clusters, efficiency in each cluster provides a big lever to reduce cost... aggregation of workload types -- some big and some small -- can lead to better bin packing and efficiency, and thus higher utilisation."
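
The bin-packing point can be illustrated with a toy calculation. The sketch below uses first-fit decreasing, a standard packing heuristic chosen here for illustration rather than anything Airbnb says it uses, and the pod sizes and node capacity are invented numbers.

```python
from typing import List


def first_fit_decreasing(requests: List[float], node_capacity: float) -> int:
    """Return the number of nodes needed to place all requests, largest first."""
    free: List[float] = []  # remaining capacity on each node opened so far
    for req in sorted(requests, reverse=True):
        if req > node_capacity:
            raise ValueError(f"request {req} cannot fit on any node")
        for i, remaining in enumerate(free):
            if remaining >= req:
                free[i] -= req   # reuse the first node with enough room
                break
        else:
            free.append(node_capacity - req)  # open a new node
    return len(free)


if __name__ == "__main__":
    big_pods = [5, 5, 5]     # e.g. a few large services
    small_pods = [3, 3, 3]   # e.g. a few small services
    capacity = 8

    separate = (first_fit_decreasing(big_pods, capacity)
                + first_fit_decreasing(small_pods, capacity))
    mixed = first_fit_decreasing(big_pods + small_pods, capacity)
    print("separate clusters:", separate, "nodes; mixed cluster:", mixed, "nodes")
```

With these invented sizes, two single-purpose clusters need five nodes between them, while one mixed cluster needs three, because each small pod fills the gap a big pod leaves behind.
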

If you can't buy it, build it...

Kubernetes Cluster Autoscaler is a Kubernetes component that adds "nodes" to a compute cluster when pending pods cannot be scheduled on existing capacity, and trims them when their utilisation stays below a certain threshold, so that infrastructure is elastic and can dynamically flex around the changing demands of the workloads on top.

The tool maintains a list of node groups (available compute resource) and runs scheduling simulations against pending workloads to determine which groups could accommodate them. It passes these candidates to a component called the Expander, which chooses which node group to expand based on a user-specified tiered priority list. Finding that Kubernetes' default expanders "were not sophisticated enough to satisfy our more complex business requirements around cost and instance type selection", the Airbnb team built a custom expander implemented as a gRPC client and server, and contributed two other key improvements to the Kubernetes Cluster Autoscaler:

  1. "Early abort for AWS ASGs with no capacity: Short circuit the Cluster Autoscaler loop to wait for nodes it tries to spin up to see if they are ready by calling out to an AWS EC2 endpoint to check if the ASG has capacity. With this change enabled, users get much more rapid, yet correct scaling. Previously, users using a priority ladder would have to wait 15 minutes between each attempted ASG launch, before trying an ASG of lower priority.
  2. "Caching launch templates to reduce AWS API calls: Introduce a cache for AWS ASG Launch Templates. This change unlocks using large numbers of ASGs, which was critical for our generalized cluster strategy. Previously", the two noted this month, "for empty ASGs (no present nodes in a cluster), Cluster Autoscaler would repeatedly call an AWS endpoint to get launch templates, resulting in throttling from the AWS API..."
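
The effect of the first change on a priority ladder can be sketched in a few lines of Python. This is a simplified model, not the Cluster Autoscaler's implementation: `has_capacity` is a hypothetical stand-in for the EC2 capacity check the engineers mention, the ASG names are invented, and the only figure taken from the source is the 15-minute wait between attempts.

```python
from typing import Callable, List, Optional, Tuple

WAIT_PER_FAILED_ASG_S = 15 * 60  # the 15-minute wait cited in the quote above


def walk_priority_ladder(asgs: List[str], has_capacity: Callable[[str], bool],
                         early_abort: bool) -> Tuple[Optional[str], int]:
    """Walk ASGs from highest to lowest priority.

    Returns the first ASG that can launch nodes (or None) and the number of
    seconds spent waiting on ASGs that could not.
    """
    waited = 0
    for asg in asgs:
        if has_capacity(asg):
            return asg, waited
        if not early_abort:
            # Without the early abort, the failure is only discovered
            # once the launch attempt times out.
            waited += WAIT_PER_FAILED_ASG_S
    return None, waited


if __name__ == "__main__":
    ladder = ["preferred-a", "preferred-b", "fallback"]  # hypothetical ASG names
    only_fallback = lambda name: name == "fallback"
    print(walk_priority_ladder(ladder, only_fallback, early_abort=False))
    print(walk_priority_ladder(ladder, only_fallback, early_abort=True))
```

In this model, two exhausted ASGs ahead of a working one cost 30 minutes of waiting without the early abort and none with it, which matches the "much more rapid" scaling the engineers describe.
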

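The second change, caching launch templates, follows a familiar pattern: remember the result of an expensive lookup so that repeated passes over many empty ASGs do not hit the API each time. The sketch below is an assumption-laden illustration rather than the upstream code. The source says only that a cache was introduced, so the time-to-live expiry, the `LaunchTemplateCache` class and the `fetch` callback (a stand-in for the AWS call) are all hypothetical.

```python
import time
from typing import Any, Callable, Dict, Tuple


class LaunchTemplateCache:
    """Cache launch template lookups per ASG, refetching after ttl_s seconds."""

    def __init__(self, fetch: Callable[[str], Any], ttl_s: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch    # hypothetical stand-in for the remote API call
        self._ttl_s = ttl_s    # assumed expiry policy, not taken from the source
        self._clock = clock    # injectable so the expiry can be tested
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, asg_name: str) -> Any:
        now = self._clock()
        entry = self._entries.get(asg_name)
        if entry is not None and now - entry[0] < self._ttl_s:
            return entry[1]                  # fresh enough: no API call
        value = self._fetch(asg_name)        # missing or stale: call out once
        self._entries[asg_name] = (now, value)
        return value


if __name__ == "__main__":
    api_calls = []

    def fake_fetch(name: str) -> dict:
        api_calls.append(name)
        return {"asg": name}

    cache = LaunchTemplateCache(fake_fetch)
    for _ in range(100):                     # e.g. 100 autoscaler loop iterations
        cache.get("empty-asg")
    print("API calls made:", len(api_calls))
```

In this toy run, a hundred lookups for the same empty ASG produce a single remote call, which is the kind of reduction that avoids API throttling when the number of ASGs is large.
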
It may not have been the easiest ride, and it has thrown up several engineering challenges. But having the largest portion of compute at Airbnb on a single platform has "provided a strong, consolidated lever to improve efficiency", the team notes, and it has been using its custom adjustments to the K8s autoscaler to "scale all of our clusters without issues since the beginning of 2022". More broadly, the company says that it has seen "a profound cultural change toward cost awareness and management". Its finance team created a company-wide award for financial discipline, presented by the CFO, which recognized employees who had driven important cost savings initiatives, and its infrastructure team has held cost savings hackathons. Airbnb's "AWS Attribution Dashboard", meanwhile, became the most viewed dashboard at Airbnb and has since remained among the most viewed, the company said last year, describing the cost savings efforts as "a new muscle that we will only strengthen with time."
