Configuring and using AWS EKS in production - round 2 #kubernetes #aws

Share on:

it's been some weeks now that our migration to Amazon EKS at work) is completed and the clusters are in production. I have written a brief in the past on some major points, you can find it here. With some extra confidence while the system is serving real traffic I decided to come back for a more concrete and thorough list of steps and a set of notes I gathered through this journey. Obviously there are several companies out there that have been using Amazon's Kubernetes service, so this post aims to be just another point of reference for EKS migration and adoption uses cases.

Platform - a web platform

The overall platform is a powering a website (e-store), the EKS clusters operate on a active-active mode, meaning they share load and are utilized accordingly based on weighted load-balancing. Cluster load balancing - if we can call it that way is performed on the edge , so no kubernetes federation concepts for the time being. The total amount of accumulated compute in terms of CPUs, is somewhere between 400-600 cores (depending on the load). The total amount of microservices powering the platform are in the range of 20-30, mostly Java payloads and a mix of node (Js Node based). The platform is in expanding state system entropy is increasing by adding more pieces to the puzzle in order to cover more features or deprecate legacy /older systems.

The website is serving unique page views in the range of half of million daily (accumulated 15 markets - across Europe, UK and APAC), traffic is highly variable due to the nature of the business. On days where artists are onsale, or announce new events, the traffic spikes are contributing to somewhat a 50-70% more unique page renders compared to a non-busy day. The platform is also subject and target of unforeseen (malicious?) traffic, scraping the whole range of public APIs or attacking certain areas.

The infrastructure powering the above site should provide:

  • elasticity - shrink and grow based on demand - also offer the ability to do that based on manual intervention, on cases where we do know beforehand when we are going to have surges.
  • stability - always available always serve pages and API responses
  • Toleration on failures, usually having in mind potential outages on different AWS a.z or whole regions.
  • Cost effectiveness, reduce the operation cost over time (AWS usage cost)
  • Secure
  • Fairly open to development teams. Deploying and understanding kubernetes is a developer team concern and not an exotic operation, for a separate team.

Kubernetes

Kubernetes was already for 2+ years our target deployment platform. The only thing that changed over time is the different tools used to spin new clusters. We already had operational experience and faced several challenges with different versions and capabilities of kubernetes throughout the time. Despite the challenges, adopting kubernetes is considered a success. We never faced complete outages, the clusters and the concepts implemented never deviated from what is stated on the manual (we did gain elasticity, stability, control over the deployment process and last but not least - adopting kubernetes accelerated the path to production and delivery of business value.

Never before developers had such a close relationship with the infrastructure, in our case. This relationship developed over time and was contributing to increased awareness between 2 split concerns, the side that writes software, and the side operating and running the code in production. The biggest win was mostly the process of empowering developers of being more infrastructure aware - which slowly leads to potentially improvements on the way software is developed. Obviously the same concepts apply to any team and any cloud centric initiative. Abstracting infrastructures concerns lowers the barrier of morphing a traditional developer which was completely disconnected from the operations to this world to. After that, sky is the limit in terms of digging deeper to the details and obviously understanding more about the infrastructure. This process requires time and people that are willing to shift their mindset.

EKS why?

The first obvious answer is because AWS. If AWS is your main cloud, then you continuously try to leverage as much as possible the features of your cloud, unless you are on a different path (for example you want cloud autonomy hedging by mixing different solutions, or you think you can develop everything on you own, if you can afford it). The integration of EKS with the AWS world has matured enough where you can enjoy running a fairly vanilla setup of Kubernetes (not bastardised) and behind the scenes take advantage of the integration glue offered by AWS/ESK to the rest of the AWS ecosystem.

The second answer is cluster upgrades and security patches. Before EKS we had to engage with the specifics of the different tools (installers) when new versions came along. In many cases especially if your cloud setup has custom configuration trying to fit clusters on environments with custom networking or special VPC semantics was getting more and more challenging. Despite engaging on cluster updates in the past, the risk involved was getting bigger and bigger and we soon faced the usual dilemma many people and companies are facing (many don't want to admit) - if you want upgrade an existing cluster just ditch it and create a new one. While being a solution, that involved a lot of extra work from our side, re-establishing our platform on top of new clusters. Obviously there is more work for us to many the platform migration more automated.

The third answer is the update policies of EKS. If you want to play by the rules of EKS, you will get your masters auto upgraded on minor revisions, and you will be gently pushed to engage on upgrading your clusters to major versions. Despite still having the option to sit back and do nothing, this model encourages and accelerates the development of automation to be in place for cluster updates. it's a matter of confidence as well - the more often you upgrade and control the upgrade process the more confident you become.

The team

2 people. The most important thing on this setup is not the size of the team (2) but the mix of skills. Since we want to be as close as possible to the actual needs of the developers ultimately serve the business, we realised that changes like that can not happen in a skill vacuum. You can not configure and spin infrastructure thinking only as a developer but the same time you can not build the infrastructure where developers will evolve and create a platform having in mind only the operational side of things. You need to have both, when developers are not educated enough on things like infrastructure security or performance or thorough monitoring Ops skills and expertise will provide all the above and educate at the same time so next time they improve.

On the other side, when the infrastructure is not easily consumed by developers, not accessible or there is an invisible barrier that disconnects the software maker from its system in production - this is where a developers point of view can help on finding the middle ground. Iteration and progressive changes is an area where software developers often do better compared to other functions.

This is one of the most taboo things in the market currently where both sides fight for control and influence. I am not sure what is the correct definition of DevOps but in my mind this journey was a DevOps journey and I wish I will be able to experience it in other places as well through out my career. Combine skills within the team and encourage the flow of knowledge instead of introducing organization barriers or bulkheads.

Side concern - EKS worker networking

Since this was our first time adopting EKS, we decided that the safest and more flexible approach was to fully adopt the AWS CNI networking model. This was a great change compared to our previous clusters that were heavy on overlay networking. Pods now are much easier to troubleshoot and identify networking problems - since they have routable IPs. See here. Following the vanilla approach will raise concerns about VPC CDIR sizes, we opted for a clean solution isolating our clusters from shared VPCs and starting fresh and clean, new VPCs with a fairly big range.

In cases where secondary -hot- IPs are starting to run out, or you are limited y the capabilities of your workers (Num of ENI) See here. Also nice read here.

Tools

Our main goal was not to disrupt the workflows and semantics of the existing development teams, and make our EKS clusters look kind of the same as our existing clusters. This does not mean that th our existing setup was perfect or we did not want to modernise. Again the no1 priority was the clusters should be serving the needs of the teams deploying services on top of them and not our urge to try new technologies all the time. Obviously lots of stuff will be new and different but config changes and change of tooling should be introduced iteratively. The basic flow was the following:

  1. Create the clusters and establish the clusters
  2. Introduce more or less the same semantics and configs - make it easy for teams to move their payloads (apps)
  3. Stabilize
  4. Gradually educate and start introducing more changes on top of the clusters, either these are like new policies, new ways of deployments or new rules enforced. First priority is developer productivity with a fine balanced on good practises and obviously keeping things simple.

In order to setup / upgrade and configure the clusters we came up with a solution that uses the following tools

The workflow is the following:

  • Use Packer if you want to bake a new worker AMI (if needed or else skip)
  • Plan and Apply the terraform stack that controls the state of masters and the workers auto-scaling groups, IAM and other specifics so that the cluster is formed. We have our own terraform module even though now the reference EKS model found here is pretty solid.
  • Start invoking kubectl or helm after the cluster is formed to install some basic services.

Installing services on top of the cluster

Once the cluster is up AWS wise, meaning the masters can talk to various worker nodes, we deploy and configure the following components on top.

  1. Install helm (Tiller)
  2. Configuring aws-auth based on our RBAC / AWS roles to enable access to users - kubectl patch
  3. Install metrics-server (modifed helm chart)
  4. Install the aws cluster-autoscaler (helm chart)
  5. Install kubernetes-dashboard (helm chart)
  6. Install prometheus / kube-state-metrics (helm chart)
  7. Install fluentd-bit deamons set (preconfigured to ship logs to E.S) (helm chart)
  8. Install or modify correct versions for kube-proxy see here
  9. Install or modify correct versions for aws-cni see here
  10. Install of modify correct version for CoreDNS +scale up coreDNS
  11. Scale up coreDNS
  12. Create or update namespaces
  13. Install - ambassador -proxy on certain cases - hybrid Ingress.
  14. Populate the cluster and specific namespaces with secrets - already stored on Vault

Overall the whole orchestration is controlled by Terraform. Structure changes to the cluster e.g. worker nodes size, scaling semantics etc are updated on the terraform level. Some of the helm charts indicated above are dynamically templated by terraform during provisioning - so the helm charts being applied- already are in sync and have the correct values. The idea is that terraform vars can be passed as variables to individual kubectl or helm invocations - the power and simplicity of local_exec and the bash provisioner see here.

Auto-scaling groups and worker segmentation

Back the actual cluster setup and a very important point the auto-scaling groups, spinning the workers of the clusters. There are several patterns and techniques and by googling relevant material on the internet you will find different approaches or advices.

We opted for a simple setup where our workers will be devided into 2 distinct groups (autoscaling groups/ launch templates).

  • system - workers : We will be installing kube-system material on these workers which will be always of lifecycle type: OnDemand or Reserve instances. Payloads like prometheus, cluster autoscaler, the coredns pods or sometimes the Ambassador Proxy (if we choose too).
  • normal - workers: Will be hosting our application pods on the various namespaces. This is the asg that is expected to grow faster in terms of numbers.

The above setup on terraform - has to be reflected and mapped to one kubernetes we have defined above - the aws ** cluster autoscaler**.

1- --namespace=kube-system
2- --skip-nodes-with-local-storage=false
3- --skip-nodes-with-system-pods=true
4- --expander=most-pods
5- --nodes={{.Values.kubesystemMinSize}}:{{.Values.kubesystemMaxSize}}:{{.Values.kubesystemAsgName}}
6- --nodes={{.Values.defaultMinSize}}:{{.Values.defaultMaxSize}}:{{.Values.defaultAsgName}}

The above setup - requires a minimal convention our application helm charts. Introduce 2 node affinity or node selectors rules. Currently the easier way is through nodeSelector even though they will be deprecated.

Spot instances (bring that cost down!) By being able to decouple the Kubernetes side of things (through the cluster autoscaler configs) and the AWS side, especially since we are using terraform - we now had the flexibility to experiment with Spot instances. Our main goal was to make the use of spot instances transparent as much as possible to the people deploying apps on the cluster, and make it more of a concern for cluster operators. Obviously, there is still a wide concern /change that all involved parties should be aware. Increasing the volatility of the cluster workers, meaning by running payloads on workers that may die within a 2 minute notice, introduces challenges that is good that people writing services on these clusters should be aware of.

Spot instances can be added in the mix using a setup of 2 auto-scaling groups, assuming you use the correct launch template and mixed instance policies. Many people decide to group their workers in in more than 2ASGs, for example instead of 2 you could have 5 or 10, where you can have more granular control of the EC2/classes utilized and their life cycle. Also you could target parts of your pods / apps to specific groups of workers based on their capabilities or lifecycle.

In general the more fine grained control you want and the more you want to hedge the risk of Spot termination the more you will lean towards the following strategies or choices.

  • Segment your workers into different capability groups (spot/OnDemand/Reserved single or multiple classes/mixed instance policies
  • Increase the average number of pods on each replica set so that you hedge the risk of pods of the same replica set (deployment) land on the same type of workers that potentially can be killed at the same time.
  • More stateless less stateful. In way your platform can be able to recover of sustain suffer micro or minor outages of Compute/Memory. The more rely on singleton services or centralized resources the more you are going to hedge random outages.

Spot instances mean reduced prices but also termination notification. When thinking about termination the current pattern you need to consider 3 factors

  1. AWS Region (eu-west-1)
  2. AWS availability (eu-west-1a,eu-west-1b.)
  3. Class (m4.xlarge)

The above triplet is usually the major factor that will affect the spot price of class in general. The current strategy is that your payloads (pods/containers) need to obviously spread as effectively as possible

  • Region : Thus more than one cluster
  • AZ: Your ASG should spin workers on ALL the available zones that the region offers.
  • Class: if you ASG is single class - your chances this class to be subject of random spot termination and affecting your clusters is higher than using a bigger list of classes.

The general idea is to hedge your risk of spot instance termination by running your workloads - multi region/ multi asg / multi class. There is still some risk involved - e.g. AWS massively retiring at the same time - spot resources - or rapidly changing the prices.

This is a very tricky area and settings on the ASG can help you hedge a bit more on this - for example if you have hard rules on your price limit's the ASG can respect that, for example rules like don't bid beyond this price for a single spot resource. The more you make the ASG / launch template strict controlling your cost estimate - the bigger the risk to suffer outages because of this hard limit and a sudden change on the price.

The most flexible approach is to let the ASG pick the _lowest-price for you so you can be sure that it will do it's best to find the next available price combination to feed your cluster with compute and memory.

In terms of spreading your pods around to different workers I think the simplest advice is not put all your eggs on a single basket. Pod Affinity/AntiAffinity rules is your no1 tool in these cases + labels on your nodes. You can find a very nice article here.

Last but not least. When termination of spot instances do happen, it is more than important to be able to react on the cluster level, so that these worker terminations don't make the cluster go crazy. The more concurrent terminations happen the bigger the risk you will see big waves of pod movement among workers and az. Kubernetes will try to balance and stuff pods into the remaining resources and obviously spin new resources, but it really depends much can can tolerate these movements and also to control how the re-scheduling of pods happen. In this area another useful tool available for you, are the kubernetes pod disruption budgets where can act as an extra set of rules the kubernetes masters - will take into account when it's resource availability is in flux (meaning workers are coming and going).

On top of that in order to gracefully handle these terminations - that actually happen with a 2 minute notice, daemonsets like this (spot termination handler) will easy the pain + offer more visibility . The daemon once the spot instance receives the termination event, will gracefully drain your node, which in turn mark the worker as not ready to receive and schedule workloads, which in turn will kick a scheduling round where kubernetes will try to place the pods on other workers if there is enough space or kill new workers. Eventually the system will try to balance and satisfy your setup configs and demands - but it really depends on the amount of concurrent terminations you will have and how your pods are spread around these workers.

The bigger the spread the less the impact. Also you can also consider a mixed policy where certain workers are always on demand and the rest are spot - so that you can hedge even more, more intense spot instance termination events.

Cluster upgrade concerns and workflow

Cluster updates require some work in terms of coordination + establishing a process. There are 3 cases:

  • No EKS or kubernetes versions updates - only modifications on the components installed on top of the clusters, for example you want to update fluentd-bit to a newer version.
  • Minor EKS update (auto mode) that needs an EKS AMI update, bringing your workers in the same version state.
  • Major EKS update (kubernetes upgrade for example from 1.12 to 1.13) - that will require an AMI update + some aws EKS components updated.

The third case is the most challenging one, because not only you need to bake a new AMI based on the reference provider by AWS, you also need to follow the conventions and versions of components as defined here:

  • core-dns
  • kube-proxy
  • AWS CNI plugin update.

This means that prior to engaging on updates you need to update your config scripts, in our case the terraform variables, so that when the new AMI makes it to production and we have the core of the cluster setup, to be able to update or re-install certain components. Always follow this guide.The documentation by AWS is pretty solid.

AWS API throttling and EKS

  • The AWS masters are a black box for you as an end user, but is highly recommended that by default you have their CloudWatch logs enabled. This is was huge improvement for us, compared to our previous clusters. Master logs are isolated and easily searchable so we avoid the noise of filtering or searching big amount of logs. Also, check this very nice utility that is usually referenced in many support cases the EKS logs collector.
  • The masters as every other component of EKS do leverage the AWS API to make things happen. This applies for everything that runs on AWS. What you need to be aware is that if you are operating on busy centralized AWS accounts, there is always a quota on the API calls issued from different components (EC2/etc). Your EKS masters are chatty as well, and the API calls issued by them will be counted and billed as the rest of the calls on your account (they are not free and they contribute to the quota). This means that when and if AWS API throttling happens on your accounts - your EKS clusters can be affected as well, so make sure you have appropriate monitoring in place to check when this happens. If throttling happens for large amounts of times - the bigger the risk internal components of EKS fail to sync or talk to each other - which means that overall the cluster may start to report random errors that sometimes can not be correlated. This is a tricky one and I really hope AWS changes the policy on this for the EKS masters and shields them from API throttling that may happen on the account. The other solution is to box_ your clusters into specific accounts and not put all your stuff on a single account with a single API quota.
  • AWS recently introduced this service - I guess is going to help a lot on visibility and proactively avoiding problems

Overall

  • Obviously our platform is still in flux and changes occur and will happen throughout the time. The same applies for EKS as a product, over time you see changes and updates from AWS, a very positive sign since you can see that AWS is invested on this product and with every major kubernetes update, EKS evolves as well. Another positive thing is the quality of support from AWS, there are several times where we had to double check cases with AWS support stuff and I have to admit the resolution and answers provided were very thorough.

  • As I have said in the past, I think that at some point AWS will decide to complete the integration journey for its users and provide a turn key solution where configuration of the cluster will be automated end to end (masters, workers, plugins and setup). Lets see.