Kubernetes

This post is really best read while the opening score of "2001: A Space Odyssey" is playing…

In the beginning, there were servers. And developers wrote code and deployed it to these servers. These servers were unique. They were ordered to a particular spec: CPU, memory, disk… each application had a custom server.

And, while this was better than not having servers, server management became the bottleneck. The lead time required by Ops teams for a new set of servers could be weeks (months?). Scaling under load was difficult. To cope, teams often kept spare servers sitting idle so they could be quickly reconfigured for additional load; by the time they were, the application was often no longer under stress and the need to scale had passed. Thankfully, we all bought enterprise-grade gear, so we never had to deal with servers failing (hah!).

Yup, the early ages of computing were difficult… we couldn’t scale quickly, we had to commit to lots of hardware, and we had to believe servers didn’t fail.

Virtualization "moved the bottleneck." Libvirt, among others, allowed for easy VM management, and tools like Openstack added additional service level abstractions. No longer did we wait weeks/months for additional servers. We built virtual machines (VM) in minutes (hours?). And it was amazing, when compared to physical servers.

The explosion of virtual machines revealed a new challenge. Setting up a machine by following a wiki (playbook/procedure/recipe) was no longer fast enough. If I could get a VM in minutes, I should be able to configure it and make it useful (everything configured correctly) just as quickly.

Config management tools moved the bottleneck again. Tools like CFEngine, Puppet, and Chef configured a machine via a repeatable process. We would create a VM, then run our configuration management tool to configure it. And this was great. These tools were flexible and testable, and did their best to build identical systems. And it was amazing when compared to manually configuring virtual machines.

Until we realized even that wasn’t fast enough. We didn’t want an entire copy of Ubuntu every time. We didn’t want to care about sshd, ntpd, logrotate, or iptables on every new VM, but we had to. Furthermore, our process of matching workloads to compute resources was wildly inaccurate. We’d provision 8GB machines, and they’d sit idle 95% of the time or consume only 1GB of memory.

Containerization helped to solve this. Docker, for one, wraps most workloads into a standard package for execution… so it’s easy to run. Best of all is how images are tagged: whether I deploy a single instance or hundreds, they’re identical, and the configuration has already been completed. Docker is great and makes deploying the desired code quick. We understood how to deploy multiple containers across machines and manually schedule where they run. And it would be amazing when compared to config-managed VMs.
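
To make that concrete, here’s a minimal sketch using the Docker SDK for Python; the nginx image, tag, and container names are just placeholders, not anything we actually run.

[source,python]
----
# Minimal sketch with the Docker SDK for Python (pip install docker).
# The image, tag, and container names below are placeholders.
import docker

client = docker.from_env()

# Pull one immutable, tagged image...
client.images.pull("nginx", tag="1.25")

# ...and start several identical copies of it. Every container begins from
# exactly the same bits, with its configuration already baked into the image.
containers = [
    client.containers.run("nginx:1.25", detach=True, name=f"web-{i}")
    for i in range(3)
]

for c in containers:
    print(c.name, c.status)
----

The point is the tag: "nginx:1.25" means the same thing on every machine, so one copy or a hundred behave identically.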

It became obvious Docker alone wasn’t sufficient. We needed a tool to manage where containers would be scheduled and to handle their associated requirements. Furthermore, we needed better control over the running containers and their environment. Some containers need to be accessible on the network, load balanced, identifiable by name, etc. And, in a perfect world, we’d like additional features including self-healing services, easy container management (including scaling), and an HTTP RESTful API.

And that’s what we’ve found in Kubernetes (http://kubernetes.io). From their site, "Kubernetes is an open source orchestration system for Docker containers." Built on Google’s experience of running containers at scale, it manages deployments of containers in a declarative manner. Kubernetes works to ensure your cluster’s current state matches the desired state. Containers, nodes, and services may fail, but the system works to heal itself. It is amazing when compared to manually managing containers.
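
As a rough illustration of what "declarative" means here, the sketch below submits a desired state through the official Kubernetes Python client; the deployment name, labels, image, and replica count are all placeholders, and it assumes a working kubeconfig on the local machine.

[source,python]
----
# Minimal sketch with the official Kubernetes Python client
# (pip install kubernetes). Names, labels, and the image are placeholders.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# Declare the desired state: three identical replicas of one tagged image.
deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="hello-web"),
    spec=client.V1DeploymentSpec(
        replicas=3,
        selector=client.V1LabelSelector(match_labels={"app": "hello-web"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "hello-web"}),
            spec=client.V1PodSpec(
                containers=[
                    client.V1Container(
                        name="web",
                        image="nginx:1.25",
                        ports=[client.V1ContainerPort(container_port=80)],
                    )
                ]
            ),
        ),
    ),
)

# Kubernetes records this desired state and continuously reconciles the
# cluster toward it; if a container or node dies, replacements are scheduled.
apps.create_namespaced_deployment(namespace="default", body=deployment)
----

We only state what we want (three replicas of one image); Kubernetes’ controllers do the work of making, and keeping, that true.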

We’re working to use Kubernetes to run our software in Docker. We expect the failure of a single host will not affect any service, and upon failure, we expect our system will return to its desired state (as defined by the number of containers, service definitions, etc.). Additionally, we can easily manage the underlying compute resources by rebuilding them when required. And we expect the memory optimizations containerization yields will result in more densely scheduled machines, which maps directly to cost savings.
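
Continuing the hypothetical "hello-web" example above, here’s how that return-to-desired-state behavior can be observed through the API; same assumptions as before (official Python client, working kubeconfig).

[source,python]
----
# Compare the desired replica count against what the cluster currently
# reports for the placeholder "hello-web" deployment created earlier.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

dep = apps.read_namespaced_deployment(name="hello-web", namespace="default")
desired = dep.spec.replicas
ready = dep.status.ready_replicas or 0

# After a host failure, 'ready' dips below 'desired' briefly, then the
# controller reschedules the missing pods elsewhere and the two converge.
print(f"desired={desired} ready={ready}")
----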

All of this, and the developers seem to like it too! Initially, we’re working to use Kubernetes to host our build system and support dynamic build workers. Eventually, we expect they’ll have CLI (and, therefore, HTTP RESTful) access to control their services in all environments. We’ve considered how to integrate this with our build system and deliver a CI/CD system.
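
For a sense of what that HTTP access looks like, the sketch below lists pods straight off the API server with plain HTTP; the server URL, token, and CA path are placeholders for whatever a real cluster issues.

[source,python]
----
# Rough sketch: the same operations kubectl performs are plain REST calls
# against the API server. URL, token, and CA path below are placeholders.
import requests

API_SERVER = "https://kubernetes.example.com:6443"  # placeholder
TOKEN = "REDACTED"                                   # placeholder service-account token

resp = requests.get(
    f"{API_SERVER}/api/v1/namespaces/default/pods",
    headers={"Authorization": f"Bearer {TOKEN}"},
    verify="/path/to/ca.crt",  # placeholder CA bundle
)
resp.raise_for_status()

for pod in resp.json()["items"]:
    print(pod["metadata"]["name"], pod["status"]["phase"])
----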

Like all new technology, there are some rough areas to avoid. Specifically, running databases and persistent storage still feels a little green, so for services that depend on them we’ll keep using our config-managed VMs. Once we learn more, we expect to contribute back to the community’s experience in solving these problems.

We’re pretty excited about Kubernetes. There is some amazing research and real-world experience behind its design, and we’re looking forward to putting its features to use. We’re hoping to publish more of our findings, open source all of the support/integration code we’re running, push changes upstream, and perhaps present a case study at one of the conferences in the near future. :)

Furthermore, we know we don’t know it all! So, we’ve launched a new Meetup group in Tacoma, WA, and our inaugural meeting is here: http://www.meetup.com/Tacoma-Technical-Operations-Meetup/events/225262224/

We’re hoping to develop a strong local community and help each other advance how we (as a community) manage infrastructure. Come join us!