KubeVirt Status 0x7E1 (2017)

Where do we stand with KubeVirt, the virtual machine management add-on for Kubernetes, at the end of the year?

We gained some speed again, but before going into what we are doing right now, let’s take a look at the past year.

Retro

The year started with publishing KubeVirt. The first few months flew by, and we were mainly driven by demoing and PoCing parts of KubeVirt. Quickly we were able to launch machines, and even to live migrate them.

We had the chance to speak about KubeVirt at a couple of conferences: FOSDEM 2017, devconf.cz 2017, KubeCon EU 2017, and FrOSCon 2017. Our work was mainly about sorting out technical details and understanding how things could work - in many areas, with many open ends.

At that time we had settled on a CRD-based solution, but were aiming at a user API server (to be used with API server aggregation), a single libvirtd in a pod including extensions to move the qemu processes into VM pods, and storage based on PVCs and iSCSI leveraging qemu’s built-in drivers. We were also struggling to get a nice deployment story - to name just a few of the big items we were looking at.

Low-level changes and storage

Around the middle of the year we saw that all the features we would like to consume - raw block storage for volumes, the device manager, resource classes (now delayed) - were heading into Kubernetes 1.9. But at the same time we started to notice that our current implementation was not suited to consume those features. The problem we were facing was that the VM processes lived in a single libvirtd pod, and were only placed into the cgroups of the VM pods to allow Kubernetes to see the VM resource consumption. Because the VMs were in a single libvirtd pod from a namespace perspective, we weren’t able to use any feature that was namespace related. For example, raw block storage for volumes is a pretty big thing for us, as it will allow us to directly attach a raw block device to a VM. However, those block devices are exposed to pods - the VM pods (the pods we spawn for every VM) - and up to now our VMs did not see the mount namespace of a pod, thus it was not possible to consume raw block devices. It’s similar for the device manager work: it will allow us to bring devices to pods, but we are not able to consume them, because our VM processes live in the libvirtd pod.
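To make the namespace problem more concrete, here is a minimal sketch (not part of the original tooling) that compares the mount namespaces of two processes via /proc. A qemu process spawned inside the libvirtd pod and a process inside the corresponding VM pod would show different namespace links, which is exactly why qemu cannot see block devices or device plugin devices mapped into the VM pod. The PIDs are placeholders.

    #!/usr/bin/env python3
    """Illustration of the namespace problem described above: if qemu
    (living in the libvirtd pod) and the VM pod's processes are in
    different mount namespaces, qemu cannot see volumes or devices that
    Kubernetes maps into the VM pod. The PIDs are placeholders."""
    import os

    def mount_ns(pid: int) -> str:
        # /proc/<pid>/ns/mnt is a symlink like "mnt:[4026531840]";
        # two processes share a mount namespace iff the link targets match.
        return os.readlink(f"/proc/{pid}/ns/mnt")

    qemu_pid = 4242    # hypothetical qemu process in the libvirtd pod
    vm_pod_pid = 4343  # hypothetical process inside the VM pod

    if mount_ns(qemu_pid) == mount_ns(vm_pod_pid):
        print("same mount namespace: qemu sees the pod's volumes and devices")
    else:
        print("different mount namespaces: raw block devices mapped into "
              "the VM pod are invisible to qemu")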

And it got even worse: we spent a significant amount of time on finding workarounds to allow the VM processes to see the pods’ namespaces, in order to consume those features.

All of this was due to the fact that we wanted to use libvirtd as intended: a single instance per host with a host-wide view.

Why did we do this initially - using libvirtd? Well, there were different opinions about this within the team. In the end we stuck to it, because there are some benefits over using qemu directly, mainly API stability and support when it comes to things like live migration. Furthermore, if we did use qemu directly, we would probably end up building something similar to libvirt - and that (the node-level virtualization) is not where we want to focus.

We engaged internally and publicly with the libvirtd team and tried to understand how the future could look. The bottom line is that the libvirtd team - or parts of it - acknowledged the fact that the landscape is evolving, and that libvirtd wants to be part of this change - also in its own interest to stay relevant. This happened pretty recently, but it’s an important change which will allow us to consume Kubernetes features much better. It should also free up time, because we will spend much less of it on finding workarounds.

API layer

The API layer was also exciting.

We spent a vast amount of time writing our own user API server to be used with API server aggregation. We went with this because it would have provided us with full control over our entity types and HTTP endpoints, which was relevant for providing access to the serial and graphical consoles of a VM.

We worked on this based on the assumption that Kubernetes would provide everything needed to make it easy to integrate and run such custom API servers.

But in the end these assumptions were not met. The biggest blocker was that Kubernetes does not provide a convenient way to store a custom API’s data. Instead it is left to the API server to provide its own mechanism to store its data (state). This sounds small, but it is annoying: this gap increases the burden on the operator to decide where to store the data, and it eventually adds a dependency on a custom etcd instance or on a PV in the cluster. In the end all of this might be available, but it was a goal of KubeVirt to be an easy-to-deploy add-on to Kubernetes. Taking care of these pretty cluster-specific data storage problems took us too much off track.

Right now we are re-focusing on CRDs. We are looking at using the JSON schema based validation which landed pretty recently, and we are trying to get rid of our subresources to remove the need for a custom API server.
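To illustrate where this is heading, here is a minimal sketch of a CRD that uses the new JSON schema based validation (apiextensions.k8s.io/v1beta1, as it looks around Kubernetes 1.8/1.9). The group, kind, and validated fields are illustrative assumptions, not necessarily KubeVirt’s actual schema.

    #!/usr/bin/env python3
    """Minimal sketch: a CRD with openAPIV3Schema validation, so the
    API server itself validates custom objects - no custom API server
    needed for that part. Group/kind/fields are illustrative only."""
    import json

    crd = {
        "apiVersion": "apiextensions.k8s.io/v1beta1",
        "kind": "CustomResourceDefinition",
        "metadata": {"name": "virtualmachines.kubevirt.io"},
        "spec": {
            "group": "kubevirt.io",
            "version": "v1alpha1",
            "scope": "Namespaced",
            "names": {
                "kind": "VirtualMachine",
                "plural": "virtualmachines",
                "singular": "virtualmachine",
            },
            # Objects that do not match this schema are rejected by the
            # API server on create/update.
            "validation": {
                "openAPIV3Schema": {
                    "properties": {
                        "spec": {
                            "required": ["domain"],
                            "properties": {
                                "domain": {"type": "object"},
                            },
                        }
                    }
                }
            },
        },
    }

    # Emit JSON so the manifest can be piped to `kubectl create -f -`.
    print(json.dumps(crd, indent=2))

Registering such a CRD and letting the built-in validation do the work is exactly the kind of thing that previously pushed us towards a custom API server.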

Network

Networking was also an area we spent a lot of time on. It still is a difficult topic.

Traditionally (this word is used intentionally) VMs are connected in a layer 2 or layer 3 fashion. However, as we (really will) run VMs in pods, there are several ways to connect a VM to a Kubernetes network:

  • Layer 2 - Does not exist as a concept in Kubernetes
  • Layer 3 - We’d hijack the pod’s IP and hand it over to the VM (see the sketch after this list)
  • Layer 4 - This is how applications behave within a pod - if we make VMs work this way, we are most compatible, but we restrict the functionality of a VM
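To give an idea of what the layer 3 option could mean in practice, here is a rough sketch - one possible approach assumed for illustration, not KubeVirt’s actual implementation - of “hijacking” the pod’s IP inside the VM pod’s network namespace: move the address off eth0 onto a bridge and attach a tap device for qemu. Interface names and the address handover are simplified.

    #!/usr/bin/env python3
    """Rough sketch of the 'layer 3' idea: take the IP address Kubernetes
    gave to the VM pod's eth0 and hand it over to the VM via a bridge and
    a tap device. Illustration only, not KubeVirt's implementation; it
    would have to run inside the VM pod's network namespace with
    CAP_NET_ADMIN, and the interface names are assumptions."""
    import re
    import subprocess

    def sh(*cmd):
        subprocess.run(cmd, check=True)

    # Discover the address Kubernetes assigned to the pod's eth0.
    out = subprocess.run(["ip", "-4", "addr", "show", "dev", "eth0"],
                         check=True, capture_output=True, text=True).stdout
    pod_cidr = re.search(r"inet (\S+)", out).group(1)   # e.g. 10.244.1.12/24

    # Build a bridge, enslave eth0, and add a tap device for qemu.
    sh("ip", "link", "add", "br0", "type", "bridge")
    sh("ip", "tuntap", "add", "dev", "tap0", "mode", "tap")
    sh("ip", "link", "set", "eth0", "master", "br0")
    sh("ip", "link", "set", "tap0", "master", "br0")
    sh("ip", "addr", "del", pod_cidr, "dev", "eth0")    # "hijack" the pod IP
    for dev in ("br0", "tap0"):
        sh("ip", "link", "set", dev, "up")

    # qemu attaches to tap0, and the guest takes over the pod's address
    # (for example via a small DHCP server handing out pod_cidr).
    print(f"tap0 ready; the guest should take over {pod_cidr}")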

Besides this, VMs often have multiple interfaces - and Kubernetes does not have a concept for that.

Thus it’s a mix of technical and conceptual problems by itself. It becomes even more complex if you consider that the traditional layer 2 and layer 3 connectivity helped to solve certain problems, but in the cloud native world the same problems might be solved in a different way and in a different place. The challenge here is to understand what problems were solved back then, and how they could be solved today, in order to understand whether features like layer 2 connectivity are really needed.

Forward

As you see, or read, we’ve spent a lot of time on researching, PoCing, and failing. Today it looks different, and I’m positive about the future.

On a broader front:

This post got longer than expected and it is still missing a lot of details, but you can see that we are moving on, as we still see the need to give users a good way to migrate their workloads from today to tomorrow.

There are probably also some mistakes - feel free to give me a ping, or ignore them in a friendly fashion.

December 21st, 2017 1:30pm · kubevirt · kubernetes · libvirtd · cinder · crd · uas · fedora · virtualization