Connecting virtual machines to Kubernetes in a friendly way.
We can run them, but how can we connect them? That’s what we’ve been thinking about in KubeVirt for a while. We can run VMs (and even migrate them) on Kubernetes, but so far networking has been private: the stock QEMU user-space SLIRP stack was used to connect the VMs to the outside world. This allowed the VMs to reach out, but the world could not reach them.
The goal is to be able to speak to a KubeVirt VM just like you can speak to any other pod.
With KubeVirt we aim to be good Kubernetes citizens, which brings a few constraints, mainly:
- Networking must work from within a Pod, as KubeVirt is deployed in Pods only
- Networking must work with the default Kubernetes networking mechanism (CNI), so that we stay independent of the specific CNI network plugin in use
But because we want to rely on libvirtd for virtualization, we also have constraints from that side, mainly:
- Networking approach must be supported by libvirtd
And these constraints have a few implications:
- We need a Pod NIC for each VM NIC, and each VM NIC can only have one IP address
- This further implies that each VM NIC has a pre-defined IP address - the one from the Pod NIC
Now, let’s tackle the story step by step.
A pod just has one NIC! Yes, that’s true. But there are already discussions about how to change this. And besides that, CNI can already add additional NICs to arbitrary containers.
The initial state can be imagined like this:
libvirtd pod:
- eth0
Get a new NIC into a Pod
We cannot assume that libvirt is installed on every host, thus KubeVirt needs to ship it in a Pod. Every VM we run will be launched in the network namespace of the libvirt Pod.
Thus: For each virtual NIC which is used by any VM managed by libvirt, we need to create a dedicated NIC within the Pod.
How do we do this? For now we work with the temporary assumption that the CNI plugin binaries are available on the host, because we expect that in the future the Kubernetes API will provide a way to attach multiple NICs to Pods.
So, how do we do it? Generally speaking, we just call the CNI plugin on the host and ask it to create another NIC inside our container, or rather, inside the network namespace of our container.
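To illustrate the mechanics: a CNI plugin is just an executable driven by CNI_* environment variables and a network configuration on stdin. Here is a minimal sketch in Go; the plugin (bridge), paths, container ID, PID, and subnet are all made-up placeholders, not the actual KubeVirt code.

```go
package main

import (
    "bytes"
    "fmt"
    "os"
    "os/exec"
)

// Network configuration handed to the plugin on stdin. The "bridge" plugin is
// only an example; any configured CNI plugin is invoked the same way.
const netConf = `{
    "cniVersion": "0.3.1",
    "name": "vm-secondary",
    "type": "bridge",
    "bridge": "cni1",
    "ipam": { "type": "host-local", "subnet": "10.32.0.0/24" }
}`

func main() {
    cmd := exec.Command("/opt/cni/bin/bridge")
    cmd.Stdin = bytes.NewBufferString(netConf)
    cmd.Env = append(os.Environ(),
        "CNI_COMMAND=ADD",              // create a new interface
        "CNI_CONTAINERID=libvirtd-pod", // placeholder container ID
        "CNI_NETNS=/proc/12345/ns/net", // netns of the libvirtd Pod (placeholder PID)
        "CNI_IFNAME=eth42",             // name of the new Pod NIC
        "CNI_PATH=/opt/cni/bin",
    )
    out, err := cmd.CombinedOutput()
    if err != nil {
        fmt.Fprintf(os.Stderr, "CNI ADD failed: %v\n%s", err, out)
        os.Exit(1)
    }
    // On success the plugin prints a JSON result with the assigned IP and routes.
    fmt.Printf("%s", out)
}
```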
One issue is that currently all NICs connected to a Pod end up on the same IP network, which causes routing confusion. To solve this, we remove the routes which are related to this new Pod NIC.
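A sketch of that route cleanup, assuming the github.com/vishvananda/netlink library and the eth42 interface name from above:

```go
package main

import (
    "log"

    "github.com/vishvananda/netlink"
)

// dropRoutes removes all IPv4 routes pointing at the given Pod NIC, so the
// extra NIC does not compete with eth0 for the Pod's regular traffic.
func dropRoutes(ifname string) error {
    link, err := netlink.LinkByName(ifname)
    if err != nil {
        return err
    }
    routes, err := netlink.RouteList(link, netlink.FAMILY_V4)
    if err != nil {
        return err
    }
    for _, route := range routes {
        if err := netlink.RouteDel(&route); err != nil {
            return err
        }
    }
    return nil
}

func main() {
    if err := dropRoutes("eth42"); err != nil {
        log.Fatal(err)
    }
}
```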
The new state is now:
libvirtd pod:
- eth0
- eth42 (mac42)
Represent the Pod NIC in libvirt
Now that we’ve got the new NIC inside the Pod, to be used with a virtual NIC of a VM, we need to find a way to represent this NIC in a form which libvirt can consume. We could use the direct attachment type, where a macvtap device is attached directly to the NIC. This would work, as long as we don’t need to speak to the host, except for the fact that we’ve got no way to provide the pre-defined IP address to the VM. To achieve this we need to rely on DHCP. And the only way to provide an IP to a VM in libvirt is to create a network, with a DHCP server which will hand that IP out to the VM.
Thus we create a libvirt network for the Pod, including an entry which maps the MAC address of the Pod NIC to the IP of the Pod NIC. There is an interesting fact here: we map the Pod MAC address to the Pod IP. This means that the VM will later have a NIC with the same MAC and IP as the Pod NIC had. This is necessary to ensure that all packets get to the VM. But to allow all packets to travel to the VM, we also need to ensure that this MAC and IP are not in use anywhere along the path. This means we need to remove the IP from the Pod NIC and change its MAC, because if the Pod NIC kept them, packets addressed to the VM would end up at the Pod NIC instead (because MAC and IP match).
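The corresponding plumbing could look roughly like this, again assuming the github.com/vishvananda/netlink library; both MAC addresses are arbitrary placeholders:

```go
package main

import (
    "log"
    "net"

    "github.com/vishvananda/netlink"
)

// releaseAddress strips the IP from the Pod NIC and gives it a new MAC, so the
// original MAC/IP pair becomes free to be reused by the VM NIC.
func releaseAddress(ifname, newMAC string) error {
    link, err := netlink.LinkByName(ifname)
    if err != nil {
        return err
    }
    // Remove all IPv4 addresses; the VM will receive the original IP via DHCP.
    addrs, err := netlink.AddrList(link, netlink.FAMILY_V4)
    if err != nil {
        return err
    }
    for _, addr := range addrs {
        if err := netlink.AddrDel(link, &addr); err != nil {
            return err
        }
    }
    // Give the Pod NIC a different MAC, so it no longer answers for the VM's MAC.
    hw, err := net.ParseMAC(newMAC)
    if err != nil {
        return err
    }
    return netlink.LinkSetHardwareAddr(link, hw)
}

func main() {
    // "ca:fe:00:00:00:42" stays reserved for the VM; the Pod NIC gets a new one.
    if err := releaseAddress("eth42", "52:54:00:12:34:56"); err != nil {
        log.Fatal(err)
    }
}
```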
Now, what did we do? We removed the IP from the Pod NIC and changed its MAC, then we created a libvirt network. That network distills into a bridge connected to the Pod NIC, plus a dnsmasq instance providing all DNS and IP information to the VM once it comes up. The delivered IP will be the former IP of the Pod NIC.
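As a sketch, such a libvirt network, with the DHCP host entry pinning the Pod’s former MAC to its former IP, could be defined and started like this. The libvirt-go bindings are assumed, and the network name, addresses, and MAC are placeholders.

```go
package main

import (
    "log"

    libvirt "github.com/libvirt/libvirt-go"
)

// The <dhcp><host/> entry maps the former Pod MAC to the former Pod IP, so
// dnsmasq hands out exactly that address to the VM. How eth42 gets enslaved
// to br-eth42 is left out of this sketch.
const networkXML = `
<network>
  <name>net-eth42</name>
  <bridge name='br-eth42'/>
  <ip address='10.32.0.1' netmask='255.255.255.0'>
    <dhcp>
      <host mac='ca:fe:00:00:00:42' ip='10.32.0.5'/>
    </dhcp>
  </ip>
</network>`

func main() {
    conn, err := libvirt.NewConnect("qemu:///system")
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()

    network, err := conn.NetworkDefineXML(networkXML)
    if err != nil {
        log.Fatal(err)
    }
    // Starting the network creates br-eth42 and spawns dnsmasq on it.
    if err := network.Create(); err != nil {
        log.Fatal(err)
    }
}
```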
After starting the network, the new state is:
libvirtd pod:
- eth0
- eth42
- br-eth42
# BR: DHCP and host-to-guests
Connect the VM to the network
Connecting the VM is now trivial: we use the regular libvirt network source to connect the VM to the previously defined network. The only important part is to assign the correct MAC address to the VM NIC. As a reminder, that is the MAC address which used to belong to the Pod NIC and is now used for the VM.
With this we effectively moved the IP endpoint from the Pod NIC to the virtual NIC of the VM.
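For illustration, such an interface definition could be attached to a domain as shown below, reusing the placeholder MAC and network name from the sketches above. The domain name "testvm" is made up, and in the POC the interface would rather be part of the initial domain XML than hot-plugged.

```go
package main

import (
    "log"

    libvirt "github.com/libvirt/libvirt-go"
)

// Interface fragment of the domain XML: a 'network' type source pointing at
// the previously defined libvirt network, with the former Pod MAC pinned.
const interfaceXML = `
<interface type='network'>
  <source network='net-eth42'/>
  <mac address='ca:fe:00:00:00:42'/>
  <model type='virtio'/>
</interface>`

func main() {
    conn, err := libvirt.NewConnect("qemu:///system")
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()

    dom, err := conn.LookupDomainByName("testvm")
    if err != nil {
        log.Fatal(err)
    }
    // Hot-plug the NIC into the running domain.
    if err := dom.AttachDevice(interfaceXML); err != nil {
        log.Fatal(err)
    }
}
```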
After VM boot the new state is:
libvirtd pod:
- eth0
- eth42
- br-eth42
- macvtap-42 (mac42)
What happens on VM boot?
Once the VM boots, a few things happen (okay, some of these things already happen when you start the libvirt network).
A bridge will be created by libvirt, with the Pod NIC attached as a slave. A dnsmasq instance will be spawned, providing DHCP to the VM. The VM will be instantiated and connected to the bridge via a tun device.
Once the VM has booted up, it can retrieve its IP using DHCP. This will also include the correct DNS information the Pod is using.
Once a ping is sent from the VM, it travels through the tun device to the bridge, and from there through its slave, the Pod NIC, into the CNI network. The CNI network now sees packets coming from this NIC, and it cannot tell that they come from a VM.
There is thus no difference in how the Pod traffic looks, so how could the CNI network be irritated by it? It isn’t, and that’s nice.
Note: Rancher VM takes a similar approach, except that it does not use libvirt, and that it replaces the original Pod NIC instead of adding additional ones.
Code? Just POC.
June 10th, 2017 11:09pm | Tags: kubernetes, libvirt, kubevirt, network, rancher