Overlay Network and Calico
An overlay network abstracts the physical network to create a virtual network. There are two kinds of overlays:
a) Virtual Extensible LAN (VXLAN) based overlays
b) Border Gateway Protocol (BGP) based overlays.
To stay consistent and reliable, overlays store their metadata and other state in key-value (KV) stores such as etcd. DNS allows containers to reference each other by service name rather than IP address. Multiple overlay solutions are available, such as Calico and Weave Net.
Network plugins in Kubernetes come in two flavors:
a) CNI plugins
b) Kubenet plugin
Kubenet is a very basic, simple network plugin, on Linux only. It does not, of itself, implement more advanced features like cross-node networking or network policy. It is typically used together with a cloud provider that sets up routing rules for communication between nodes, or in single-node environments.
Kubenet creates a Linux bridge and a veth pair for each pod, with the host end of each pair connected to this bridge. The pod end of the pair is assigned an IP address allocated from a range assigned to the node, either through configuration or by the controller-manager. The bridge is assigned an MTU matching the smallest MTU of an enabled normal interface on the host.
There are multiple third-party CNI plugins, such as Flannel, Calico, Romana, and Weave Net.
Calico is deployed as a daemonset on the Kubernetes cluster. The daemonset construct of Kubernetes ensures that Calico runs on each node of the cluster. Calico is made up of the following interdependent components:
1. Felix, the primary Calico agent that runs on each machine that hosts endpoints.
2. Orchestrator plugin, orchestrator-specific code that tightly integrates Calico into that orchestrator.
3. etcd, the data store, which stores the data for the Calico network in a distributed, consistent, fault-tolerant manner and ensures that the Calico network is always in a known-good state.
4. BIRD, a BGP client that distributes routing information. Calico deploys a BGP client on every node that also hosts a Felix. The role of the BGP client is to read the routing state that Felix programs into the kernel and distribute it around the data center. When Felix inserts routes into the Linux kernel FIB, the BGP client picks them up and distributes them to the other nodes in the deployment. This ensures that traffic is efficiently routed around the deployment. It is this BGP client that lets containers in pods, on the same node and across nodes, communicate with each other by creating a BGP mesh. In large networks the full mesh creates overhead, so a BGP Route Reflector is used instead.
5. BGP Route Reflector (BIRD), an optional BGP route reflector for higher scale. This component is deployed in large deployments.
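To make the scaling argument for the route reflector concrete, the following minimal sketch counts BGP sessions in a full mesh and in a reflector topology. The function names are illustrative, and the reflector layout (every client peers with every reflector, and the reflectors peer with each other) is an assumption, not a description of how ICP configures Calico.

```python
def full_mesh_sessions(nodes: int) -> int:
    """BGP sessions when every node peers with every other node."""
    return nodes * (nodes - 1) // 2


def reflector_sessions(clients: int, reflectors: int) -> int:
    """BGP sessions when clients peer only with route reflectors.

    Assumption: every client peers with every reflector, and the
    reflectors form a full mesh among themselves.
    """
    return clients * reflectors + full_mesh_sessions(reflectors)


if __name__ == "__main__":
    # Compare both topologies for a few cluster sizes, with 2 reflectors.
    for size in (10, 100, 1000):
        print(size, full_mesh_sessions(size), reflector_sessions(size, 2))
```

The full mesh grows quadratically with the number of nodes, while the reflector topology grows linearly, which is why the reflector is reserved for large deployments.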
Calico uses BGP to deploy overlays and performs layer 3 forwarding at each compute node at the kernel level. Calico sets up a mesh of BGP peers, where the peers are the hosts that make up the cluster. Each BGP peer advertises container routes to all other peers. When peers receive the route information, they update their routing tables. Calico’s solution is to use layer 3 networking all the way up to the containers: no Docker bridges, no NAT, just pure routing rules and iptables. Calico can use IP-in-IP or VXLAN tunnels. In ICP, Calico makes use of IP-in-IP; the details are discussed in the next section.
The Calico deployment in ICP can be verified with the command below.
etcd is the backend data store for all the information Calico needs. It is recommended to deploy a separate etcd for production systems. The next key component in the Calico stack is BIRD, a BGP routing daemon that Calico uses to propagate routes between hosts. BIRD runs on every host in the Kubernetes cluster, usually as a DaemonSet, and is included in the calico/node container.
Calico implements policy-driven network security by leveraging iptables.
ICP takes care of all installation and configuration aspects of Calico during its own installation, so there is no need to install Calico separately.
We will be using two utilities, jq and fping, in this article. To install them, run the commands below on each node participating in ICP:
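Tunnelling has a cost: each encapsulated packet carries extra headers, so the MTU available to pods is smaller than the MTU of the underlying network. The sketch below works through that arithmetic. It assumes an IPv4 underlay and the standard header sizes; the function and table names are made up for illustration.

```python
# Bytes added to each packet by the tunnel (IPv4 underlay assumed).
ENCAP_OVERHEAD = {
    "none": 0,
    "ipip": 20,   # one extra IPv4 header without options
    "vxlan": 50,  # outer IPv4 20 + UDP 8 + VXLAN 8 + inner Ethernet 14
}


def pod_mtu(underlay_mtu: int, encapsulation: str) -> int:
    """Largest packet a pod can send without fragmentation in the tunnel."""
    try:
        overhead = ENCAP_OVERHEAD[encapsulation]
    except KeyError:
        raise ValueError(f"unknown encapsulation: {encapsulation!r}") from None
    return underlay_mtu - overhead


if __name__ == "__main__":
    for mode in ENCAP_OVERHEAD:
        print(mode, pod_mtu(1500, mode))
```

On a standard 1500-byte Ethernet underlay this gives 1480 bytes for IP-in-IP and 1450 bytes for VXLAN.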
apt install jq
apt install fping
IBM Cloud Private Networks
IBM Cloud Private has the following networks: Node and Pod. The Node network is the internal network that all the nodes are part of, and it is provided by the customer data center or the cloud where the ICP infrastructure is set up. The physical machines or nodes that participate in an ICP cluster may be multihomed or single-homed, with interfaces connected to public and private networks. The public network provides internet-addressable IP addresses to hosts, whereas the private network provides privately accessible IP addresses. The IP addresses of physical nodes are either assigned dynamically by a centralized DHCP server or configured statically, based on customer requirements.
The Pod network is provided by the CNI plugin of ICP, which is Calico in our case. A CNI plugin is responsible for inserting a network interface into the container network namespace (e.g. one end of a veth pair) and making any necessary changes on the host (e.g. attaching the other end of the veth to a bridge). It then assigns an IP address to the interface and sets up routes consistent with IP address management (IPAM) by invoking the appropriate IPAM plugin. Each networking plugin has its own approach to IPAM. At a high level, Calico uses IP pools to define which IP ranges are valid for allocating pod IP addresses; the subnet CIDR range is configured by the administrator. The IP pools are subdivided into smaller chunks, called blocks, which are then assigned to particular nodes in the cluster. Blocks are allocated dynamically to nodes as the number of running pods grows or shrinks. In ICP this is handled internally by taking the CIDR details from config.yaml during ICP installation.
Felix creates a virtual network interface and assigns an IP address from the Calico IPAM for each Pod. This interface carries the prefix cali unless specified otherwise. This ensures that the Pods carry a routable IP address and that packets are routed appropriately. Felix is also responsible for cleaning up the interfaces when a Pod is evicted. Felix exposes metrics that are used for instance state reporting via a monitoring tool such as Prometheus. Felix is also responsible for network policy enforcement: it monitors the labels on the Pods and compares them against the defined network policy objects to decide whether to allow or deny traffic to the Pod.
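The pool-and-block scheme can be sketched with the standard ipaddress module. This is a simplified model and not Calico’s actual IPAM code: the class name, the 10.1.0.0/16 pool, and the /26 block size are assumptions chosen for illustration, and releasing addresses when pods go away is not modelled.

```python
import ipaddress


class BlockIpam:
    """Toy IPAM: carve a pool into blocks and hand blocks to nodes on demand."""

    def __init__(self, pool_cidr, block_prefix=26):
        pool = ipaddress.ip_network(pool_cidr)
        # Lazy generator over every block in the pool that is still unclaimed.
        self._unclaimed = pool.subnets(new_prefix=block_prefix)
        self._blocks = {}  # node name -> list of blocks claimed by that node
        self._free = {}    # node name -> iterator over its newest block

    def blocks(self, node):
        """Blocks currently claimed by the given node."""
        return list(self._blocks.get(node, []))

    def allocate(self, node):
        """Return the next free pod address for the node, claiming a block if needed."""
        ip = next(self._free.get(node, iter(())), None)
        if ip is None:
            block = next(self._unclaimed, None)
            if block is None:
                raise RuntimeError("IP pool exhausted")
            self._blocks.setdefault(node, []).append(block)
            self._free[node] = iter(block)
            ip = next(self._free[node])
        return ip


if __name__ == "__main__":
    ipam = BlockIpam("10.1.0.0/16")
    print(ipam.allocate("worker1"), ipam.allocate("worker2"))
    print(ipam.blocks("worker1"))
```

A node claims a new block only when its current block is full, so the number of blocks per node follows the number of pods scheduled there.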
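The label comparison behind policy enforcement can be sketched as follows. This models only the selection logic of a Kubernetes network policy (label equality, ingress only, no ports or namespaces) and is not how Felix is implemented, since Felix programs iptables rules. The dictionary layout and the function names are assumptions made for this example.

```python
def selects(selector, labels):
    """True when every key/value pair in the selector is present in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def ingress_allowed(policies, src_labels, dst_labels):
    """Decide whether traffic from a source pod may reach a destination pod."""
    applicable = [p for p in policies if selects(p["pod_selector"], dst_labels)]
    if not applicable:
        # A pod that no policy selects accepts all traffic.
        return True
    # Once selected, a pod accepts only what at least one policy allows.
    return any(
        selects(peer, src_labels)
        for policy in applicable
        for peer in policy["allow_from"]
    )


POLICIES = [
    {"pod_selector": {"app": "redis"}, "allow_from": [{"app": "guestbook"}]},
]

if __name__ == "__main__":
    print(ingress_allowed(POLICIES, {"app": "guestbook"}, {"app": "redis"}))
    print(ingress_allowed(POLICIES, {"app": "busybox"}, {"app": "redis"}))
```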
Pods are the smallest unit of deployment in Kubernetes. A Pod can be scheduled on one of the many nodes in a cluster and has a unique IP address.
calixxx and caliyyy are abstractions that can be used to create tunnels between a pod’s network namespace and the physical network in another namespace. In other words, cali….@.. is the interface between the host and the container. Kubernetes first creates the network namespace for the pod before invoking any plugins. This is done by creating a pause container that serves as the “parent container” for all of the containers in the pod. Kubernetes then invokes the CNI plugin to join the pause container to a network. All containers in the pod use the pause container’s network namespace (netns). The CNI plugin sets up a veth pair (calixxx and caliyyy) to connect the pod to the host. To allocate L3 information such as IP addresses to pods, an IPAM plugin (ipam) is called. The screenshot below shows the Calico veth interfaces associated with containers running on the master node.
As mentioned above, ICP makes use of Calico BGP-based overlays and hence creates a BGP mesh between all participating nodes, as seen below:
The Peer Address (HostIP) to host name mappings below indicate a mesh between the master node and all other nodes participating in the ICP cluster.
The IP address of the tunnel can be found by running the command below on each participating node. For example, the command below is run on Worker#4.
The overall node communication architecture is depicted below. BGP peers interact with each other through IP-in-IP tunnels between the nodes, labelled tunl0, thereby creating a mesh. The BGP peer endpoints are the calico/node DaemonSet pods, which run BIRD alongside Felix on each node.
Kubernetes requires that nodes be able to reach each pod, even though pods are in an overlay network. Similarly, pods should be able to reach any node. We need host routes on the nodes so that pods and nodes can talk to each other. Since inter-host pod-to-pod traffic should not be visible in the underlay, we need a virtual/logical network that is overlaid on the underlay. Pod-to-pod traffic needs to be encapsulated at the source node. The encapsulated packet is then forwarded to the destination node, where it is decapsulated. A solution can be built around any existing Linux encapsulation mechanism. We need a tunnel interface (with VXLAN, GRE, etc. encapsulation) and a host route such that inter-node pod-to-pod traffic is routed through the tunnel interface. We will test these communications in the next section.
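How a node decides between local delivery and the tunnel comes down to a longest-prefix match over its host routes. The sketch below models that lookup. The route table is hypothetical: the addresses and the ens3 device name are invented for illustration, and only cali77f2275 and tunl0 are interface names that appear in this article.

```python
import ipaddress


def select_route(routes, destination):
    """Return the most specific route that contains the destination address."""
    dst = ipaddress.ip_address(destination)
    best = None
    for route in routes:
        network = ipaddress.ip_network(route["dst"])
        if dst in network and (best is None or network.prefixlen > best[0]):
            best = (network.prefixlen, route)
    return None if best is None else best[1]


# Hypothetical host routes on one node.
ROUTES = [
    {"dst": "0.0.0.0/0", "dev": "ens3", "via": "192.168.0.1"},       # default route
    {"dst": "10.1.0.5/32", "dev": "cali77f2275"},                    # pod on this node
    {"dst": "10.1.0.64/26", "dev": "tunl0", "via": "192.168.0.12"},  # block on a peer node
]

if __name__ == "__main__":
    for target in ("10.1.0.5", "10.1.0.70", "8.8.8.8"):
        print(target, select_route(ROUTES, target)["dev"])
```

Traffic to a local pod leaves through that pod’s veth, traffic to a pod on another node matches the peer’s block and enters the tunnel, and everything else follows the default route.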
Exploring ICP Network
Let’s try to understand ICP networking through the available Kubernetes examples. Let’s clone the Kubernetes examples repository.
After installing the guestbook sample Kubernetes application, we can see two pods being created and fresh routes being added to the route table.
One can verify the status of all pods in one go with the jq and fping commands, as shown below:
Getting IP address of Pods of guestbook application
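For readers who prefer a script to a jq filter, the same extraction can be sketched in Python. The function reads a pod list in the JSON shape returned by kubectl get pods -o json and collects the podIP of every running pod; the resulting addresses could then be passed to a tool such as fping. The sample document, pod names, and addresses are invented for illustration.

```python
import json


def running_pod_ips(pod_list_json):
    """Return (pod name, pod IP) pairs for running pods in a pod list document."""
    doc = json.loads(pod_list_json)
    result = []
    for pod in doc.get("items", []):
        status = pod.get("status", {})
        # Pods that are still pending may not have an address yet.
        if status.get("phase") == "Running" and status.get("podIP"):
            result.append((pod["metadata"]["name"], status["podIP"]))
    return result


# Hypothetical, heavily trimmed pod list.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": "frontend-1"},
         "status": {"phase": "Running", "podIP": "10.1.0.5"}},
        {"metadata": {"name": "redis-slave-1"},
         "status": {"phase": "Running", "podIP": "10.1.0.70"}},
        {"metadata": {"name": "frontend-2"},
         "status": {"phase": "Pending"}},
    ]
})

if __name__ == "__main__":
    for name, ip in running_pod_ips(SAMPLE):
        print(name, ip)
```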
Each Kubernetes pod gets assigned its own network namespace. Network namespaces (or netns) are a Linux networking primitive that provide isolation between network devices.
It can be useful to run commands from within a pod’s netns, to check DNS resolution or general network connectivity. To do so, we first need to look up the process ID of one of the containers in a pod. For Docker, we can do that with a series of two commands. First, list the containers running on a node:
The output above shows two kinds of containers:
1. The first two containers are the redisslave and frontend containers running in their respective pods.
2. The others are the pause containers running in the redisslave and frontend pods. A pause container exists solely to hold onto the pod’s network namespace.
The nsenter tool has been part of the util-linux package since version 2.23. It provides access to the namespaces of another process and requires root privileges to work properly. Its main benefit is for debugging and external audits; for remote access, docker exec is the currently recommended approach. You cannot run nsenter inside the container that you want to access, so you need to run it on the host machine. By running nsenter on a host machine, you can access all of the containers on that host. However, you cannot use nsenter running on one host, say host A, to access the containers on host B.
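The two-step procedure (look up the container’s process ID, then enter its network namespace) can be sketched as a small wrapper. The helper names and the container ID in the usage note are hypothetical. The helper that actually runs the commands needs root privileges and a Docker host, so it is defined here but not called.

```python
import subprocess


def pid_command(container_id):
    """Command that prints the host PID of a container's main process."""
    return ["docker", "inspect", "--format", "{{ .State.Pid }}", container_id]


def netns_command(pid, command):
    """Command that runs `command` inside the network namespace of `pid`."""
    return ["nsenter", "--target", str(pid), "--net", *command]


def run_in_pod_netns(container_id, command):
    """Run a command in a container's netns and return its output (root only)."""
    pid = subprocess.run(
        pid_command(container_id), check=True, capture_output=True, text=True
    ).stdout.strip()
    return subprocess.run(
        netns_command(pid, command), check=True, capture_output=True, text=True
    ).stdout


if __name__ == "__main__":
    # Only prints the commands; nothing is executed here.
    print(" ".join(pid_command("abc123")))
    print(" ".join(netns_command(4242, ["ip", "addr"])))
```

Because all containers in a pod share one network namespace, the PID of any container in the pod, including the pause container, leads to the same view of the network.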
Finding a Pod’s Virtual Ethernet Interface
Each pod’s network namespace communicates with the node’s root netns through a virtual Ethernet pipe. On the node side, this pipe appears as a device whose name typically begins with veth, or with cali when Calico is the CNI plugin, and ends in a unique identifier, such as cali77f2275. Inside the pod, this pipe appears as eth0.
It can be useful to correlate which veth device is paired with a particular pod. To do so, we will list all network devices on the node, then list the devices in the pod’s network namespace. We can then correlate device numbers between the two listings to make the connection.
As seen in the previous image, eth0 has if35 appended to it, which means that the pod’s eth0 is linked to the node’s interface with index 35.
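The correlation between the two listings can be automated by parsing the interface index and the @if suffix from ip link style output. The sketch below works on captured text; the two sample listings are hypothetical and trimmed, with cali77f2275 and if35 taken from this article and every other value invented.

```python
import re

# Matches lines such as "3: eth0@if35: <...>" and captures index, name, peer index.
LINK_RE = re.compile(
    r"^(\d+):\s+([^@:\s]+)(?:@(?:if(\d+)|[^:\s]+))?:", re.MULTILINE
)


def parse_links(text):
    """Return (index, name, peer index or None) for each interface in the listing."""
    return [
        (int(index), name, int(peer) if peer else None)
        for index, name, peer in LINK_RE.findall(text)
    ]


def host_veth_for(pod_listing, host_listing, pod_ifname="eth0"):
    """Find the host interface whose index equals the pod interface's peer index."""
    peers = [peer for _, name, peer in parse_links(pod_listing) if name == pod_ifname]
    if not peers or peers[0] is None:
        raise LookupError(f"{pod_ifname} has no peer index")
    for index, name, _ in parse_links(host_listing):
        if index == peers[0]:
            return name
    raise LookupError(f"no host interface with index {peers[0]}")


POD_LINKS = """\
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
3: eth0@if35: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
    link/ether 0a:58:0a:01:00:05 brd ff:ff:ff:ff:ff:ff
"""

HOST_LINKS = """\
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP
4: tunl0@NONE: <NOARP,UP,LOWER_UP> mtu 1480 qdisc noqueue state UNKNOWN
35: cali77f2275@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
"""

if __name__ == "__main__":
    print(host_veth_for(POD_LINKS, HOST_LINKS))
```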
Connection Tracking with Conntrack
A common problem on Linux systems is running out of space in the conntrack table, which can cause poor iptables performance. This can happen if you run a lot of workloads on a given host, or if your workloads create a lot of TCP connections or bidirectional UDP streams.
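One way to watch for this condition is to compare the current number of tracked connections with the table limit. On Linux hosts with the nf_conntrack module loaded, both values are exposed under /proc/sys/net/netfilter. The helper names and the 80 percent warning threshold below are arbitrary choices for this sketch.

```python
from pathlib import Path

PROC_DIR = Path("/proc/sys/net/netfilter")


def parse_counter(text):
    """Convert the content of a /proc counter file to an integer."""
    return int(text.strip())


def is_near_limit(count, limit, threshold=0.8):
    """True when the table is filled to at least `threshold` of its limit."""
    return count >= threshold * limit


def read_conntrack_usage(proc_dir=PROC_DIR):
    """Read (count, limit) from /proc; requires the nf_conntrack module."""
    count = parse_counter((proc_dir / "nf_conntrack_count").read_text())
    limit = parse_counter((proc_dir / "nf_conntrack_max").read_text())
    return count, limit


if __name__ == "__main__":
    if (PROC_DIR / "nf_conntrack_count").exists():
        count, limit = read_conntrack_usage()
        print(count, limit, is_near_limit(count, limit))
    else:
        print("nf_conntrack counters are not available on this host")
```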
Let’s look at another use case that generates some flow information to inspect with the conntrack command. We will use two identical busybox containers managed by a Replication Controller.
Let’s verify the basic Kubernetes requirements for node and pod connectivity:
1. A node should be able to communicate with all pods without NAT
2. A pod should be able to communicate with all nodes without NAT
3. A pod should be able to communicate with all pods without NAT
Case1: Let’s ping from a container in one busybox pod to the other busybox pod.
Pinging other busybox pods on other nodes
Case2: Node to Pod communication
Case3: Pod to Node communication