Narrowing down networking issues in K8s

Sometimes the problem lies in the abyss, sometimes it's in the pod.

Figuring out networking issues can often be a real pain. Add the complexity of Kubernetes on top, and the problem can become daunting.

Let's take the following as an example. We have a pod running in a cluster, and when it tries to connect to a service outside the cluster, all we get back is:

Error: Connection timeout

There is not a lot of information to work with; clearly the request is having trouble connecting outbound, but with nothing more to go on we have to work a little harder.

First Steps - Check if the host is working

There are a couple of obvious things we should check before digging too deep. First things first, we check from a different machine that we can connect to the host and the port. We can check the host with ping, and the port with telnet or netcat.

ping SERVER

If you can ping, you can check the port with either:

nc -zv SERVER PORT     # -z: just check whether the port is open, -v: verbose
telnet SERVER PORT

If you don't get a response to either of these, you can debug this from wherever you are, as the problem isn't isolated to the cluster or its nodes. It could be:

  • That the service is down on that port (if you can ping but cannot telnet)
  • That the server is down
  • That you have DNS issues (a quick check is shown just after this list)
  • That there is something wrong with the route to the server (see checking mtr below)
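
For the DNS case in particular, a quick way to check is to resolve the name yourself and see whether you get an address you expect:

getent hosts SERVER     # resolves the name the same way most programs do, via the system resolver
nslookup SERVER         # queries the DNS server directly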

If you get a response, you can continue as it's clearly not an issue with the server you are trying to connect to.

To sanity check that it isn't an issue with the route to the host, you can run mtr to see if the packets are actually making it to the host:

mtr -s 1000 -r -c 20 SERVER     # -s: packet size in bytes, -r: report mode, -c: probes sent to each hop

If the output contains a large loss percentage at some hop, it could be an issue with the routing. Something is getting stuck along the way, and depending on where it is you will have to diagnose further. A problematic hop would look something like this:

HOST: debug-container      Loss%   Snt   Last   Avg  Best  Wrst StDev
|-- foobar.example.com     70.2%  1000    0.5   0.4   0.3  11.0   0.9

In our case, though, the path was clear and there were no blocking points, which means that this likely isn't an issue from outside the network.

Checking if it's the cluster

If you can connect to the remote machine from a different machine, you know that the server you are targeting is not down. Now we can determine whether the problem is specific to K8s or not: re-run the same commands, but from one of the nodes that the cluster runs on. If you get the same response, you know that it's also likely not an issue with the underlying node.
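
If you're not sure which machines those are, kubectl can list the nodes and their addresses (assuming you have kubectl access to the cluster):

kubectl get nodes -o wide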

If you do run into an issue at this stage, the problem is likely with that node's setup rather than with Kubernetes itself, since it isn't isolated to the cluster.

Checking in the cluster

Since we know it's likely not an issue with the remote server, nor with the underlying node, we need to dig deeper into the cluster itself. Debugging the original pod directly can be awkward, so we can instead start a basic Debian pod in the cluster:

kubectl run -it --rm debug-container --namespace test-namespace --image=debian:stable     # --rm deletes the pod when you exit

Once it has started, you should get a basic Linux command line where you can install some tools.

apt update && apt upgrade
apt install iputils-ping telnet netcat-traditional mtr

We can start by checking if the server is reachable. Try running the following a couple of times to see if it consistently connects.

ping SERVER
telnet SERVER PORT
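
To save yourself some typing, you can loop the port check. A rough sketch (the 20 attempts and 2-second timeout are arbitrary values, not anything Kubernetes-specific):

for i in $(seq 1 20); do nc -zv -w 2 SERVER PORT; done

If only some of the attempts succeed, you are looking at an intermittent failure rather than a hard one.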

In this case, we narrow things down to a more interesting problem: ping works every time, but telnet fails intermittently.

Once we get to an intermittent issue, it actually becomes easier to diagnose (if you can manage to avoid pulling your hair out).

We have now found where it's happening, but since it doesn't fail 100% of the time we need to dive a little deeper. My personal choice is to see whether TCP and UDP behave differently:

nc -vvu SERVER PORT # UDP
nc -vv SERVER PORT # TCP

If you find that there is no difference, then great!

In this example, there was a difference. UDP was working 100% of the time whereas TCP was intermittent.

Success!!!

Now, a key difference between TCP and UDP is that TCP requires acknowledgements (ACKs) as part of the connection.
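
One way to see this on the wire is to watch the handshake with tcpdump while reproducing the failure (you would need to apt install tcpdump in the debug pod first; SERVER and PORT are the same placeholders as above):

tcpdump -ni any "tcp and host SERVER and port PORT"

A healthy connection shows SYN, SYN-ACK and ACK in quick succession; on a failing attempt you would instead see the SYN being retransmitted with nothing coming back, which is exactly what a blocked return path looks like.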

You must allow these ACKs back through your firewall in order for TCP connections to work. A common convention on many devices is to open ports 32768-65535 for returning TCP (ACK) traffic; these are the ephemeral ports the Linux system assigns to the client side of temporary connections.

Normally this works because the system limits itself to those ports. The issue is that Kubernetes can end up using source ports outside of this range, so the returning ACKs never get delivered. You therefore either need to limit the ephemeral ports used by Kubernetes, or widen the firewall rule to allow returning TCP traffic on more ports.
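
If you go down the first of those routes, the relevant kernel setting is the ephemeral port range. A minimal sketch, assuming you have a shell where the outbound connections actually originate (the setting is per network namespace, so a pod does not automatically inherit the node's value):

cat /proc/sys/net/ipv4/ip_local_port_range                # show the current range, e.g. 32768 60999
sysctl -w net.ipv4.ip_local_port_range="32768 65535"      # keep source ports inside the range the firewall allows

To make it persistent, put the setting in /etc/sysctl.conf (or a drop-in under /etc/sysctl.d/). On the Kubernetes side, net.ipv4.ip_local_port_range is also a namespaced sysctl that can be set per pod via the pod's securityContext.sysctls.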