Arie Bregman

Linux And Stuff

OpenStack Neutron: troubleshooting and solving common problems

Important note: this post is based on two great sessions: “I Can’t Ping My VM! Learn How to Debug Neutron and Solve Common Problems” by Rossella Sblendido and “OpenStack Neutron Troubleshooting” by Assaf Muller, so the credit goes to them. I simply gathered the material here in written form and added a little bit of description and some examples. Enjoy =)

Common problems classification

The problems you may experience can be divided into several categories:

  • Misconfiguration – you may experience issues due to wrong settings in the config files used by neutron. Wrong usage of the configuration tools can also cause issues. In addition, a misconfigured underlying network will affect neutron functionality, as every packet eventually goes through the physical network. For example, it can be an external network that isn’t reachable, or a firewall rule that blocks traffic from or to your VMs. So if the underlying network isn’t working, neutron will also fail to work properly.
  • Bug in the code – you may have found a bug in the code. Chances are good that you are not the first to bump into it, so it’s worth checking the project’s bug tracker to see if someone already reported it. If you can’t find the bug there, then you are probably the first one to catch it and you should report it so that the developers can start fixing it.

Issue #1: I can’t ping/ssh my VM using private IP

This is one of the most common issues out there, especially for anyone who is starting to explore the OpenStack world. In order to debug such an issue, it is wise to understand how our VM gets an IP in the first place.

How does a VM get an IP?

In order to answer that we need to introduce the DHCP agent. If you are familiar with networking, you know DHCP is a protocol for distributing different network parameters (including IP addresses).

The DHCP agent communicates with neutron-server over RPC. It ensures network isolation using namespaces, so every network has its own DHCP namespace. Inside this namespace there is a process called ‘dnsmasq’, and it’s the one that actually serves the DHCP parameters, including the IP address. The DHCP agent configures dnsmasq using a lease file.

Let’s look at the IP allocation process in more detail:

[Figure: vm_get_ip]

At the end of the process, the new IP is served and the VM gets its address.

Let’s follow the traffic in more detail. It’s important to understand how our packets flow, since it tells us where to look and hopefully lets us find the issue more quickly.

Neutron has two default implementations: openvswitch and linux bridge. Let’s start with openvswitch:

[Figure: openvswitch_flow]

A little bit of explanation on what we can see in the drawing:

The firewall bridge is a linux bridge. It’s there so that security groups, which are firewall rules implemented using iptables, can be applied. You cannot apply iptables rules to an interface that’s connected to an openvswitch port, which is why we need the firewall bridge in the middle.

The integration bridge (br-int) is in charge of tagging and untagging the traffic coming from and going to the VM, using the VLAN id associated with the network. Every network has a VLAN id, and this VLAN id is used only internally in the compute host to isolate the traffic (that’s why it’s called a local VLAN id).

The tunnel bridge (br-tun) is the bridge in charge of tunneling. It has the flows that translate the local VLAN id assigned to the network into the segmentation id. For example, if you are using GRE tunnels, the GRE tunnel id is the segmentation id assigned to the network.

Now let’s see the flow when using linux bridge:

[Figure: linux_bridge_flow]

In the linux bridge implementation we have one linux bridge for every network. You can see that we have net1 and that the VM is connected to this network. We can also see that the interface plugged into the net1 bridge is eth0.100, meaning VLAN 100 is assigned to the net1 network.

Debugging Steps

First of all, check if the instance is up with ‘nova list’. It may sound trivial, but let’s not skip anything.

The output should be similar to this:

[Figure: nova_list]

In the above output we can see the instance is running. If it wasn’t running, we would want to peek in the logs to get a clue on what went wrong. Looking in the logs is always a wise step, as many issues should be reflected there.
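On a busy cloud the instance list gets long, so it can help to narrow it down to the one VM you care about. Here is a minimal sketch, assuming the legacy nova command-line client is installed and your credentials are already sourced. The helper name instance_status is made up for this example.

```shell
# Hypothetical helper: print the 'nova list' row(s) that mention an instance.
# Assumes the legacy 'nova' client is installed and credentials are loaded.
instance_status() {
    # $1 is the instance name or ID to look for.
    # -F treats it as a plain string, not as a regular expression.
    nova list | grep -F -- "$1"
}

# Usage: instance_status my-vm
# A healthy instance shows ACTIVE as its status and Running as its power state.
```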

Remember, at this point, the issue can be caused by anything. It may not even be directly related to your OpenStack deployment, but rather to your hardware, for example if you don’t have enough disk space or memory for VMs to boot and run. You can verify it by checking the free disk space and memory on the compute host.
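A quick way to look at both resources is sketched below. The helper name host_resources is made up for this example, and /var/lib/nova/instances is only a commonly used location for instance disks, so adjust the path to your setup.

```shell
# Hypothetical helper: show free disk space and memory on a compute host.
host_resources() {
    # Disk usage of the filesystem that holds the given path (default: /),
    # in human-readable units.
    df -h "${1:-/}"
    # Memory usage in megabytes. 'free' ships with procps on most Linux
    # distributions, so skip it quietly where it is missing.
    if command -v free >/dev/null 2>&1; then
        free -m
    fi
}

# Usage (the path is an assumption): host_resources /var/lib/nova/instances
```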

Another common cause of this issue is the default security group rules. By default, ICMP traffic (the protocol used by the ping command) is not allowed, so you may need to add a rule that allows ICMP before you are able to ping the machine.

As mentioned earlier, the underlying physical network may also cause issues. Make sure you are able to ping between the nodes in your environment.
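As a rough sketch, the rules can be added with the legacy neutron client. The helper name allow_ping_and_ssh is made up, and the SSH rule is included only because ssh is part of the issue we are debugging.

```shell
# Hypothetical helper: allow ping (ICMP) and SSH into a security group.
# Assumes the legacy 'neutron' client; $1 is the security group name or ID.
allow_ping_and_ssh() {
    group="${1:-default}"
    # Allow all incoming ICMP, which includes ping's echo request.
    neutron security-group-rule-create --direction ingress \
        --protocol icmp "$group"
    # Allow incoming SSH (TCP port 22).
    neutron security-group-rule-create --direction ingress \
        --protocol tcp --port-range-min 22 --port-range-max 22 "$group"
}

# Usage: allow_ping_and_ssh default
```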

Port binding

If the VM didn’t boot, check if you ran into a port binding failure on the VM port, or on the router or DHCP ports.

For a VM port, it will be logged as a port binding failure, so it will be easy for you to spot. For a DHCP or router port it’s not so easy, since those ports are created asynchronously, meaning you will not see the failure right away. Let’s take routers for example. When you create a router and add a new interface, the operation will succeed even if the ports created behind the scenes entered the binding failure state. That’s because it happens asynchronously.

[Figure: binding_failure2]

There are two reasons this usually happens:

1. The OVS agent was dead when you added a new subnet or a new interface port to your router. This can be easily verified with ‘neutron agent-list’.

In the ‘Open vSwitch agent’ line, under the ‘alive’ column, you would see this: ‘xxx’.
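Since only the dead agents matter here, a small filter over the agent list can save some scrolling. This is a sketch that assumes the legacy neutron client, which prints ‘:-)’ for live agents and ‘xxx’ for dead ones. The helper name dead_agents is made up.

```shell
# Hypothetical helper: print only the agents that Neutron reports as dead.
# 'neutron agent-list' marks live agents with ':-)' and dead ones with 'xxx'
# in the 'alive' column.
dead_agents() {
    # -w matches 'xxx' only as a whole word. grep exits with a non-zero
    # status when nothing matches, which here means no dead agents.
    neutron agent-list | grep -w 'xxx'
}
```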

Another symptom of a dead OVS agent is a missing VLAN tag under the tap device. You can verify it with ‘ovs-vsctl show’, which prints the tag under each port.
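If you already know the port name, you can also ask Open vSwitch for the tag of that single port. A minimal sketch follows; the helper name port_vlan_tag is made up.

```shell
# Hypothetical helper: print the local VLAN tag of an OVS port.
# 'ovs-vsctl get Port NAME tag' prints the tag, or [] when no tag is set.
port_vlan_tag() {
    # $1 is the port name as shown by 'ovs-vsctl show', for example a tap device.
    ovs-vsctl get Port "$1" tag
}

# Usage: port_vlan_tag <name_of_the_port>
```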

At the moment, the only solution for this issue is to recreate the resource.

2. Misconfiguration in your agent or server config files. This usually happens when you are using non-default values in the configuration files.

Did the VM receive an IP?

So now that you know how a VM gets an IP, check if it happened. To check if your VM has an IP, you can simply run ‘ip a’ from the VM console.

No IP? Check if the DHCP agent is up and running with ‘neutron agent-list’.

The above command lists all the agents with their status. If the DHCP agent is up and running, you should see a smiley under the ‘alive’ column, like this: :-)

Next, you can check if dnsmasq is running on the network node for the specific network you are dealing with.

You can also check if the lease file is populated and if your VM’s MAC address is in the host file.
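Both checks can be sketched as below. The helper names are made up, and /var/lib/neutron/dhcp is assumed to be the directory where the DHCP agent keeps its per-network files (it is the usual default, but your state path may differ).

```shell
# Directory where the DHCP agent keeps per-network dnsmasq files.
# /var/lib/neutron/dhcp is the usual default; override it if yours differs.
DHCP_DIR="${DHCP_DIR:-/var/lib/neutron/dhcp}"

# Hypothetical helper: is a dnsmasq process serving this network?
dnsmasq_for_network() {
    # $1 is the network ID, which appears in the file paths on the dnsmasq
    # command line. The [d] trick keeps grep from matching its own process.
    ps -ef | grep '[d]nsmasq' | grep -F -- "$1"
}

# Hypothetical helper: is the VM's MAC address in the network's host file?
mac_in_host_file() {
    # $1 is the network ID, $2 is the MAC address of the VM port.
    grep -i -F -- "$2" "$DHCP_DIR/$1/host"
}
```

Both helpers print the matching lines and return a non-zero status when nothing matches, so they can be used in an ‘if’ as well.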

Still no IP and everything looks fine? Check the DHCP agent log and look for errors and traces.
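Going through a long log by eye is tiring, so here is a small sketch that pulls out only the latest error and trace lines. The helper name recent_errors is made up, and the log path in the usage line is an assumption, since it varies between distributions.

```shell
# Hypothetical helper: show the most recent errors and traces in a log file.
recent_errors() {
    # $1 is the log file; $2 is how many matching lines to show (default 20).
    grep -E 'ERROR|TRACE' "$1" | tail -n "${2:-20}"
}

# Usage (the path is an assumption and differs between distributions):
# recent_errors /var/log/neutron/dhcp-agent.log
```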

At this point, you would also want to make sure there are no cross-node connectivity issues, so try to ping between hosts and VMs. Don’t forget to set a fixed IP first, using the console.

Remember, if you have issues in the underlying network, it will affect neutron. For example, if certain VLAN ids are not allowed in the physical switch, it will be reflected in neutron and you may have connectivity issues.

Still didn’t find the issue? Time to pull out the ultimate tool: tcpdump. tcpdump allows you to track the full course of your packets and see how they change at each step. There are many great online tutorials that explain how to use it, but for the most basic use, try running ‘tcpdump -i <name_of_the_device>’.

Issue #2:  my VM can’t reach external network

To handle this issue, we need to understand how the L3 agent works. Its main responsibility is to provide L3 connectivity (routing). It also provides NAT and uses namespaces for network isolation. Normally it is installed on the network node. It’s also the agent that provides access to the external network.

Let’s follow the packet’s course when a VM is trying to reach the external network:

[Figure: l3_external_flow]

Now let’s do the same, but this time from the external network to the VM and with linux bridge:

[Figure: l3_external_linux_bridge_flow]

Debugging Steps

First, make sure you have the right security group rules configured in your environment. You need to allow ssh and ping explicitly.

Check if you can ping the private IP.  If not, don’t expect the floating IP to work.  Check also if the vm can reach the router, because it will not be able to reach external network unless it can get to the router first.

From the router namespace (on the network node), try to ping the VM using the floating IP.
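A minimal sketch of that ping is shown below. It assumes the router namespace follows the usual qrouter-<router id> naming, and the helper name ping_from_router is made up.

```shell
# Hypothetical helper: ping an address from inside a router namespace.
# Run as root on the network node.
ping_from_router() {
    # $1 is the router ID (the namespace is assumed to be qrouter-<router id>),
    # $2 is the address to ping, for example the floating IP.
    ip netns exec "qrouter-$1" ping -c 3 "$2"
}

# Usage: ping_from_router <router_id> <floating_ip>
```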

That may be a silly check, since the floating IP basically lives in the router namespace, but at least it will give you an idea of how bad the situation is.

You may spot an issue with the bridge configuration. Check it with ‘ovs-vsctl show’ (or ‘brctl show’ if you are using linux bridge).

Don’t forget to check the L3 agent log for errors (usually found under /var/log/neutron/ on the network node).

Use the console to access the instance and check if it got an IP address.

From the console, also try to ping the default gateway and see if you can reach it.

Issue #3:  my VM can’t reach Metadata server

The metadata server is the service that serves the metadata for the VM. The data can be ssh keys, IP addresses, hostname and so on.

The metadata agent is responsible for proxying the requests from the VMs to the metadata server, which is part of nova. There are two ways to configure it:

  1. Routed networks – when you have a network that is connected to a router
  2. Non-routed networks – when you have a network that is not connected to a router, so it’s isolated

Let’s see the routed networks work-flow in more detail:

[Figure: routed_networks_metadata]

Note: the metadata proxy is spawned by the L3 agent and it listens for requests. When a request from the VM reaches the metadata proxy, it adds some information to the header (the IP of the VM and the router id) and forwards it to the metadata agent.

Now let’s look more closely at the other configuration, isolated networks:

[Figure: isolated_network_metadata]

Note: for this to work, you have to set the enable_isolated_metadata flag to True in the DHCP agent configuration file.
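To see what the flag is currently set to, something like the following sketch can be used. The helper name is made up, and /etc/neutron/dhcp_agent.ini is assumed to be the location of the DHCP agent configuration file on your node.

```shell
# The DHCP agent option that enables metadata on isolated networks is:
#   enable_isolated_metadata = True
# It normally lives in the [DEFAULT] section of dhcp_agent.ini.

# Hypothetical helper: print the current (uncommented) value of the option.
isolated_metadata_setting() {
    # $1 is the config file; /etc/neutron/dhcp_agent.ini is the usual location.
    grep -E '^[[:space:]]*enable_isolated_metadata[[:space:]]*=' \
        "${1:-/etc/neutron/dhcp_agent.ini}"
}
```

No output means the option is not set explicitly. Remember to restart the DHCP agent after changing the file.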

We also used option 121 in the above work-flow to inject a route to the VM when it’s requesting an IP address from the DHCP server. So the metadata proxy is the next hop to reach the metadata server.

Debugging Steps

First, check if the metadata agent is up with ‘neutron agent-list’.

You should see a smiley in the metadata agent line, under the ‘alive’ column, like this: :-)

Next, check if the metadata proxy is up. Remember, it gets spawned by the L3 agent in the router (or DHCP) namespace, so you should check that it shows up in the process table.
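One way to look for the proxy is sketched below. The exact process name differs between releases, so this sketch only assumes that the command line contains the word ‘metadata’ together with the router or network ID. The helper name metadata_proxy_for is made up.

```shell
# Hypothetical helper: look for the metadata proxy of a router or network.
metadata_proxy_for() {
    # $1 is the router ID (routed networks) or network ID (isolated networks).
    # Assumption: the proxy's command line contains the word 'metadata' and
    # the ID. The [m] trick keeps grep from matching its own process.
    ps -ef | grep '[m]etadata' | grep -F -- "$1"
}

# Usage: metadata_proxy_for <router_id>
```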

Issues should be reflected in the metadata agent log, so check it for errors.

Check if you can reach the metadata server from the router/DHCP namespace, for example by running curl through ‘ip netns exec’.

Check if the image you are using supports option 121. If not, your VM will not be able to get the route and reach the metadata server.

Tried everything and still couldn’t find the issue? tcpdump to the rescue.

Issue #4:  VIF plugging timeout

In order to understand why we are getting a plugging timeout, we need to introduce the L2 agent.

The L2 agent runs on the hypervisor (compute host). Its main responsibility is to configure the local switches on the node and wire new devices. It communicates with neutron-server over RPC. It is also in charge of applying security group rules, which are implemented using iptables and ipsets.

Let’s see in more detail how VIF plugging is done:

[Figure: vif_plugging]

When Nova sends the allocate_network request, it sets a timeout of 5 minutes. If Nova doesn’t get a reply from Neutron within 5 minutes, you will get a VIF plugging timeout.

Debugging Steps

Check the logs. The L2 agent, neutron and nova logs can help you identify the problem, so look at them both on the compute hosts and on the controller.
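To get a quick overview of which logs deserve a closer look, you can count the error lines in each of them. This is only a sketch: the helper name count_errors is made up, and the paths in the usage comments are assumptions that differ between distributions.

```shell
# Hypothetical helper: count ERROR lines in each log file given as an argument.
count_errors() {
    for log in "$@"; do
        if [ -r "$log" ]; then
            # grep -c prints the number of matching lines (0 if there are none).
            printf '%s: %s\n' "$log" "$(grep -c 'ERROR' "$log")"
        else
            printf '%s: not readable\n' "$log"
        fi
    done
}

# Usage; the paths are assumptions and differ between distributions.
# On a compute host:
#   count_errors /var/log/neutron/openvswitch-agent.log /var/log/nova/nova-compute.log
# On the controller:
#   count_errors /var/log/neutron/server.log
```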

If your system is slow because it’s loaded or you are performing stress tests, you might want to adjust the server configuration in the /etc/nova/nova.conf file:

  • Try to increase vif_plugging_timeout in order to give more time for plugging the interface
  • Try to increase rpc_thread_pool_size & rpc_conn_pool_size to make the processing faster
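Before changing anything, it helps to see what these options are currently set to. A minimal sketch, with a made-up helper name and an illustrative value in the comment:

```shell
# Hypothetical helper: show the tuning options discussed above as currently set.
show_vif_tuning() {
    # $1 is the config file; defaults to /etc/nova/nova.conf.
    grep -E '^[[:space:]]*(vif_plugging_timeout|rpc_thread_pool_size|rpc_conn_pool_size)[[:space:]]*=' \
        "${1:-/etc/nova/nova.conf}"
}

# Example of a raised timeout in nova.conf (the value is only illustrative):
#   vif_plugging_timeout = 600
```

Options that are not printed are not set explicitly, so the defaults apply.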

Tools for better tomorrow

Let’s go over some useful tools for debugging neutron and networking issues.

ip a

ip addr (ip a is just a shortcut) is really useful for inspecting the devices in your machine/namespace. It allows you to get device names, see if the devices are up, and check IP addresses, MTU and a bunch of other network parameters.

route -n

It displays the routing table. With the routing table in front of you, you can tell which path your packets will take when they travel out to the big world. The same information is also available from the ip tool, with ‘ip route’.

iptables -L

See what firewall rules exist on the node. If your packets suddenly disappear or don’t get to their final destination, a deny rule might be the cause.

arp

See the ARP table on the machine. Using it, you will know if your node is not able to find the addresses of other nodes.

tcpdump

I have mentioned it several times in this post. It’s a great packet sniffing tool, easy to install and use. I’m going to cover it in a different post, since there are many ways to use it and it deserves dedicated time. For the most basic use, simply run ‘tcpdump -i <name_of_the_device>’.
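Since the first issue in this post was about a VM not getting an IP, here is a slightly narrower sketch that shows only DHCP traffic on a device. The helper name watch_dhcp is made up.

```shell
# Hypothetical helper: watch DHCP traffic on one device. Run as root.
watch_dhcp() {
    # $1 is the device name, for example a tap device or a bridge.
    # -n skips name resolution, -e prints the link-level (MAC) headers,
    # and the filter keeps only DHCP traffic (UDP ports 67 and 68).
    tcpdump -n -e -i "$1" udp port 67 or udp port 68
}

# Usage: watch_dhcp <name_of_the_device>
```

Running it on each device along the path shows you where the DHCP request or reply gets lost.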

ip netns

Used for working with namespaces. You can’t solve most of the problems without it. In order to list the namespaces available on your node, use ‘ip netns list’.

You can run every command I mentioned previously inside a namespace, using ‘ip netns exec’. For example, to display the routing table in a namespace, use ‘ip netns exec <name_of_the_namespace> route -n’.

Open vSwitch

If you are using openvswitch in your deployment, you have several tools for debugging and troubleshooting:

  • ovs-vsctl show – shows the configuration of the bridges on the machine
  • ovs-ofctl show <bridge name> – shows the OpenFlow ports and features of a bridge
  • ovs-ofctl dump-flows <bridge name> – dumps all the flows installed on a bridge
  • ovs-ofctl dump-flows br-tun – dumps all the flows on br-tun
  • ovs-ofctl dump-flows br-tun table=21 – dumps all the flows on br-tun in a specific table

LinuxBridge

For linux bridge, use the following:

  • brctl show – shows the configuration of the bridges on the machine
  • brctl show <bridge name> – shows the configuration for specific bridge

Update

I didn’t cover several important networking devices you will probably want to be familiar with.

Let’s start with the TAP device. A TAP device is a virtual network interface. It is how your hypervisor (KVM, Xen, etc.) connects the virtual instance to the network: traffic that reaches the TAP device is received by the instance. A good thing to remember is that the TAP device is usually the starting point, so you would want to start following the traffic there.

To see the TAP devices on the compute host that runs your instance, simply run ‘ip a’.

Usually the device is called tap<port_prefix>, so you can also filter the output for that name.
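The filtering can be sketched like this. It assumes the usual naming, where the device is called ‘tap’ followed by the first 11 characters of the neutron port ID. The helper name tap_for_port is made up.

```shell
# Hypothetical helper: find the tap device that belongs to a Neutron port.
tap_for_port() {
    # Assumption: the device is named 'tap' followed by the first
    # 11 characters of the port ID given in $1.
    prefix=$(printf '%s' "$1" | cut -c1-11)
    # -o prints one line per device, which makes the output easy to grep.
    ip -o link show | grep -F -- "tap$prefix"
}

# Usage: tap_for_port <port_id>
```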

Detailed information on TAP devices can be found HERE.

The TAP device is bridged using a Linux bridge. Usually the Linux bridge name starts with qbr, which is short for quantum bridge (quantum is the former name of neutron). You can list the linux bridges on your system with ‘brctl show’.

You should see the TAP interface and the qvb interface in the output.

qvb (Quantum veth bridge) and its other end, qvo (Quantum veth openvswitch), form a virtual Ethernet pair (veth). It is used to connect the Linux bridge and the OVS bridge. Anything that comes in on one device leaves from the other end. You can think of it as a tube.

If you list the ports on the integration bridge, you will see that one of the ports is the qvo, which connects you to the Linux bridge.
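Listing only the qvo ports on the integration bridge can be sketched as follows. The helper name qvo_ports is made up, and br-int is used as the default bridge name.

```shell
# Hypothetical helper: list the qvo ports plugged into the integration bridge.
qvo_ports() {
    # $1 is the bridge name; br-int is the usual integration bridge.
    ovs-vsctl list-ports "${1:-br-int}" | grep '^qvo'
}
```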

Router and DHCP devices are connected directly to br-int. You can see the TAP device entry when listing the interfaces in the DHCP namespace or when listing the ports on the integration bridge.

12 Comments

  1. Very nicely written and informative. Thank you.

  2. Great job, it’s the simplest and clearest guide that I have found so far.

  3. Hi, This is very informative and gives a clear picture of the flow of packet in neutron.
    I have a question: When I do “brctl show” on my linux host machine where openstack allinone setup is installed, the output is blank. And I checked it on other machine with same openstack installation and there I was able to see qbr,qvb,tap.
    I have done the same installation steps on both machines but on one machine I am getting blank output for brctl show command?
    The impact of this misconfiguration is that on that openstack installation I am not able to ping the outside world from the VM or from the router namespace. My VM is able to ping the router IPs (gateway IP and internal network IP) but the packet is lost after it reaches the router gateway IP.
    Could you please help me on this?

    • Hi Sach,

      I think you might be using openvswitch as an l2 backend and not linuxbridge, which is why you might not be seeing any bridges in ‘brctl show’. Try running ‘ovs-vsctl show’ and see if the bridges show up this way.

      Another way to see which l2 agent you use is to run ‘ps -ef | grep neutron’.

      If you’re using ovs this might explain the differences you see and should get you going. Let us know if we can help further 🙂

      John.

    • For the first one you would be checking brctl show on networking node and the other machine could be a compute node.. Check it out !!

  4. Hi,

    Great job !!!!

    The issue I’m facing is that I’m not able to resolve any host address from my instances .

    I’m able to ping in and out of my instances.

    Only DNS isn’t working, despite rules that include port 53 ingress and egress for the TCP and UDP protocols.

    Any help is welcome.

    Thx.

    Regards,

    J.P.

    • bregman

      July 11, 2016 at 9:51 pm

      Can you ping the DNS server from the VM?
      Do you have the right entries in /etc/resolv.conf?

  5. Hi,

    By reinstalling all stuff ( Oses and OpenStack), problem disappeared !!

    Thx for help.

    Regards

    J.P.

  6. Hello, thank you for sharing such an impressive article. It is very informative and clearly introduces how to troubleshoot the components of Neutron.
    Looks like the links of the images are invalid, could you please fix this problem?

    Best Regards

    Yi

  7. I am in the beginning stages of doing an allinone packstack setup. After I configured br-ex to have IP addresses of my external network, I can no longer ping my gateway (192.168.0.1).

    below are the contents of my ifcfg-ens3

    DEVICE=ens3
    TYPE=OVSPort
    BOOTPROTO=none
    OVS_BRIDGE=br-ex
    ONBOOT=yes
    DEVICETYPE=ovs

    below are the contents of ifcfg-br-ex

    TYPE=OVSBridge
    DEVICE=br-ex
    BOOTPROTO=static
    DEVICETYPE=ovs
    ONBOOT=yes
    IPADDR=192.168.0.100
    NETMASK=255.255.255.0
    GATEWAY=192.168.0.1

    I have seen multiple people having the same issue but none have got any response though.

    • bregman

      August 17, 2016 at 2:39 pm

      Why are you configuring ‘br-ex’ by yourself? Packstack should do it for you.
      What command did you use?

© 2017 Arie Bregman