Arie Bregman

Linux And Stuff

TripleO: Debugging Overcloud Deployment Failure

You run ‘openstack overcloud deploy’ and after a couple of minutes you find out it failed. If that’s not enough, you open the deployment log only to find a very (very!) long output that doesn’t give you a clue as to why the deployment failed. In the following sections we’ll see how we can get to the root of the problem.

Overcloud stack status

The overcloud is deployed using Heat templates. The result is a Heat stack that includes all the different resources of the overcloud (more stacks, networks, subnets, etc.).

If you are not familiar with Heat, I recommend reading a little about it before proceeding, since the rest of this post may be more confusing than helpful without some knowledge of Heat.

So our first step in finding out what happened to our overcloud deployment is to list the existing stacks.
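Here is a minimal sketch of this step. It assumes you are on the undercloud with its credentials already loaded (for example with ‘source ~/stackrc’), and the helper name list_stacks is only for illustration.

```shell
# Sketch: list the top-level Heat stacks known to the undercloud.
# Assumes undercloud credentials are already loaded in this shell.
# list_stacks is an illustrative helper name, not part of any tool.
list_stacks() {
    openstack stack list
}
```

Running list_stacks should print one row per top-level stack, including its name and status.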

You can also use the deprecated command heat stack-list.

We get a lot of useful information from the above command. First of all, we know what our overcloud stack is called: ‘overcloud’. We’ll use that name soon enough. We also know that the deployment was unsuccessful, because the stack status is “CREATE_FAILED”.

What we still don’t know is why the deployment failed. We can choose one of two ways to continue.

The short way

If you are lucky enough to deploy the latest available release (OSP 10+ / Newton+), then you can use the magical ‘openstack stack failures list’ command.
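As a rough sketch, the call could be wrapped like this. The helper name show_failures is made up for illustration, and the default of ‘overcloud’ is just the stack name used in this post.

```shell
# Sketch: show the failed resources of a stack and their error output.
# The stack name defaults to "overcloud"; pass another name as the first
# argument if your stack is named differently.
# show_failures is an illustrative helper name.
show_failures() {
    openstack stack failures list "${1:-overcloud}"
}
```

Call show_failures with no arguments for the default stack, or show_failures mystack for a stack with a different name (mystack is a placeholder).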

It will instantly give you all the information regarding the failure. The above output was very long, so I pasted only a portion of it.

You can see which resources in the stack failed and what exactly happened in the ‘deploy_stdout’ field.

Note that ‘overcloud’ in the command is the name of the stack, not a fixed word. You should change it if your stack is named differently.

Older releases

Don’t have the ‘openstack stack failures list’ command? Don’t worry, we’ll simply use the good old way.

List nested stacks

From openstack stack list it looks like there is only one stack for overcloud deployment, but actually there could be more than 50(!) stacks.

We are interested in every stack that failed during creation (deployment). To see all of the stacks, we’ll use the ‘--nested’ flag, and to keep only those with a “FAILED” status we’ll use grep.

We can see there are three failed stacks. One of them (the last) is our original top-level stack, which we already saw when using openstack stack list.
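A minimal sketch of that filter, assuming undercloud credentials are loaded. The helper name failed_stacks is only for illustration.

```shell
# Sketch: list all stacks, including nested ones, and keep only the rows
# whose status contains FAILED (CREATE_FAILED, UPDATE_FAILED, ...).
# failed_stacks is an illustrative helper name.
failed_stacks() {
    openstack stack list --nested | grep FAILED
}
```

The grep is deliberately loose: it matches any row containing FAILED, so a stack whose name happens to contain that word would also show up.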

The other two are nested stacks. So we still don’t know what caused the deployment to fail, but we are getting closer 🙂

List stack resources

Once we know which nested stack failed, we can proceed with checking which resources failed.
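This step can be sketched as follows. The helper name list_stack_resources is hypothetical, and the argument is whatever ID or name the nested stack has in your own output.

```shell
# Sketch: list the resources of one (nested) stack, given its ID or name.
# list_stack_resources is an illustrative helper name.
list_stack_resources() {
    openstack stack resource list "$1"
}
```

For the deployment in this post, the call would be list_stack_resources c738ba27-b6d7-475e-86c6-937d5bd4ac6c.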

We got three resources, all of which failed for some reason (don’t worry, we’ll find out why eventually). The valuable information for us here is the physical_resource_id. This is the last piece of information we need to find out why our deployment failed.

If you are wondering where ‘c738ba27-b6d7-475e-86c6-937d5bd4ac6c’ comes from, it’s from the previous command’s output. This is the ID of the nested stack. Why this one specifically? Because it’s the most deeply nested of the three stacks. You can tell by the name.

The holy grail (a.k.a why my deployment failed)

We’ll now use the physical_resource_id of the first resource (‘0’) from the previous section to find out what happened to our deployment.
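A minimal sketch of this lookup, assuming the physical_resource_id of the failed resource is the ID of a Heat software deployment. The helper name show_deployment is only for illustration.

```shell
# Sketch: show the details of one software deployment.
# $1 is the physical_resource_id taken from the resource list output.
# show_deployment is an illustrative helper name.
show_deployment() {
    openstack software deployment show "$1"
}
```

Pass it the physical_resource_id copied from your own resource list output.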

We can finally see what exactly the puppet modules executed and why the deployment failed in ‘output_values’ field:

If the output is too long for you, try grepping for “Error” with several lines of context before and after.

Also, don’t forget to use the ‘--long’ flag to see the full output. Without it, you might miss the actual cause of the failure.

To conclude, I’ll just say: use ‘openstack stack failures list’. Life is too short.
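Both tips can be combined in one sketch. The helper name deployment_errors and the choice of 10 context lines are my own assumptions, so adjust them to taste.

```shell
# Sketch: print the full deployment output (--long) and keep only the
# lines around each "Error" match.
# -B 10 and -A 10 ask grep for 10 lines before and after every match.
# deployment_errors is an illustrative helper name.
deployment_errors() {
    openstack software deployment show --long "$1" | grep -B 10 -A 10 Error
}
```

If nothing matches, grep prints nothing, in which case it is worth reading the full ‘--long’ output directly.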



© 2017 Arie Bregman
