How to Avoid Graph Loops in Puppet So That You Can Avoid Graph Loops in Puppet

We recently ran into a slightly tricky problem with our Puppet configuration (a loop in the directed acyclic graph in the catalog), and I’d like to talk a little bit about what the problem was and how we fixed it.

Quick Overview: Puppet and Graph Theory

Graph Theory?!

When Puppet compiles a manifest file (or files), it builds a data structure that relates each resource to every other resource it either depends on or is needed by. This kind of data structure is called a “directed acyclic graph” (DAG). Puppet uses this structure to decide where to begin applying changes and in what order to apply them; this is how it can explicitly order even complex dependency trees (install a package, edit these three files, then create this directory, and once all of those things are done, start this service).
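To make the ordering idea concrete, here is a minimal Python sketch of how a dependency graph yields an apply order. It is not Puppet's implementation, just a topological sort using the standard library's `graphlib` module (Python 3.9 or later is assumed). The resource names are made up to mirror the package/files/directory/service example above.

```python
from graphlib import TopologicalSorter

# Map each resource to the resources that must be applied before it.
# All resource names here are hypothetical.
depends_on = {
    "File[/etc/app/a.conf]": {"Package[app]"},
    "File[/etc/app/b.conf]": {"Package[app]"},
    "File[/etc/app/c.conf]": {"Package[app]"},
    "File[/var/lib/app]": {
        "File[/etc/app/a.conf]",
        "File[/etc/app/b.conf]",
        "File[/etc/app/c.conf]",
    },
    "Service[app]": {"File[/var/lib/app]"},
}

# static_order() yields every resource after all of its prerequisites.
order = list(TopologicalSorter(depends_on).static_order())

for step, resource in enumerate(order, start=1):
    print(step, resource)
```

The three config files can come out in any relative order, since nothing relates them to each other; only the package-first, directory-after-files, service-last constraints are guaranteed.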

The “Acyclic” Part of “Directed Acyclic Graph”

One of the primary reasons Puppet goes to all this trouble is to make sure that there are no loops (or “cycles”) in the DAG. A loop is created when a resource both depends on and is depended on by some other resource (no matter how indirectly), and it means Puppet has no idea where to start applying changes. If resource A depends on resource B, which in turn depends on resource A, then neither ordering is correct (applying either A or B first breaks one of the two dependencies), and in those cases Puppet will throw up its hands and refuse to take any action at all.

Graph Loops and You

For a simple example of what happens when Puppet encounters a graph loop, the following manifest:

    file { '/dir1':
      ensure => directory
    }
    file { '/dir1/file1':
      ensure  => file,
      require => File['/dir1']
    }

creates the following data structure:

    File[/dir1] -> File[/dir1/file1]

In this example, Puppet will make sure to create /dir1 as a directory before it creates /dir1/file1 inside of it.

If we added the following resource:

    file { '/file2':
      ensure  => file,
      notify  => File['/dir1'],
      require => File['/dir1/file1']
    }

a loop would be created like the following (the ~> arrow is the notify relationship, which orders /file2 before /dir1 just as -> would):

    File[/dir1] -> File[/dir1/file1] -> File[/file2] ~> File[/dir1]

This causes Puppet’s hair to catch on fire, and it immediately complains:

    $ puppet apply --noop test.pp
    Error: Could not apply complete catalog: Found 1 dependency cycle:
    (File[/dir1/file1] => File[/file2] => File[/dir1] => File[/dir1/file1])
    Try the --graph option and opening the resulting .dot file in OmniGraffle or GraphViz
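The same failure can be reproduced outside of Puppet. The sketch below is an illustration only, not how Puppet itself detects loops: it feeds the three relationships from this example into Python's `graphlib` (3.9 or later assumed) and reports the loop that comes back. The helper name `find_cycle` is my own.

```python
from graphlib import CycleError, TopologicalSorter

def find_cycle(depends_on):
    """Return one loop as a list of nodes, or None if the graph is acyclic."""
    try:
        TopologicalSorter(depends_on).prepare()
    except CycleError as err:
        # The second exception argument holds the loop; its first and
        # last entries are the same node.
        return err.args[1]
    return None

# Each key lists the resources that must come before it.
without_file2 = {
    "File[/dir1/file1]": {"File[/dir1]"},
}
with_file2 = {
    "File[/dir1/file1]": {"File[/dir1]"},
    "File[/file2]": {"File[/dir1/file1]"},   # require => File['/dir1/file1']
    "File[/dir1]": {"File[/file2]"},         # notify  => File['/dir1']
}

print(find_cycle(without_file2))
print(find_cycle(with_file2))
```

Which node the reported loop starts from is an implementation detail, so it may not line up exactly with the order in Puppet's message.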

Be Vewwy Vewwy Quiet, We’re Hunting Graph Loops

In the simple example above, it’s pretty easy to see where the graph loop is and how to resolve it. However, once you start adding custom resources, virtual resources, classes included based on Facter facts, etc., it can get tricky real fast. For instance, here’s a real error message we ran into recently after a change:

    err: Could not apply complete catalog: Found dependency cycles in the following relationships: File[/var/svc/nsca-cluster/log/run] => Service[nsca-cluster], File[/var/svc/nsca-cluster/run] => Service[nsca-cluster], File[/var/log/qc/nagios-cluster] => Service[nsca-cluster], File[/var/svc/nsca-cluster/log] => File[/var/svc/nsca-cluster/log/run], File[/var/svc/nsca-cluster/log/log.run] => File[/var/svc/nsca-cluster/log/run], File[/var/log/qc/nagios-cluster] => File[/var/svc/nsca-cluster/log/run], File[/var/svc/nsca-cluster/log] => File[/var/svc/nsca-cluster/log/log.run], File[/var/log/qc/nagios-cluster] => File[/var/svc/nsca-cluster/log/log.run], File[/var/svc/nsca-cluster/env] => File[/var/svc/nsca-cluster/env/RUNCONFIG], File[/var/log/qc/nagios-cluster] => File[/var/svc/nsca-cluster/svc.run], File[/var/svc/nsca-cluster] => File[/var/svc/nsca-cluster/svc.run], File[/var/log/qc] => File[/var/log/qc/nsca-cluster], File[/var/log/qc/nagios-cluster] => File[/var/log/qc/nsca-cluster], File[/var/log/qc/nagios-cluster] => File[/var/svc/nsca-cluster/env], File[/var/svc/nsca-cluster] => File[/var/svc/nsca-cluster/env], File[/var/svc/nsca-cluster/env] => File[/var/svc/nsca-cluster/env/USER], File[/var/log/qc/nagios-cluster] => File[/var/svc/nsca-cluster/env/USER], File[/var/log/qc/nagios-cluster] => File[/var/log/qc], File[/var/log/qc/nagios-cluster] => File[/var/svc/nsca-cluster/log], File[/var/svc/nsca-cluster] => File[/var/svc/nsca-cluster/log], File[/var/log/qc] => File[/var/log/qc/nagios-cluster], File[/var/log/qc/nagios-cluster] => File[/var/svc/nsca-cluster], File[/var/svc/nsca-cluster/env] => File[/var/svc/nsca-cluster/env/LOGDIR], File[/var/log/qc/nagios-cluster] => File[/var/svc/nsca-cluster/env/LOGDIR], File[/var/svc/nsca-cluster/svc.run] => File[/var/svc/nsca-cluster/run], File[/var/log/qc/nagios-cluster] => File[/var/svc/nsca-cluster/run], File[/var/svc/nsca-cluster] => File[/var/svc/nsca-cluster/run]; try using the --graph option and open the .dot files in 
    OmniGraffle or GraphViz

(It’s probably obvious that this configuration deals with our Nagios infrastructure.)

Looking through this log message, it definitely wasn’t immediately obvious to me where the problem was. It seemed like all of the services both depended on and were required by their log directories, but a close inspection of the manifests where these resources are defined yielded no fruit.
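In hindsight, some of the squinting could have been handed to a few lines of code. The following is a rough sketch, assuming the message lists relationships as comma-separated `Type[title] => Type[title]` pairs the way the one above does. It pulls the pairs out with a regular expression and flags any two resources that each claim to come before the other. The function names are my own, and the input is only a handful of relationships excerpted from the message above. It would not catch loops that pass through three or more resources.

```python
import re

# Matches one "Type[title] => Type[title]" pair. Assumes titles contain no "]".
EDGE = re.compile(r"(\w+\[[^\]]+\])\s*=>\s*(\w+\[[^\]]+\])")

def parse_edges(message):
    """Return the set of (before, after) pairs found in an error message."""
    return set(EDGE.findall(message))

def mutual_dependencies(edges):
    """Return pairs of resources that each appear before the other."""
    return sorted((a, b) for a, b in edges if a < b and (b, a) in edges)

# A few relationships excerpted from the error message above.
excerpt = (
    "File[/var/log/qc] => File[/var/log/qc/nsca-cluster], "
    "File[/var/log/qc/nagios-cluster] => File[/var/log/qc/nsca-cluster], "
    "File[/var/log/qc/nagios-cluster] => File[/var/log/qc], "
    "File[/var/log/qc] => File[/var/log/qc/nagios-cluster], "
    "File[/var/log/qc/nagios-cluster] => File[/var/svc/nsca-cluster]"
)

pairs = mutual_dependencies(parse_edges(excerpt))
for a, b in pairs:
    print(a, "<=>", b)
```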

The --graph Option, You Say?

Puppet supports creating GraphViz-format files that detail the DAG it builds while compiling the manifests. These files can be rendered in a variety of graphing tools to get a visual picture of the relationships between various resources. For example, here’s a rendering of this very configuration:

[Image: GraphViz rendering of the full DAG for this configuration]

It turns out that for complicated configurations, looking at a rendering of the DAG is not necessarily helpful.

Actual Analysis (as opposed to squinting)

While researching this problem, I found a blog post that addressed it specifically. It includes a simple Python script that will analyze a GraphViz file and display just the graph loops to make them easier to find.

Running this script against our configuration gave the following output:

    $ ./dot_find_cycles.py expanded_relationships.dot
    [File[/var/log/qc/nagios-cluster], File[/var/log/qc], File[/var/log/qc/nagios-cluster]]

This showed me immediately that the problem was with the resource that sets up log directories; somehow, the creation of the root logdir had become dependent on the creation of one of its children.
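I won’t reproduce that script here, but the idea behind it is simple enough to sketch. The version below is my own approximation, not the script from the blog post: it assumes each edge in the .dot file is written as `"source" -> "target"` with quoted node names (chained edges and unquoted names are not handled), and it uses a recursive depth-first search, so an extremely deep graph could hit Python's recursion limit.

```python
import re
from collections import defaultdict

# Assumed edge syntax: "source" -> "target"
EDGE = re.compile(r'"([^"]+)"\s*->\s*"([^"]+)"')

def read_edges(dot_text):
    """Build an adjacency list from the edges found in DOT text."""
    graph = defaultdict(list)
    for src, dst in EDGE.findall(dot_text):
        graph[src].append(dst)
    return graph

def find_cycles(graph):
    """Return one loop for each back edge met during a depth-first search."""
    cycles = []
    done = set()       # nodes whose descendants are fully explored
    path = []          # nodes on the current search path, in order
    on_path = set()

    def visit(node):
        path.append(node)
        on_path.add(node)
        for nxt in graph.get(node, ()):
            if nxt in on_path:
                # Walking back to a node on the current path closes a loop.
                cycles.append(path[path.index(nxt):] + [nxt])
            elif nxt not in done:
                visit(nxt)
        on_path.discard(node)
        path.pop()
        done.add(node)

    for node in list(graph):
        if node not in done:
            visit(node)
    return cycles

# A tiny hand-written sample, not the real expanded_relationships.dot.
sample = '''
digraph sample {
    "File[/var/log/qc]" -> "File[/var/log/qc/nagios-cluster]"
    "File[/var/log/qc/nagios-cluster]" -> "File[/var/log/qc]"
    "File[/var/log/qc/nagios-cluster]" -> "Service[nsca-cluster]"
}
'''

cycles = find_cycles(read_edges(sample))
for cycle in cycles:
    print(" -> ".join(cycle))
```

This reports a loop per back edge rather than every elementary cycle in the graph, which is usually enough to point at the offending resources.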

Armed with that information, I was able to quickly track down a recent change to our custom service setup resource, where I’d added a subscribe metaparameter that eventually wound up subscribing the resource to itself, creating a loop. I reverted that change and everything was happy once more.

Lessons Learned

In this specific case, of course, I learned that this custom resource didn’t work exactly as I thought it did. However, I also learned some valuable troubleshooting techniques for the next time we run into this problem. I’m going to try to figure out a way to incorporate the loop-finding script into our continuous integration system; whenever a change is made to the Puppet configuration, one of the tests it runs should be to check for this sort of loop and to reject the change if one is found.
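I haven’t built that yet, but a CI step really only needs a yes-or-no answer and a nonzero exit code. Here is a rough sketch of what the check might look like, assuming the build has already produced a .dot file and that its edges look like `"source" -> "target"`. The function names are hypothetical. It repeatedly removes resources that have nothing left to wait on (Kahn's algorithm); anything that can never be removed is either in a loop or depends on one.

```python
import re
from collections import defaultdict

# Assumed edge syntax: "source" -> "target"
EDGE = re.compile(r'"([^"]+)"\s*->\s*"([^"]+)"')

def resources_stuck_in_loops(dot_text):
    """Return resources that can never be ordered; an empty list means no loops."""
    successors = defaultdict(list)
    waiting_on = defaultdict(int)    # unapplied prerequisites per resource
    for before, after in EDGE.findall(dot_text):
        successors[before].append(after)
        waiting_on[after] += 1
        waiting_on[before] += 0      # make sure every resource has an entry

    ready = [name for name, count in waiting_on.items() if count == 0]
    while ready:
        name = ready.pop()
        del waiting_on[name]
        for nxt in successors[name]:
            waiting_on[nxt] -= 1
            if waiting_on[nxt] == 0:
                ready.append(nxt)
    # Whatever is left is in a loop or downstream of one.
    return sorted(waiting_on)

def ci_exit_code(dot_text):
    """Print any stuck resources and return 1 if there are any, else 0."""
    stuck = resources_stuck_in_loops(dot_text)
    for name in stuck:
        print("cannot be ordered:", name)
    return 1 if stuck else 0
```

A CI job could read the generated .dot file, pass its contents to `ci_exit_code`, and hand the result to `sys.exit` so that a loop fails the build.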

Also, I learned exactly how long it takes my laptop to open a 19862×21525 pixel image (82 seconds).

Did this post make your heart race at the thought of spelunking through a maze of twisty little Puppet manifests (all alike)? Come work with us and see how it’s done!

Posted by Adam Compton, Platform Operations Engineer