Scenario / Questions

I use Amazon EC2 for my mobile app. Depending on the application’s load at a given time, I might spawn new instances and then take them down when load is lower to save costs.

How does one keep Nagios configuration up to date in such a dynamic environment? With managed hardware, configuration files are predictable. Here, the configuration files for Nagios, Capistrano, and a number of other tools would need to be updated whenever an instance is added or removed. Capistrano needs to know which app servers to deploy a new build to. Nagios needs to know when to stop monitoring an existing instance or start monitoring a new one. Nagios also needs to know whether a node was taken down intentionally or is down due to an error.

How is this done in the wonderful world of VPS/dynamic instances?

Below are the suggested solutions for the questions above.

Suggestion: 1

We use a configuration management tool (Chef in our case) which writes out Nagios configuration from the node information.

Suggestion: 2

I wrote my own small set of PHP scripts that write Nagios configuration to files. Nagios is easy because its configuration is just text files, so all you need to do is create a template for each type of server. Then, when a server starts, add a file generated from the template. The only data that changes in the file is the host IP and name.
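
The template approach can be sketched roughly as follows. This is an illustration in Python rather than the PHP scripts described above, and the template text, the `generic-host` base template, and the function names are all assumptions made for the example.

```python
from pathlib import Path
from string import Template

# Hypothetical template for one server type. "generic-host" is assumed to be
# a host template that already exists in your Nagios configuration.
HOST_TEMPLATE = Template(
    "define host {\n"
    "    use        $base_template\n"
    "    host_name  $host_name\n"
    "    address    $address\n"
    "}\n"
)


def render_host(host_name, address, base_template="generic-host"):
    """Fill in the only values that change per server: name and IP."""
    return HOST_TEMPLATE.substitute(
        base_template=base_template, host_name=host_name, address=address
    )


def write_host_file(conf_dir, host_name, address, base_template="generic-host"):
    """Write one .cfg file per host and return its path."""
    path = Path(conf_dir) / f"{host_name}.cfg"
    path.write_text(render_host(host_name, address, base_template))
    return path
```

Nagios only picks up new or changed files after its configuration is reloaded, so a real script would also trigger a reload after writing.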

For more static servers, I created a script that runs ec2-describe-instances and creates a file for each instance returned. Each instance is marked with tag:Purpose=XXXX so I know which template to apply.
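
As a rough sketch of the tag-to-template step, the function below picks out running instances and their Purpose tag. It assumes the instance list has already been fetched and parsed into the JSON shape that the current AWS CLI and SDKs return for describe-instances (the older ec2-describe-instances tool prints a different format), and the template names in the mapping are hypothetical.

```python
# Hypothetical mapping from the Purpose tag to a Nagios host template name.
TEMPLATE_FOR_PURPOSE = {"web": "web-server", "db": "db-server"}


def hosts_by_purpose(describe_response):
    """Return (instance_id, private_ip, purpose) for running, tagged instances."""
    hosts = []
    for reservation in describe_response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            # Skip instances that are stopped, terminated, and so on.
            if instance.get("State", {}).get("Name") != "running":
                continue
            tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
            purpose = tags.get("Purpose")
            if purpose is None:
                continue  # untagged instances get no generated file
            hosts.append(
                (instance["InstanceId"], instance.get("PrivateIpAddress"), purpose)
            )
    return hosts


def template_for(purpose, default="generic-host"):
    """Pick the template for a Purpose value, falling back to a default."""
    return TEMPLATE_FOR_PURPOSE.get(purpose, default)
```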

For our auto-scaling groups, we set up a notification using the as-put-notification-configuration command, which sends a message to an SQS queue. The PHP script is run from cron. Each time it runs, it checks the queue for new servers and creates a new file for each one it finds. Removed servers are handled the same way. It is probably easier to use Chef or something similar if you’re already using it, but if you’re not, you can write a simple PHP service like mine in a few days.
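
The queue-handling step might look something like the sketch below, again in Python rather than PHP. It covers only the processing of one message body that has already been received from the queue. The message shape is an assumption: an SNS envelope whose "Message" field is a JSON string carrying "Event" and "EC2InstanceId". The `render` callable, which turns an instance ID into host configuration text, is also hypothetical.

```python
import json
from pathlib import Path

LAUNCH = "autoscaling:EC2_INSTANCE_LAUNCH"
TERMINATE = "autoscaling:EC2_INSTANCE_TERMINATE"


def apply_notification(body, conf_dir, render):
    """Apply one queue message body; return 'added', 'removed', or 'ignored'."""
    try:
        envelope = json.loads(body)
        inner = envelope.get("Message")
        # Assumed shape: SNS wraps the notification as a JSON string.
        event = json.loads(inner) if isinstance(inner, str) else envelope
    except (ValueError, AttributeError):
        return "ignored"  # not JSON, or not a JSON object

    if not isinstance(event, dict):
        return "ignored"  # the inner message decoded to a non-object value

    instance_id = event.get("EC2InstanceId")
    if not instance_id:
        return "ignored"

    path = Path(conf_dir) / f"{instance_id}.cfg"
    if event.get("Event") == LAUNCH:
        path.write_text(render(instance_id))
        return "added"
    if event.get("Event") == TERMINATE:
        if path.exists():
            path.unlink()
        return "removed"
    return "ignored"
```

A cron-driven script would call this once per received message, delete the message from the queue afterwards, and reload Nagios if anything was added or removed.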

Suggestion: 3

We use Opsview, which is yet another Nagios + database + REST API wrapper. I don’t know if this is the best solution for everyone (or even for us), but it lets us configure the Nagios server dynamically through a simple REST API: the node (or another administrative node) registers itself when it comes up and removes itself from the configuration when it’s done. I define Host Templates as part of the Opsview (Nagios) server’s Puppet manifest, and the monitored hosts register with the server and join the right Host Template as part of their own Puppet manifests.

A more “generic” approach, which should work with pretty much anything, even the original Nagios and its static files, is Puppet Stored Configuration. This lets you script the configuration of any tool you want, however you like, based on the information Puppet collects from its manifests.

I’d suggest that for forensic purposes you shouldn’t delete the node’s configuration altogether when it’s taken down but try to archive it and the monitoring information collected about it while it was up.
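
One way to archive rather than delete is to move the host’s file into a timestamped archive location, as in this minimal sketch. The directory layout and file naming are assumptions, and the sketch covers only the configuration file, not the collected monitoring data.

```python
import shutil
import time
from pathlib import Path


def archive_host_file(cfg_path, archive_dir, now=None):
    """Move a host's .cfg file into archive_dir under a UTC-timestamped name."""
    cfg_path = Path(cfg_path)
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    # time.gmtime(None) uses the current time; pass `now` for a fixed stamp.
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime(now))
    target = archive_dir / f"{cfg_path.stem}.{stamp}{cfg_path.suffix}"
    shutil.move(str(cfg_path), str(target))
    return target
```

The archive directory should sit outside any directory Nagios loads configuration from, otherwise the archived host would still be monitored.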

Suggestion: 4

There are a few ways:

  • Use preconfigured Amazon EC2 templates.

  • Use Puppet manifests with parameterized templates.

  • Set up a VPN between your Nagios network and your Amazon VMs. All of your Amazon VMs will then have static IPs, and you can even set up DNS for them. We have a Nagios server monitoring all of our Amazon instances this way. We don’t even need an Elastic IP. We use OpenVPN for our VPN.

  • Build a Nagios setup that listens for external commands and updates its configuration accordingly.
    Machines can then register, unregister, suspend, and resume themselves on the Nagios server.
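
For the external-command option, Nagios reads lines of the form `[timestamp] COMMAND;arg1;arg2` from its external command file. The sketch below covers only suspend and resume, using the standard DISABLE_HOST_CHECK / ENABLE_HOST_CHECK and DISABLE_HOST_NOTIFICATIONS / ENABLE_HOST_NOTIFICATIONS commands. The command file path is installation-specific and must be supplied by the caller, and the function names are made up for the example. Registering or unregistering a host still requires writing or removing its configuration and reloading Nagios.

```python
import time


def format_command(name, *args, now=None):
    """Build one external-command line: [epoch] NAME;arg1;arg2"""
    ts = int(time.time() if now is None else now)
    return f"[{ts}] " + ";".join([name, *map(str, args)]) + "\n"


def send_command(command_file, name, *args, now=None):
    """Append one command line to the Nagios command file (usually a named pipe)."""
    with open(command_file, "a") as fh:
        fh.write(format_command(name, *args, now=now))


def suspend_host(command_file, host_name, now=None):
    """Stop checks and notifications for a host taken down on purpose."""
    send_command(command_file, "DISABLE_HOST_CHECK", host_name, now=now)
    send_command(command_file, "DISABLE_HOST_NOTIFICATIONS", host_name, now=now)


def resume_host(command_file, host_name, now=None):
    """Re-enable checks and notifications when the host comes back."""
    send_command(command_file, "ENABLE_HOST_CHECK", host_name, now=now)
    send_command(command_file, "ENABLE_HOST_NOTIFICATIONS", host_name, now=now)
```

Suspending a host before shutting it down is one way to let Nagios tell an intentional shutdown apart from a failure.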

Suggestion: 5

I don’t have a silver bullet for solving this problem with Nagios. For Capistrano, though, there’s capify-ec2, an extension for Capistrano that resolves server role lists using Amazon’s tagging capabilities.