A picture is worth a (few) thousand bytes

(Context alert: Know Chef. If you don’t, it’s seriously worth looking into for any level of infrastructure management.)

TL;DR: I wrote a Knife plugin to visualize Chef Role dependencies. It’s here.

Recently, I needed to sort out a large number of roles and their dependencies, in order to simplify the lives of everyone using them.

It wasn’t easy to determine that changing one would affect many others, since it had become common practice to embed roles within other roles’ run_list, resulting in a tree of cross-dependency hell.
A node’s run_list would typically contain a single role-specific item, embedding the lower-level dependencies.

A sample may look like this:

node[web1] => run_list = role[webserver] => run_list = role[base], recipe[apache2], ...
node[db1] =>  run_list = role[database]  => run_list = role[base], recipe[mongodb], ...

Many of these roles had a fair amount of code duplication, and most were setting the same base role, as well as any role-specific recipes. Others were referencing the same recipes, so figuring out what to refactor and where, without breaking everything else, was more than challenging.

The approach I wanted to implement was to have a very generalized base role, apply that to every instance, then add whatever specific roles a given node needs on top of it.

After refactoring, a node’s run_list would typically look like:

node[web1] => run_list = role[base], role[webserver]
node[db1] =>  run_list = role[base], role[database]

A bit simpler, right?

This removes the embedded dependency on role[base], since the assumption is that every node will have role[base] applied to it, unless I don’t want to for some reason (some development environment for instance).

Trying to refactor this was pretty tricky, so I wrote a visualizer to collect all the roles from a Chef repository’s role_path, parse them out, and create an image.

I’ve used Graphviz for a number of years now, and it’s pretty general-purpose when it comes to creating graphs of things (nodes), connecting them (edges), and rendering an output. So this was my go-to for this project.
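To make the idea concrete, here is a minimal sketch of turning run_lists into a Graphviz graph. It is not the plugin's actual code: the ROLES hash, parse_item and roles_to_dot are hypothetical names, and the role data is typed in by hand rather than read from a role_path.

```ruby
# Hypothetical role data: role name => run_list entries, as they would
# appear in a Chef role file. These names are illustrative only.
ROLES = {
  "base"      => ["recipe[ntp]"],
  "webserver" => ["role[base]", "recipe[apache2]"],
  "database"  => ["role[base]", "recipe[mongodb]"]
}.freeze

# Split a run_list item such as "role[base]" into its type and name.
# Returns nil for anything that doesn't match the type[name] form.
def parse_item(item)
  match = item.match(/\A(role|recipe)\[(.+)\]\z/)
  match && [match[1], match[2]]
end

# Build a Graphviz DOT document: one edge from each role to every
# role or recipe in its run_list.
def roles_to_dot(roles)
  lines = ["digraph roles {"]
  roles.each do |role_name, run_list|
    run_list.each do |item|
      type, name = parse_item(item)
      next unless type
      # Recipes are drawn as boxes so they stand out from roles.
      lines << %(  "recipe[#{name}]" [shape=box];) if type == "recipe"
      lines << %(  "role[#{role_name}]" -> "#{type}[#{name}]";)
    end
  end
  lines << "}"
  lines.join("\n")
end

puts roles_to_dot(ROLES)
```

Saving that output as roles.dot and running dot -Tpng roles.dot -o roles.png renders the picture; swapping dot for neato gives a different layout of the same graph.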

Selling you on the power of visualizing data is beyond the scope of this post (and probably the author), but suffice to say there’s industries built around putting data into visual format for a variety of reasons, such as relative comparison, trending, etc.
In fact some buddies of mine have built an awesome product that does just that – visualizes data and events over time. Check them out at Datadog. (I’ve written other stuff for their platform before, it’s totally awesome.)

In my case, I wanted the story told by the image to:

  1. Demonstrate the complexity of the connections between roles/recipes (aka spaghetti)
  2. Point out if I have any cyclic dependencies (it’s possible!)
  3. Let me focus on what to do next: untangle

Items 1 & 2 were pretty cool – my plugin spat out an increasingly complex graph, showing relationships that made sense for things to work, but also contained some items with 5-6 levels of inheritance that are easily muddled. I didn’t have any cyclic dependencies, so I created a sample one to see what it would look like. It looked like a circle.
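Spotting a cycle by eye works for a small graph, but it can also be checked in code. The following is a sketch of a depth-first search over a role graph, assuming the roles have already been reduced to a hash of role name to embedded role names. GRAPH, find_cycle and walk are hypothetical names and this is not how the plugin itself works.

```ruby
# Hypothetical role graph: role name => names of roles it embeds.
# "a" -> "b" -> "c" -> "a" is a deliberately planted cycle.
GRAPH = {
  "base"      => [],
  "webserver" => ["base"],
  "a"         => ["b"],
  "b"         => ["c"],
  "c"         => ["a"]
}.freeze

# Visit one role, following its embedded roles depth-first.
# state tracks each role as :visiting (on the current path) or :done.
def walk(graph, role, state, path)
  return nil if state[role] == :done
  # Meeting a role that is still on the current path means a cycle.
  return path.drop(path.index(role)) + [role] if state[role] == :visiting

  state[role] = :visiting
  path.push(role)
  graph.fetch(role, []).each do |dep|
    cycle = walk(graph, dep, state, path)
    return cycle if cycle
  end
  path.pop
  state[role] = :done
  nil
end

# Return the first cycle found as an array of role names (the first
# name repeated at the end), or nil when the graph has no cycle.
def find_cycle(graph)
  state = {}
  graph.each_key do |role|
    cycle = walk(graph, role, state, [])
    return cycle if cycle
  end
  nil
end

p find_cycle(GRAPH)
```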

Item 3 was harder, as this meant that human intervention needed to take place. It was almost like deciding on which area of a StarCraft map you want to go after first. There’s plenty of mining to do, but which will pay off fastest? (geeky references, are you surprised?)

I decided on some of the smaller clusterings, and made some progress, changing where certain role statements lived and the node <=> role assignment to refactor a lot out.

My process of writing a plugin developed pretty much like this:

  1. Have an idea of how I want to do this
  2. Write some code that when executed manually, does what I want
  3. Transform that code into a knife plugin, so it lives inside the Chef Ecosystem
  4. Package said plugin as RubyGem, to make distribution easy for others
  5. Test, test, test (more on this in a moment)
  6. Document (readme only for now)
  7. Add some features, rethink of how certain things are done, refactor.
  8. Test some more

Writing code, packaging and documentation are pretty standard practices (more or less), so I won’t go into those.

The more interesting part was figuring out how to plug into the Chef/Knife plugins architecture, and testing.

Thanks to Opscode, writing a plugin isn’t too hard, there’s a good wiki, and other plugins you can look at to get some ideas.

A couple of noteworthy items:

  1. Figuring out how to provide command-line arguments to OptionParser was not easy, since there was no real intuitive way to do it. I spent about 2 hours researching why that wasn’t doing what I wanted, and finally figured out that "--flag" and "--flag " behave completely differently.

  2. During my initial cut of the code, I used many statements to print output back to the user (puts "some message"). In the knife plugin world, one should use the ui.info or ui.error and the like, as this makes it much cleaner and consistent with other knife commands.
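On the first item: one common form of this distinction in Ruby's standard-library OptionParser is that a bare "--flag" declares a boolean switch, while "--flag VALUE" declares an option that requires an argument. Whether that is exactly the case I hit back then is an assumption, and the sketch below uses plain optparse rather than knife's own option handling. The option names and parse_args are hypothetical.

```ruby
require "optparse"

# Parse an argument list with two hypothetical options:
#   --verbose       a plain switch, takes no argument
#   --output FILE   an option that requires an argument
# Returns the options hash and whatever arguments were left over.
def parse_args(argv)
  options = { verbose: false, output: nil }
  parser = OptionParser.new do |opts|
    opts.on("--verbose", "Print extra detail") { |v| options[:verbose] = v }
    opts.on("--output FILE", "Write the graph to FILE") { |v| options[:output] = v }
  end
  rest = parser.parse(argv)
  [options, rest]
end

# The bare switch leaves "roles.png" alone as a positional argument...
p parse_args(["--verbose", "roles.png"])
# ...while the FILE form consumes it as the option's value.
p parse_args(["--output", "roles.png"])
```

Calling parse_args(["--output"]) with nothing after it raises OptionParser::MissingArgument, which is a handy way to notice which of the two forms was declared.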

Testing:

Since this is a command-line application plugin, it made sense to use a framework that can handle inputs and outputs, as that’s my primary concern.
With a background in systems administration and engineering, software testing has never been on the top of my to-learn list, so when the opportunity arose to write tests for another project I wrote, I turned to Cucumber, and the CLI extension Aruba.

Say what you will about unit tests vs integration tests vs functional tests – I got going relatively quickly writing tests in quasi-English.
I won’t say that it’s easy, but it definitely made me think about how the plugin will be used, how users may input commands differently, and what they can expect to happen when they run it.

Cucumber/Aruba also allowed me to split my tests in a way that I can grok, such as all the CLI-related commands, flags, options exist in one test ‘feature’ file, whereas another feature file contains all the tests of reading the roles and graphing them in different formats.

Writing tests early on allowed me to continue to capture how I thought the plugin would be used, write that down in English, and think about it for a while.
Some things changed after I had written them down, and even then, after I figured out the tests, I decided that the behavior didn’t match what I thought would be most common.

Refactoring the code, running tests in between to ensure that the behavior that I wanted remained consistent was very valuable. This isn’t news for any software engineers out there, but it might be useful to more system people to learn more about testing.

Another test I use is a style-checker called tailor – it measures up my code, and reports on things that may be malformed. This is the first test I run, since if the code is invalid (e.g. missing an end somewhere), it won’t pass this test.

Putting these into a test framework like Travis-CI is so very easy, especially since it’s a RubyGem, and I have set up environment variables to test against specific versions of Chef.
This provides the fast-feedback loop that tests my code against a matrix of Ruby & Chef versions.

So there you have it. A long explanation of why I wrote something. I had looked around, and there’s a knife crawl that is meant to walk a given role’s dependency tree and provide that, but that only worked for a single role, and wasn’t focused on visualizing.

So I wrote my own. Hope you like it, and happy to take pull requests that make sense, and bug reports for things that don’t.

You can find the gem on RubyGems.org – via gem install knife-role-spaghetti or on my GitHub account.

I’m very curious to know what other people’s role spaghetti looks like, so drop me a line, tweet, comment or such with your pictures!

Quick edit: A couple of examples, showing what this does.

Sample Roles

(full resolution here)

Running through the neato renderer (with the -N switch) produces this image:

Sample Roles Neato

(full resolution here)

Fast and Furious Monitoring

In the past few weeks, I’ve been working with a company that is using ScoutApp‘s hosted monitoring service, which provides a nice interface to quickly get up and running with a lot of basic information about a system.

This SaaS solution, while a paid service, lets a team get monitoring metrics in place with the fastest possible turnaround, and the cost scales at a rate of ~$10/server/month.

Getting up and running is as simple as signing up for their risk-free 30-day trial, logging in to their interface, and following some simple instructions on installing their RubyGem plugin, aptly named scout, like so:

gem install scout

Obviously, this needs Ruby installed, which is pretty common in web development these days.

Executing the scout executable will then prompt you for a GUID, provided from the web interface when “Adding a new system”, which tests connectivity to the ScoutApp service, and “checks in”.

Once the new system is added, the scout gem needs to be executed once a minute to check in with the server end, so this is typically achieved by placing an entry in the crontab, and again, the instructions are provided in the most convenient location on the command line, with variations for your system.

Once installed in crontab, it’s pretty much “fire-and-forget” – which is probably the best feature available in any system.

Heading back to the web interface, you’ll see the system details, and the real advantage of the ScoutApp system – the plugins.

Each system starts with a bunch of the basics – server load, memory profiling, disk space. Great! 90% of problems manifest in variations in these metrics, so getting them on the board from the get-go is great.

The Plugin Directory has a bunch of plugins for applications commonly used in the FLOSS stacks that are very popular in web development, so you can readily add a plugin of choice to the applicable server. Adding a monitor to check your MySQL instance for slow queries is simply a matter of choosing the plugin, and the plugin actually tells you what you need to do to make it work, like changing a config file.

Once those pieces are in place, monitoring just keeps working. Plugins typically have some default triggers and alerts, based on “what makes sense” for that plugin.

There’s currently 49 public plugins, which cover a wide range of services, applications, and monitoring methodologies, like checking a JMX counter and watching a log file for a condition you specify.

Extending functionality is pretty easy, as I found out firsthand. Beyond having a succinct plugin development guide, the support team is very helpful, and all of the plugins are available as open source on GitHub.

Plugins are written in Ruby – also a popular language in the tech arena these days.

Since one of the many services in our software stack is Apache Zookeeper, and there was no plugin for this service, I set out to write my own, to accomplish:

  1. Get the state of a Zookeeper instance monitored (service up, some counters/metrics)
  2. Learn some Ruby
  3. Give back

I wrote the basics of a plugin, and testing it locally on a Zookeeper instance with Scout proved to be a very fast turnaround: I got results within a day, then thought more about how I was doing it, refactored, tested, and refactored again.
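For a flavor of what collecting those counters can involve, here is a rough sketch in plain Ruby. It is not the plugin's code and does not use the Scout plugin API. It assumes the metrics come from Zookeeper's "mntr" four-letter command, which replies with tab-separated key/value lines. The helper names, the localhost:2181 defaults and the sample values are all made up for illustration.

```ruby
require "socket"

# Send a Zookeeper four-letter command (e.g. "ruok" or "mntr") and
# return the raw reply. Host and port defaults are assumptions.
def four_letter(command, host = "localhost", port = 2181)
  TCPSocket.open(host, port) do |sock|
    sock.write(command)
    sock.read
  end
end

# Turn "mntr" output (one "key<TAB>value" pair per line) into a hash,
# converting purely numeric values to integers.
def parse_mntr(text)
  text.each_line.with_object({}) do |line, metrics|
    key, value = line.chomp.split("\t", 2)
    next if key.nil? || value.nil?
    metrics[key] = value.match?(/\A-?\d+\z/) ? value.to_i : value
  end
end

# Hand-written sample reply, so the parser can be tried without a server.
SAMPLE = [
  "zk_avg_latency\t0",
  "zk_outstanding_requests\t0",
  "zk_server_state\tstandalone",
  "zk_znode_count\t4"
].join("\n")

p parse_mntr(SAMPLE)
```

Against a live instance, parse_mntr(four_letter("mntr")) would produce the same kind of hash, and four_letter("ruok") answering "imok" is a cheap service-up check.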

I forked the ScoutApp GitHub repo, added my code, and issued a Pull Request, so they would take my code and incorporate it back into their Plugin Directory.

Lo and behold! It’s included, and anyone running both ScoutApp and using Zookeeper can simply add the plugin and get instant monitoring.

Here’s a screen capture of my plugin running, collecting details, and keeping us safe:

ScoutApp: Zookeeper

I encourage you to check it out, especially if you don’t have a monitoring solution, are starting a new project and have a few servers, or are looking for something else.

Sit on this, and logrotate!

Since a lot of what everyone does on those pesky devices called “comp-you-tars” is becoming increasingly more business-critical, and we’ve come to a point where a web company that has “one server that we all use” is going nowhere, we have piles of lovely silicon and metal, with electric pulses flowing through them to create the world as we see it today.

Server Room

I love these machines, as they have extended our abilities far beyond a single person, they have connected us in ways that our ancestors could only have imagined and written about in fiction, and they provide a central part of our everyday lives.

Developing complex systems has provided us with the challenge of building and maintaining large numbers of machines, and done correctly, a single person can easily control thousands, if not tens of thousands, of machines with a high degree of stability, confidence and grace.

Back in the olden days, systems were small, resource constraints were very much a real problem, and this provided developers the incentive, nay, the requirement, of knowing about their system and how to write efficient and clean code within the constraints.

As time goes by, each resource constraint is alleviated, for a while, by hardware manufacturers …

The day my Xbox died

So today I’m hanging around home, and figured I’d geek out a bit and play around with my home entertainment setup.

I have a Samsung 42″ plasma TV, great picture, connected via HDMI to my TimeWarnerCable HD-DVR box.

Also connected is my Xbox 360, via component, and I typically use that (when not playing games) to watch videos, stored on my Drobo, with the attached DroboShare running fuppes to front the files via UPnP.

And today, when I had sat down to watch a film, I turn on the Xbox, and it freezes. And then displays the ominous Red Ring of Death. Damn.

Now I’ve submitted a repair request for this. Even though the console is out of warranty, M$ offers up to three years of coverage on this particular issue, and provides shipping and packaging for it all, so hopefully in a few days I’ll get their box and send my dear console back to them for repair.

This failure spurred me into wondering how I could watch my films, so I hooked up my laptop’s video out and headphones up to the TV, and saw that work well. And then my roommate mentioned that I might want to hook up the mini-stereo system to the TV as well.

So I did. And the sound is pretty good compared to the internal speakers on the TV. They are ok, but the stereo speakers provide a much warmer sound, a fuller environment.

So now that there’s a new set of speakers involved, and given my eternal desire to not have five thousand remote controls around the house, I updated the Logitech Harmony remote control I got a while back to use the correct sequence and control the stereo volume.

So it’s all nicely playing together, all except the Xbox, which is dead. That led me to look into other multimedia solutions, like XBMC and Plex, both pretty good looking. So I might figure out some way to create that link sometime soon, so it’s a very pretty multimedia interface.

Keep rolling, rolling, rolling…

A while back I wrote about using Nagios as a monitoring system.

Since then, I’ve had need to have it deployed via a packaging system called RPM, and since no “stable” community editions are out there, I have the need to “roll my own” for distribution on our platforms.

I’ve never used RPM from the “packager” side before – and it’s both very cool and infuriating. It has all sorts of features and powerful macros, but debugging it isn’t a piece of cake at all.

If anyone has a great RPM tool out there that they want to recommend, let me know.

Monitor this.

A while back, we began investigating centralized monitoring tools for multiple systems, cross-platform, alerting, etc.

One contender was a package from MS, and a few others were tossed in the ring.

We did a proper match-up (or shootout, as I prefer) and tested a couple of candidates. While the all-inclusive MS offering is probably the best-functioning one, the cost is too prohibitive for a monitoring tool – about $1500/host monitored.

The extensibility and ease of use are incomparable, but cost being a factor, we looked at another popular solution – Nagios.

Open source, modifiable – or should I say – Build Your Own – as it comes with some basic engine concepts, and then you pretty much have to build every single monitor you want to look at.

The result is a more targeted monitoring solution, inasmuch as it does exactly what you set it out to do – but absolutely no more.

The difference showed this past week when I got an alert from my test MS instance about a SQL job running too long. With Nagios, I would have had to write some code, adapt it to monitor that specific job, and hope it could deal with exceptions I hadn’t thought of.

That’s a difference between a specialist in a particular field (e.g. DBA, mail admin, etc.) and the overall concept of a systems administrator – sometimes a jack-of-all-trades.

The MS offering is composed of “Management Packs” that are written by the developers of the systems being monitored – i.e. Exchange developers write the monitors for Exchange and so on, whereas in the Nagios monitoring world, you are expected to figure out all of your own monitors, thresholds, etc.

I guess it makes it a little more interesting in the long run, as building something from scratch allows you the familiarity of knowing the ins-and-outs of the systems, but it’s time consuming and the returns are not as immediately apparent.

But it’s affordable. And we’ve got the techie know how to do it. So we do it.

If any readers have used Nagios, are interested in it, have advice, want advice, want to see what the color blue tastes like, let me know.

Who said that Granny Smith isn’t a good Apple?

Some of you may know that I don’t hold much love for Apple’s operating system.

It feels so clunky compared to my Windows-fu knowledge, and the change from one to the other is not at all simple.  I’d rather use Ubuntu, to be honest.

But here’s my current beef with Mac OSX – my machine is bound to Active Directory (in a corporate environment, they ALL should be!) and, like any good computer, looks for a Domain Controller after a reboot, to check your login credentials, apply any scripts, etc.

If it’s a mobile machine, typically you’ve set it up as a “mobile user account”, meaning that the machine is to cache your credentials, and in the absence of a DC, check the local cache and allow you to log in.

However, whenever MINE reboots, it takes about half an hour to log in, and there’s no progress bar, cancel button, notification, etc. as to WTF it is doing. Eventually, it might let me in. But in the meantime, time is a-wasting.

I finally got fed up enough to really research this, and it seems that there’s a way to fix it manually (in what all OSX users will vehemently insist is NOT a Registry!) by modifying the values of a few keys, to reduce the timeout wait. But you can only do that once you’ve logged on.

So I’m stuck using another machine until mine logs me in and lets me change it. What a waste of time.

Windows will time out within a minute and let you know why.

Grumble, grumble, grumble.

Battle of the OS

This is crazy, but it needs to be put out there.

I currently have:

1 HP Laptop, running Windows XP
1 MacBookPro, dual booting Vista and OSX Leopard, with Parallels to run XP and Ubuntu under OSX
1 EeePC, currently running Xandros (Eee mode), and soon to dual boot with XP and Ubuntu

Is this too many operating systems? I think it just might be.

What’s your preference, and why?

Learn from others what not to do

So a while ago, my friend David sent me a funny article (funny for us, not for the article’s subjects).

It showed what some brilliant SysAdmin had done at his company’s location, and how it backfired miserably.

Read it here: A “Priceless” Server Room: Priceless – Worse Than Failure

I hope you enjoy reading the article. I think I may have …