A decade of writing this stuff? Seriously.

In tinkering with my blogging platform and playing with different technologies, I’ve just realized that I’ve been writing online for well over a decade now.

It started a long time ago, when I was writing personal stuff on a public Open Diary back in 1995, under an alias which, for the life of me, I can’t recall. The site is currently unavailable; I was curious whether they still indexed old entries, and whether I could dig anything up from back then.

It was a place where I tossed out whatever I had in mind, a place to jot down the ideas running through my head, a creative outlet with the safety of knowing nothing would ever come back to me, since I lived behind the veil of anonymity (back then, PRISM was just a dream…) and could express whatever I wanted.

After writing there for a couple of years, I was witness to the 1997 Ben Yehuda Street Bombing – I was at a cafe off the street with some friends when it happened, and went to offer whatever help I could, having had some First Aid training. After spending some two hours dealing with things that I’ve since pushed far to the back of my mind, I was gathered up by a friend, carted to his house, and sat in shock for a few hours before making my way home.

The next day, I wrote about it on OD, and referenced my friend by first name only.

A couple of days later, a comment appeared on my post, asking if my friend was ‘So-and-so from Jerusalem’ – if so, the commenter knew him, and agreed that he was a great help. We began discussing our mutual friend, and eventually met in person.

This was the first revelation I had – you’re never truly anonymous.

We became pals, hung out a few times, and continued to stay up to date with each other for a while.
I did notice that after a while my writing dwindled. Now I knew that there was someone out there who knew who I was; not that I was saying anything outrageous, but the feeling of freedom dropped.

During my time in the Air Force, I wrote extremely rarely, since getting online was near impossible from base, and after my discharge in 2000, I had pretty much stopped writing altogether.

In 2003, my friend Josh Brown invited me to the then-closed community of LiveJournal, which quickly grew into our local social networking site, where we could post, comment, and basically keep up with each other’s lives.
Online quizzes were ‘the thing’ and posting your results as an embed to your post was The Thing to do.

After I had spent four years on LJ, they began providing additional customizations and features for paid users only. I didn’t want to spend any money on that – I’d rather host my own site.

So I did, for a while. In 2006, I built my own WordPress 2.0 site (history!), hosted it on my home server (terrible bandwidth), and began the journey of customized web application administration. Dealing with databases, application code updates, frameworks, plugins, you name it.

I think I actually enjoyed tinkering with the framework more than actually writing.

Anyhow, I’ve written sporadically over time, about a wide variety of things, both on this site and elsewhere.
The rise of Facebook, Twitter, and pretty much every other social content outlet has replaced a lot of the heavier topic writing that went on here.

But it does indeed fill me with some sense of happiness that I’ve been doing this for a long time, have preserved whatever I could from 2003 until now, and continue to try and put out some ideas now and then.

My hope is that anyone can take the time to express their creativity in whatever fashion they feel possible, and share what they want to with the rest of us.

The Importance of Dependency Testing

Recently I revisited an Open Source project I started over a year ago.

This tool is built to hook into a much larger framework (Chef), leverages a bunch of code many other people have written, and produces a specific result that I was looking for.

This post is less about the tool itself than about the process and procedure involved in testing dependencies.

This project is written in Ruby, and as many have identified in articles and tweets, some project maintainers don’t adhere to a versioning policy, making it hard to ensure working software across multiple versions of dependencies.
A lot rides on the maintainer’s adherence to a versioning standard – a very popular one is Semantic Versioning, or SemVer for short.

This introduces a few other questions, like how frequently an author should release new versions of code, and how frequently users should upgrade to leverage new fixes, features, etc.

In any case, my tool was restricted to running against the framework’s 10.x versions, since functionality may change between major versions, and there is no guarantee that my tool will continue working.

A new major version of Chef was released earlier this year, but most of my existing projects are still on Chef 10.x, as it is still being updated with stability fixes and security patches, and the ‘jump’ to 11 is not on the schedule right now, so my tool continues functioning just fine.

Time passes, and I have a project running Chef 11 that I want to use my tool with.

Whoops. There’s a constraint built into the tool’s dependency declaration that will report: “you have Chef 11, this wants Chef 10.x and not higher, have a nice day”.

So I change the constraint, install locally, and see that it works. Yay!
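For illustration, that kind of change looks something like this in a gemspec – a hedged sketch, with a made-up gem name and version, not the tool’s actual file:

# knife-tool.gemspec -- a hypothetical example, not the real gemspec
Gem::Specification.new do |spec|
  spec.name    = 'knife-tool'
  spec.version = '1.1.0'
  spec.summary = 'A knife plugin'

  # Before: spec.add_dependency 'chef', '~> 10.0'   # 10.x only
  # After: accept 10.x and 11.x, but guard against the next major bump.
  spec.add_dependency 'chef', '>= 10.0', '< 12'
end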

Now I want to commit the change that I made to the version constraint logic, but I want to continue testing the tool against the 10.x versions, as I should continue to support the active versions for as long as they are alive and in use, right?

The practice I had been using in my tests was: given a static list of Chef versions, use each entry as the Chef version for installation and testing.

This required me to update the static list each time a new version of Chef was released, and meant I was potentially testing against versions that didn’t need testing – what I really wanted was to test against the latest of each mainline release.

I updated my constraint, ran the test suite that I’ve written, and whoops, it failed the tests.

Functionality-wise, it worked correctly on both versions, so the problem must be in my test suite, right?

I found a cool project called Appraisal that’s been around for a while and is used by a bunch of other projects; you can read more about it here.
It allows one to specify multiple version constraints and test against each of them with the same test suite.
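The wiring is minimal – a sketch of what an Appraisals file might look like for my case (the group names are my own choice):

# Appraisals -- one entry per dependency line to test against
appraise 'chef-10' do
  gem 'chef', '~> 10.0'
end

appraise 'chef-11' do
  gem 'chef', '~> 11.0'
end

Then `bundle exec appraisal install` generates one Gemfile per entry under gemfiles/, and `bundle exec appraisal rake test` runs the same suite once per Chef line.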

Sure enough, passes on version 10, not version 11. Same code, same tests. #wat

So now it’s time to do some digging. I read through some of the Chef changelog and decide there’s too much to wade through there; instead, let’s take a look at the code my tool is using.

The failure was triggering here, and was showing a default value.
This meant that Chef was no longer loading the configuration file that I provide here correctly.

So I took a look at the current version of the configuration loader, and visually compared it with the 10.x version.
Sure enough, there’s one small change that’s affecting me: Old vs New

working_directory? What’s this? Oh, it’s over here, just a few lines prior.

Reading the full commit, and the original ticket, it seems like this is indeed a good idea, but why are my tests failing?

After further digging around in the aruba test suite extension I’m using, I realize that the environment variable PWD remains set to the actual working directory of my shell, not that of the test suite’s subprocesses.
Thus every time it runs, the chef_config_dir is looking in my current directory, not the directory the tests are running in.

After poking around aruba’s source code, and adding some debugging statements during test runs, I figured out that I need the test suite to change its PWD environment variable based on the test’s execution, which led to this commit.
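The essence of the fix, sketched out – assuming aruba’s set_env helper and current_dir, which is where aruba executes its subprocesses:

# features/support/env.rb -- keep the subprocess' PWD in sync with the
# directory the tests actually run in, instead of my shell's directory.
Before do
  set_env('PWD', File.expand_path(current_dir))
end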

Why is this different? Well, before, Ruby’s Dir.pwd would be invoked from inside the running test, loading the config from a location relative to Dir.pwd, which is where I was placing the test config file.
Now the test was trying to load the config relative to the process’ environment variable PWD instead, and failing to find the config.
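A quick Ruby session makes the distinction clear:

# Dir.pwd asks the OS for the process' real working directory;
# ENV['PWD'] is just a variable inherited from the shell, and goes stale
# once a subprocess changes directories.
Dir.chdir('/tmp') do
  puts Dir.pwd      # => "/tmp"
  puts ENV['PWD']   # => whatever directory the shell was in at launch
end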

Tests pass, and now I can have Travis CI continue to test my code against multiple dependency versions whenever it changes, and catch things before they go badly.

All in all, an odd behavior to expect in a normal situation, as my tool is meant to be run interactively by a user, not via a test suite that mocks up all sorts of other environments.

So I spent about 2-3 hours digging around to essentially change one line that makes things work better and cleaner than before.

Worth it? Completely, as these changes will allow me to continue to ensure that my tool keeps working with upstream releases of the framework, and to maintain compatibility with its supported versions.

TL;DR: Don’t skimp on testing your project against multiple versions of external dependencies, especially when your target users are going to be using more than one possible version.

P.S. Shout out to my girlfriend, who generously lets me spend time hacking on these kinds of things 😀

The Ripple Effect of Choices

We live our lives in a chaotic universe.

Atoms crashing into each other at crazy fast rates, hyper-vibrations and electric pulses at the core of our bodies, all working in tandem to get us through our day.
Do they govern our choices? Do our choices govern them? Who is in charge here, anyways?

Many traditional organized religions may express that everything is at the will of a higher power or being, and that we are governed from above.
Some may express that everything is predetermined other than moral decisions, and that is what we are responsible for.

An interesting statement I heard at one point is that “anyone trying to reconcile divine predetermination and free will has a fool’s errand.”

I like to think that I make my own choices about all things, but the subsequent impact of any given choice is pretty much up in the air.

This morning before I left my house, I made a choice of what to wear. Since today I will be presenting a short talk on a professional topic, I chose khaki slacks instead of jeans.

Once I left the house, I had to choose which form of transportation I would take to the office – take a CitiBike, or take one of two trains.

Since I was dressed “nicely” and didn’t want to get all sweaty, the bike option was out. Leaving me with the trains.

Then came the choice of which train; walking towards one of them would bring me to a breakfast cafe where I sometimes like to get a morning sandwich, so I took that option.

I walked in, saw the LONG line for food, and chose to turn around and head to the train instead of waiting. It’s not that great a sandwich anyways.

So I turn the corner into a small alleyway, and a lady is walking about two strides ahead of me. I notice that at one point she pauses and lets a car coming from ahead pass completely before resuming, and I realize it’s because the cars are coming at high speed and there are deep puddles, so she was avoiding the splash.

A few moments later, another car comes along, and she speeds up, I pause, and try to remain far enough from the splash zone.

Nope.

She looks back at me with a touch of concern while I’m spattered with droplets from a puddle of unknown cleanliness, and cracks a wry smile, as I smile back at her and say “I should have done what you did!”

Will the water dry? Pretty sure it will. Will my slacks remain presentable? Don’t know just yet; they’re still drying right now. Did every choice up until this point bring me here? Yes. Was this the universe pushing me here? I don’t think I’ll ever know.
But it did provide me with a new set of ideas.

Years ago, Life Magazine featured a picture (I spent 20 minutes looking for it online, couldn’t find it) of a pedestrian leaping like a ninja to dodge a monsoon-like splash from a passing vehicle, and this experience immediately re-triggered that image. I feel I can relate to that particular image better now, probably reinforcing some neural passageway in my brain as a long-term memory resurfaces.

It provided me with what one might call a “New York experience”, since this is probably not the first time this has happened in a small alleyway in NYC, and it most certainly won’t be the last.

This short experience also reminded me that you can’t “ever go back and change something”, short of having a Flux Capacitor. And if you could, would you? I wouldn’t.

It also brought to the forefront that all the choices I’m making today were driven in part by choices I made in the past – one of which was the choice to speak in front of others.

Your life is made up of the set of choices, experiences and hopefully the subsequent knowledge gained from them. It’s kind of what makes you: You, and me: Me. That’s one reason (of many) why we’re all different.

In short, make good choices. “Good” is a subjective term to you at the time you have to make them. As long as you take a moment and think, “Is this the best choice I could make right now, given everything else I know from before?”, then you’ll probably be ok. We make tons of these choices unknowingly every day.

Oh, and what’s good for the goose might not always be good for the gander.


P.S. This blog post was written during my first attempt at the Pomodoro Technique in a cafe. Dunno how I feel about it yet, but it sure worked to hammer out a post in a focused amount of time.

My foray into web development

I like browsing the web. I do it a lot throughout my day.

A lot of people work hard at making the web a cool-looking place. Some sites make simplicity look so easy that when you look under the hood, it’s all chaos and destruction, folded and crunched together, all to present something really nice and smooth for the end user.

I’m not a developer – much less a web dev. There’s a lot to know in any field of computing – and the web is pretty much the most visible part of computing as a whole, since pretty much anyone, anywhere is going to use a web browser to view a site at some point.

I mean, sure, we all learned some HTML – hell, I wrote some sites back in the days of Geocities, and it was awesome to learn about tiling backgrounds of animated GIFs, and when CSS came around, minds blown!

And I left that field for the frontend developers, and went into infrastructure and operations.

And as time passes, you find yourself managing a variety of systems and knowledge, and at some point, you may say to yourself, “I wish I knew how to answer this question…”

And then you write some code to answer it. Voila! You’re a developer, of sorts.

I’m a huge fan of data visualization. Telling stories with pictures dates back millennia, and it’s very relatable to most people. Recently, I wrote a tool to help myself display the dependency complexity of Chef roles, and I found that, while very useful, the output is quite limited, as it’s a statically generated image – whereas we live in a web-friendly world where everything is interactive and fun!

So when I came across another hard question I wanted to answer, I thought, “Why not make this a web application?”

This time, the question I wanted to answer was: As a GitHub Organization owner for my company, what human-to-team-to-software-repository relationships do we have, and are they secure?

If you’ve ever managed an Organization in GitHub, there are a few key elements.

  1. An Organization can have many Repositories
  2. An Organization can have many Teams
  3. A Repository can have many Teams
  4. A Team can have many members, but only one permission (read only, read/write, owner)

So sorting out who is on what Team, what access they have, across many repositories, can be a security nightmare. Especially when you have more than 4-5 repositories.
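To give a flavor of the data involved, here’s a rough sketch of walking those relationships with the octokit gem – the org name and token are placeholders, and this isn’t the finished app’s actual code:

# Sketch: enumerate Teams, their members, and their Repositories for an Org.
require 'octokit'

client = Octokit::Client.new(:access_token => ENV['GITHUB_TOKEN'])

client.organization_teams('my-org').each do |team|
  members = client.team_members(team.id).map(&:login)
  repos   = client.team_repositories(team.id).map(&:name)
  puts "#{team.name} (#{team.permission}): #{members.join(', ')} -> #{repos.join(', ')}"
end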

During my first foray into solving this, I cobbled together a command-line tool using Ruby with the Graphviz library. I’ve liked Graphviz for years – it’s straightforward: structured text gets rendered into a graph, which can then be output to a file.

Very straightforward, with some limitations, but it basically allows you to store graphs as text and re-render them when changes happen. It’s like storing source code instead of the binary output.
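As a taste, the ruby-graphviz gem boils down to a few calls – a minimal sketch, not the tool’s code:

# Build a two-node digraph and render it to a PNG.
require 'graphviz'

g = GraphViz.new(:G, :type => :digraph)
web  = g.add_nodes('role[webserver]')
base = g.add_nodes('role[base]')
g.add_edges(web, base)            # webserver depends on base
g.output(:png => 'roles.png')     # the "binary output" of the text graph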

But there were some limitations, and I wanted the answer to this new question to be more than a command-line tool – something I could share with the world at large, without requiring any client-side installation of tools or dependencies.

So I spent a lot of time hemming and hawing, looking at web frameworks and trying to figure out some of them, and “how does this work?” came up a lot.

Finally, yesterday I set out to sit down and accomplish this task. I sat in a Starbucks in New York City, and had a Venti. I started banging away at about 11:30. I took a break for a refill and a snack around 1:30, and when I sat down again, I kept hacking away until 9:30pm, when I deemed it complete.

The code was written, tested by me locally, pushed to GitHub, deployed to Heroku, DNS name wired up and all. As soon as I completed, I left Starbucks, and heaved a huge sigh – it was one hell of a mental high, I was in “the zone” and had been there for a long time.

You are more than welcome to browse the source code here and the finished project here. I call it the GitHub Organization Viewer, hence “GOVweb”.

I have a bunch of other ideas on how to make this better, how to model the data, which visual style to use, but I think for now, I’m going to leave it for a bit, and see what I think about it in a couple of months.

But all in all, this reinforced my opinion to never be afraid to try tackling a new idea, a new project, a new field you’re unfamiliar with – as long as you can read, comprehend and learn, the world is your oyster.

A picture is worth a (few) thousand bytes

(Context alert: Know Chef. If you don’t, it’s seriously worth looking into for any level of infrastructure management.)

TL;DR: I wrote a Knife plugin to visualize Chef Role dependencies. It’s here.

Recently, I needed to sort out a large amount of roles and their dependencies, in order to simplify the lives of everyone using them.

It wasn’t easy to determine that changing one would affect many others, since it had become common practice to embed roles within other roles’ run_list, resulting in a tree of cross-dependency hell.
A node’s run_list would typically contain a single role-specific item, embedding the lower-level dependencies.

A sample may look like this:

node[web1] => run_list = role[webserver] => run_list = role[base], recipe[apache2], ...
node[db1] =>  run_list = role[database]  => run_list = role[base], recipe[mongodb], ...

Many of these roles had a fair amount of code duplication, and most were setting the same base role, as well as any role-specific recipes. Others were referencing the same recipes, so figuring out what to refactor and where, without breaking everything else, was more than challenging.

The approach I wanted to implement was to have a very generalized base role applied to every instance, with any specific roles then applied to a given node as needed.

After refactoring, a node’s run_list would typically look like:

node[web1] => run_list = role[base], role[webserver]
node[db1] =>  run_list = role[base], role[database]

A bit simpler, right?

This removes the embedded dependency on role[base], since the assumption is that every node will have role[base] applied to it, unless I don’t want it to for some reason (some development environment, for instance).
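A refactored role file might look like this – a sketch, with illustrative names:

# roles/webserver.rb -- role[base] is no longer embedded here, since it is
# applied directly on the node's run_list alongside this role.
name 'webserver'
description 'Web tier'
run_list 'recipe[apache2]'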

Trying to refactor this was pretty tricky, so I wrote a visualizer to collect all the roles from a Chef repository’s role_path, parse them out, and create an image.

I’ve used Graphviz for a number of years now, and it’s pretty general-purpose when it comes to creating graphs of things (nodes), connecting them (edges), and rendering an output. So this was my go-to for this project.

Selling you on the power of visualizing data is beyond the scope of this post (and probably the author), but suffice to say there are industries built around putting data into visual format for a variety of reasons, such as relative comparison, trending, etc.
In fact, some buddies of mine have built an awesome product that does just that – visualizes data and events over time. Check them out at Datadog. (I’ve written other stuff for their platform before; it’s totally awesome.)

In my case, I wanted the story told by the image to:

  1. Demonstrate the complexity of the connections between roles/recipes (aka spaghetti)
  2. Point out if I have any cyclic dependencies (it’s possible!)
  3. Let me focus on what to do next: untangle

Items 1 & 2 were pretty cool – my plugin spat out an increasingly complex graph, showing relationships that made sense for things to work, but also contained some items with 5-6 levels of inheritance that are easily muddled. I didn’t have any cyclic dependencies, so I created a sample one to see what it would look like. It looked like a circle.

Item 3 was harder, as this meant that human intervention needed to take place. It was almost like deciding on which area of a StarCraft map you want to go after first. There’s plenty of mining to do, but which will pay off fastest? (geeky references, are you surprised?)

I decided to start with some of the smaller clusterings, and made some progress, changing where certain role statements lived and adjusting the node <=> role assignments to refactor a lot out.

My process of writing a plugin developed pretty much like this:

  1. Have an idea of how I want to do this
  2. Write some code that when executed manually, does what I want
  3. Transform that code into a knife plugin, so it lives inside the Chef Ecosystem
  4. Package said plugin as a RubyGem, to make distribution easy for others
  5. Test, test, test (more on this in a moment)
  6. Document (readme only for now)
  7. Add some features, rethink how certain things are done, refactor.
  8. Test some more

Writing code, packaging and documentation are pretty standard practices (more or less), so I won’t go into those.

The more interesting part was figuring out how to plug into the Chef/Knife plugins architecture, and testing.

Thanks to Opscode, writing a plugin isn’t too hard: there’s a good wiki, and other plugins you can look at to get some ideas.

A couple of noteworthy items:

  1. Figuring out how to provide command-line arguments to OptionParser was not easy, since there was no real intuitive way to do it. I spent about 2 hours researching why it wasn’t doing what I wanted, and finally figured out that "--flag" and "--flag " behave completely differently (see the sketch after this list).

  2. During my initial cut of the code, I used many statements to print output back to the user (puts "some message"). In the knife plugin world, one should use ui.info or ui.error and the like, as this makes it much cleaner and consistent with other knife commands.
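For the first item, the distinction looks like this in a knife plugin’s option declaration – a sketch with an illustrative class and option name (knife uses Mixlib::CLI, which feeds OptionParser):

require 'chef/knife'

class MyKnifePlugin < Chef::Knife   # hypothetical plugin class
  banner 'knife my plugin (options)'

  # '--format' alone would define a bare switch; '--format FORMAT' (note
  # the argument name after the space) tells the parser to consume a value.
  option :graph_format,
    :short       => '-F FORMAT',
    :long        => '--format FORMAT',
    :description => 'Output format for the graph',
    :default     => 'png'
end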

Testing:

Since this is a command-line application plugin, it made sense to use a framework that can handle inputs and outputs, as that’s my primary concern.
Coming from a background in systems administration and engineering, I never had software testing at the top of my to-learn list, so when the opportunity arose to write tests for another project of mine, I turned to Cucumber and its CLI extension, Aruba.

Say what you will about unit tests vs integration tests vs functional tests – I got going relatively quickly writing tests in quasi-English.
I won’t say that it’s easy, but it definitely made me think about how the plugin will be used, how users may input commands differently, and what they can expect to happen when they run it.

Cucumber/Aruba also allowed me to split my tests in a way that I can grok: all the CLI-related commands, flags, and options live in one test ‘feature’ file, whereas another feature file contains all the tests for reading the roles and graphing them in different formats.
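For flavor, a scenario in one of those feature files might read like this – a sketch using aruba’s built-in steps; the exact command and file name are illustrative:

Feature: Command line interface
  Scenario: Generating the default graph
    When I run `knife role spaghetti`
    Then the exit status should be 0
    And a file named "role-spaghetti.png" should exist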

Writing tests early on allowed me to continue to capture how I thought the plugin would be used, write that down in English, and think about it for a while.
Some things changed after I had written them down, and even then, after I figured out the tests, I decided that the behavior didn’t match what I thought would be most common.

Refactoring the code while running tests in between to ensure that the behavior I wanted remained consistent was very valuable. This isn’t news for any software engineers out there, but it might encourage more systems people to learn about testing.

Another test I use is a style checker called tailor – it measures up my code and reports on things that may be malformed. This is the first test I run, since if the code is invalid (i.e. missing an end somewhere), it won’t pass this test.
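Running it is a one-liner against the source directory:

$ tailor lib/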

Putting these into a CI service like Travis CI is very easy, especially since it’s a RubyGem, and I have set up environment variables to test against specific versions of Chef.
This provides the fast-feedback loop that tests my code against a matrix of Ruby & Chef versions.
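One way to wire that up is to let the Gemfile read the version from the environment, so each build in the CI matrix can pin its own Chef – a sketch; CHEF_VERSION is my own variable name:

# Gemfile
source 'https://rubygems.org'

# Each build in the CI matrix exports CHEF_VERSION, e.g. '~> 10.0' or '~> 11.0'.
gem 'chef', ENV['CHEF_VERSION'] || '~> 10.0'
gemspec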

So there you have it. A long explanation of why I wrote something. I had looked around, and there’s knife crawl, which is meant to walk a given role’s dependency tree and provide that, but it only worked for a single role and wasn’t focused on visualizing.

So I wrote my own. Hope you like it, and happy to take pull requests that make sense, and bug reports for things that don’t.

You can find the gem on RubyGems.org – via gem install knife-role-spaghetti or on my GitHub account.

I’m very curious to know what other people’s role spaghetti looks like, so drop me a line, tweet, comment or such with your pictures!

Quick edit: A couple of examples, showing what this does.

Sample Roles

(full resolution here)

Running through the neato renderer (with the -N switch) produces this image:

Sample Roles Neato

(full resolution here)

Recruiting via LinkedIn – Don’t Do This!

I regularly get emails from recruiters all over the planet, telling me about their awesome new technology, latest and greatest ideas, and why I should work for them.

Most get ignored.

One came in this week that annoyed me, since it was from someone at a company that had sent me the exact same email six months ago.

I felt I had to respond:

Hi <recruiter name>,

I think I heard of <YourCompany> last year sometime from a friend.

I also received this same stock email from you on 8/22/11, and you had addressed it to “Pascal” – further evidence of a copy-and-paste.

It would behoove you to keep records of whom you contact, as well as reviewing the message you paste before clicking “Send”.

A stock recruiter email is not a very likely way to attract good recruits, especially if you’re listing a ton of things that are not particularly relevant or interesting in the realm of technology.

Asking me to send a resume while being able to view my full LinkedIn profile also seems superfluous – the information is right there; you have supposedly read it, and it is what attracted you to my profile in the first place, rather than “someone who turned up in a keyword search”.

I wish you, and your company all the best, and hope that these recruiting tactics work for you.

All the best,
-M

I am very curious what kind of response, if any, I shall get.

Chatting with a robot

Here I am, sitting calmly, trying to figure out the reasoning for the universe, and I get a GChat notification that someone wants to chat with me.

Here’s the transcript:

10:29:12 AM caitlyn ball: hi
10:29:17 AM miketheman: hi
10:29:24 AM caitlyn ball: hey whats up? 22/F here. you?
[email protected] is now known as caitlyn ball. (10:29:27 AM)
10:29:41 AM miketheman: totally bored.
10:29:49 AM caitlyn ball: hmm. have we chatted before?
10:30:15 AM miketheman: probably not, since you just added me to your list
10:30:24 AM caitlyn ball: oh ok. i wasnt sure. anyways.. whats up?
10:30:35 AM miketheman: not much, working. you?
10:30:45 AM caitlyn ball: im like so boreddd…. there is nothing to do
10:31:00 AM caitlyn ball: ohhh wait! i got a great idea. have you ever watched a sexy girl like me strip live on a cam before?
10:31:18 AM miketheman: no, I don’t believe that I have.
10:31:25 AM miketheman: And that seems to be a great idea.
10:31:29 AM caitlyn ball: wellllll….. you could watch me strip if you would like?
10:31:43 AM miketheman: possibly.
10:31:56 AM miketheman: Or we could discuss the nature of the desire for people to watch other people remove their clothing
10:32:00 AM caitlyn ball: yeah? ok well my cam is setup through this website so that i cant be recorded so you have to signup there.
10:32:09 AM caitlyn ball: it only takes a minute and it is free. ok?
10:32:13 AM miketheman: That doesn’t seem likely.
10:32:28 AM caitlyn ball: http://<removed> go there then at the top of the page click on the goldish JOIN FREE button.
10:32:33 AM caitlyn ball: k?
10:33:00 AM miketheman: Are you sure you don’t want to debate the reasoning behind the attraction with exposed bodies?
10:33:16 AM caitlyn ball: also it does ask for a credit card but thats how they keep kids out. it does not charge the card. k?
10:33:34 AM miketheman: Of course it does. Are there wizards with hats on the site as well?
10:33:50 AM caitlyn ball: ok babe well hurry up and when u get logged in then u can view my cam and we can have some fun!
10:34:03 AM caitlyn ball: i also have some toys but u have to tip me some gold or join me in private to see those.
10:34:08 AM miketheman: Again, probably not going to happen.
10:34:19 AM caitlyn ball: hey lets talk on there babe. my messenger is messing up.
10:34:36 AM miketheman: I believe you completely.
10:34:52 AM miketheman: You must have reached the end of your loop.
10:35:03 AM miketheman: Bye!

So it was a fun little distraction, and the URL provided resolves to <obviously> girlcamz [net] – haven’t visited, since there’s no point, really.

I was hoping the bot would be a little better than a simple responder to the next input. But alas. Developers of sex-marketing spam bots are probably less inclined to put real engineering effort into their crap.

Ask your systems: “What’s going on?”

This is a sysadmin/devops-style post.
Disclaimer: I work with these tools and people, and like what they do.

In some amount of our professional lives, we are tasked with bringing order to chaos, keeping systems running, and helping the businesses we work for continue functioning.

In our modern days of large-scale computing, web technology growth explosions, multiple datacenter deployments, cloud providers and other virtualization technologies, the manpower needed to handle the vast amount of technologies, services and systems seems to have a pretty high overhead cost associated with it. “You’ve got X amount of servers? Let’s hire Y amount of sysadmins!”

A lot of tech startups start out with some of the developers performing a lot of the systems tasks, and since this isn’t always their core expertise, decisions are made, scripts are written, and “it works”. When the team and systems grow large enough to need their own handler, in walks a sysadmin-type person, who may keel over at the state of affairs.

Yes, there are many tech companies where this is not the case, and I commend them for keeping their systems lean, mean and clean.

A lot of companies have figured out that in order to make the X:Y ratio work well, automation is required. Here’s an article that covers some numbers from earlier this year. I find the stated ratio of 50 servers to 1 sysadmin pretty low compared to my view of how things can be, especially given the tools that we have available to us.

One of the popular systems configuration tools I’ve been using heavily is Chef, from Opscode. They provide a hosted solution, as well as an open-source version of their software, for anyone to use. Getting up and running with some basics is really fast, and there’s a ton of information available, as well as a really responsive community (mailing lists, bug tracker and IRC channel). Once you’re working with Chef, you may wonder how you ever got anything done before you had it. It really treats a large part of your infrastructure as code – something readable, executable, and repeatable.

But this isn’t about getting started with Chef. It’s about “what’s next”.

In any decent starting-out tech company, the amount of servers used will typically range from 2-3 all the way to 200 – or even more. If you’ve gone all the way to 200 without something like Chef or Puppet, I commend your efforts, and feel somewhat sorry for you. Once you’re automating your systems’ creation, deployment and change, you typically want some feedback on what’s going on: did what I asked this system to do succeed, or did it fail?

Enter Datadog.

Datadog attempts to bring many sources of information together to help whoever is supposed to be looking at the systems make more sense of the situation – from collecting metrics from systems and events from services and other sources, to providing a timeline and newsfeed that is very human-friendly.

Having all the data at your disposal makes it easier to find patterns and correlations between events, systems and behaviors – helping to minimize the “what just happened?” question.

The Chef model for managing systems is a centralized server (either the open-source one in your environment or the hosted service at Opscode), which tells a server what it is meant to “be”. Not what it is meant to “do now”, but the final state it should be in. They call this model “idempotent” – meaning that no matter how many times you execute the same code on the same server, the behavior should end up the same every time. But it doesn’t follow up very much on the results of the actions.
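In recipe terms, you declare the end state, and Chef only acts when reality differs from it – for example:

# Declarative and idempotent: re-running this changes nothing once the
# system has converged on the declared state.
package 'ntp'

service 'ntpd' do
  action [:enable, :start]
end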

An analogy could be that every morning, before your kid leaves the house, your [wife|mother|husband|guardian|pet dragon] tells them “You should wear a coat today.” and then goes on their merry way, not checking whether they wore a coat or not. The next morning, they will get the same comment, and so on and so forth.

So how do we figure out what happened? Did the kid wear a coat or not? I suppose I could check by asking the kid and getting an answer, but what if there are 200 of us? Do I have time to ask every kid whether or not they ended up wearing a coat? I’m going to be spending a lot of time dealing with this simple problem, I can tell you now.

Chef has built-in functionality to report on what Chef did – after it has received its instructions from the centralized server. It’s called the “Exception and Report Handlers” mechanism – and this is how I tie these two technologies together.

I adapted some code started by Adam Jacob @Opscode, and extended it further into a complete RubyGem with modifications for content, functionality and some rigorous testing.
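The shape of such a handler is simple – a simplified sketch, not the actual gem’s code: subclass Chef::Handler, implement report, and inspect run_status:

require 'chef/handler'

class DatadogishHandler < Chef::Handler
  def report
    if run_status.success?
      # e.g. push metrics: run duration, number of updated resources
      puts "Chef run took #{run_status.elapsed_time}s, " \
           "updated #{run_status.updated_resources.length} resources"
    else
      # e.g. emit a higher-priority event carrying the failure details
      puts "Chef run failed: #{run_status.formatted_exception}"
    end
  end
end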

Once the gem was ready, I had to distribute it to my servers, and then have it execute every time Chef runs on each server. So, based on the chef_handler cookbook, I added a new recipe to the datadog cookbook – dd-handler.

What this does is add the necessary components to a Chef execution; when placed at the beginning of a “run”, it will capture all the events and report the important ones back to the Datadog newsfeed. It will also push some metrics, like how long the Chef execution took, how many resources were updated, etc.

The process for getting this done was really quite simple, once you boil down all the reading, hows and whys – especially if you use git to version control your chef-repo. The `knife cookbook site install` command is a great method for keeping your git repo “safe” for future releases, preserving your changes to the cookbook and allowing new code to be merged automatically. Read more here.

THE MOST IMPORTANT STUFF:

Here’s pretty much the process I used (under chef/knife version 0.10.x):

$ cd chef-repo
$ knife cookbook site install datadog
$ vi cookbooks/datadog/attributes/default.rb

At this point, I head over to Datadog, hit the “Setup” page, and grab my organization’s API Key, as well as create a new Application Key named “chef-handler” and copy the hash that is created.

I place these two values into the `attributes/default.rb` file, save and close.

$ knife cookbook upload datadog

This places the cookbook on my Chef server, and is now ready to be referenced by a node or role. I use roles, as it’s much more manageable across multiple nodes.

I update the `common-node` role we have to include “recipe[datadog::dd-handler]” as one of the first recipes to execute in the run list.

The common-node role applies to all of our systems, and since they all run chef, I want them all to report on their progress.
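The relevant part of the role ends up looking like this – a sketch; the base recipe stands in for everything else in our real run list:

# roles/common-node.rb -- the handler recipe goes first, so it is registered
# before the rest of the run executes and can capture the whole run.
name 'common-node'
description 'Applied to every node'
run_list(
  'recipe[datadog::dd-handler]',
  'recipe[base]'
)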

And then let it run.

END MOST IMPORTANT STUFF

Since our chef-client runs on a 30-minute interval, and not all nodes execute at the same time, this makes for some interesting graphs at the more recent time slices – not all the data comes in at once. That’s something to get used to.

Here’s an image of a system’s dashboard with only the Chef metrics:

Single Instance dashboard
It displays a 24-hour period, and shows that this particular instance had low variance in its execution time, and that not much was updated during this time (a good thing, since it is consistent).

On a test machine I tossed together, I created a failure, and here’s how it gets reported back to the newsfeed:

 

Testing a failure
As you can see, the stacktrace attempts to provide me with the information I need to diagnose and repair the issue. Once I fixed it and apache could start, the event was logged in the “Low Priority” section of the feed (since successes are expected, and failures are aberrant behavior):

Test passes

All this is well and wonderful, but what about a bunch of systems? Well, I grabbed a couple snaps off the production environment for you!

These are aggregates I created with the graphing language (which I had never really read before today!):

Production aggregate metrics

By being able to see the execution patterns, I noticed a bump closer to the left side of the “Resources Updated” graph. I investigated, and someone had deployed a new rsyslog package – so there was a temporary increase in deployed resources, and now there are slightly more resources to manage overall.

The purple bump seen in the “Execution Time” graph led me to investigate, and I found a timeout in that system’s call to an “apt-get update” request – probably the remote repo was unavailable for a minute. Having the data available to make that correlation made investigating this problem fast, easy, and simple – and since it has been succeeding ever since, there is no cause for alarm.

So now I have these two technologies – Chef to tell the kids (the servers) to wear coats, and Datadog to tell the parents (me) if the kids wore the coats or not, and why.

Really, just wear a coat. It’s cold out there.

———–

Tested on:

  • CentOS 5.7 (x64), Ruby 1.9.2-p180, Chef 0.10.4
  • Ubuntu 10.04 (x64), Ruby 1.8.7-p352, Chef 0.9.18

Road Tripping, Day 1

So Elyssa and I decided to go on a road trip.

More like Elyssa decided, and I agreed, but you see what I mean.

I got on a bus to meet her in NJ – barely making it by running after the bus as it had already pulled away from the station at Port Authority. I guess that since there were only 4 other people on the bus, the driver waited for me.

Arrived in NJ, got in the car, and headed inland. I ended up dozing for about 20 minutes or so, and then a huge billboard told us about Roadside America, and we decided we HAD to stop in and see it.

It was run by a couple who had to be a million years old, and was nice and quaint, and at some point we were told to sit down to experience the “presentation”. It was very much “God Bless America”, and heavy on the religion side of things.

Afterwards, we decided to sample the local fare at the Blue Mountain Restaurant. I have to say, the service was nice and friendly, and the food was average. And we were the youngest people there by about 200 years.

Back on the road, I asked the Book of the Faces to suggest things in Pittsburgh, as this was our first destination.

Got some good suggestions, with a lot of people telling me to sample Primanti Brothers – maybe forgetting that I’m vegetarian, but we tried anyways. They were closed, so instead we walked around, and ultimately found something delicious – the Bigelow Grille. Organic, delicious, many choices for vegetarians.

While there, I pulled out the laptop and found a hotel with Priceline – I had never used that before. Found a hotel really close for a low price, and drove right in for the night.

Sleep.

Fast and Furious Monitoring

In the past few weeks, I’ve been working with a company that is using ScoutApp‘s hosted monitoring service, which provides a nice interface to quickly get up and running with a lot of basic information about a system.

This SaaS solution, while a paid service, lets a team put their monitoring metrics into place with the fastest turnaround time, while scaling financially at a rate of ~$10/server/month.

Getting up and running is as simple as signing up for their risk-free 30-day trial, logging in to their interface, and following some simple instructions on installing their RubyGem plugin, aptly named scout, like so:

gem install scout

Obviously, this needs Ruby installed, which is pretty common in web development these days.

Executing the scout executable will then prompt you for a GUID, provided by the web interface when “Adding a new system”; this tests connectivity to the ScoutApp service, and “checks in”.

Once the new system is added, the scout gem needs to execute once a minute to check in with the server end. This is typically achieved by placing an entry in the crontab, and again, the instructions are provided in the most convenient location – on the command line, with variations for your system.
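The resulting crontab entry looks something like this (the GUID placeholder stands in for the per-server key from the web interface):

* * * * * /usr/bin/scout YOUR-SERVER-GUID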

Once installed in crontab, it’s pretty much “fire-and-forget” – which is probably the best feature available in any system.

Heading back to the web interface, you’ll see the system details, and the real advantage of the ScoutApp system – the plugins.

Each system starts with a bunch of the basics – server load, memory profiling, disk space. Great! 90% of problems manifest in variations in these metrics, so getting them on the board from the get-go is great.

The Plugin Directory covers a bunch of very commonly used applications from the FLOSS stacks popular in web development, so you can readily add a plugin of choice to the applicable server – adding a monitor to check your MySQL instance for slow queries is as simple as choosing the plugin, and the plugin actually tells you what you need to do to make it work, like changing a config file.

Once those pieces are in place, monitoring just keeps working. Plugins typically have some default triggers and alerts, based on “what makes sense” for that plugin.

There are currently 49 public plugins, which cover a wide range of services, applications, and monitoring methodologies, like checking a JMX counter or watching a log file for a condition you specify.

Extending functionality is pretty easy, as I found out firsthand. Beyond the succinct plugin development guide, the support team is very helpful, and all of the plugins are available as open source on GitHub.

Plugins are written in Ruby – also a popular language in the tech arena these days.

Since one of the many services in our software stack is Apache Zookeeper, and there was no plugin for this service, I set out to write my own, to accomplish:

  1. Get the state of a Zookeeper instance monitored (service up, some counters/metrics)
  2. Learn some Ruby
  3. Give back

I wrote the basics of a plugin, and testing it locally on a Zookeeper instance with Scout proved to be a very fast turnaround – I was getting results within a day, then thinking more about how I was doing it, refactoring, testing, and refactoring again.
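To show the shape of it, here’s a stripped-down sketch of the idea – not the published plugin – assuming Zookeeper’s ‘mntr’ four-letter command is reachable, and that the Scout agent supplies the Scout::Plugin base class at runtime:

require 'socket'

class ZookeeperMonitor < Scout::Plugin
  def build_report
    stats = {}
    TCPSocket.open(option(:host) || 'localhost', (option(:port) || 2181).to_i) do |sock|
      sock.write('mntr')                     # ask Zookeeper to dump its counters
      sock.read.each_line do |line|
        key, value = line.chomp.split("\t")
        stats[key] = value
      end
    end
    report(:avg_latency       => stats['zk_avg_latency'].to_i,
           :alive_connections => stats['zk_num_alive_connections'].to_i)
  rescue Errno::ECONNREFUSED
    alert('Zookeeper is not responding')     # service-down notification
  end
end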

I forked the ScoutApp GitHub repo, added my code, and issued a Pull Request, so they would take my code and incorporate it back into their Plugin Directory.

Lo and behold! It’s included, and anyone running both ScoutApp and using Zookeeper can simply add the plugin and get instant monitoring.

Here’s a screen capture of my plugin running, collecting details, and keeping us safe:

ScoutApp: Zookeeper

I encourage you to check it out, especially if you don’t have a monitoring solution, are starting a new project and have a few servers, or are looking for something else.