Ask your systems: “What’s going on?”

This is a sysadmin/devops-style post.
Disclaimer: I work with these tools and people, and I like what they do.

For some part of our professional lives, we are tasked with bringing order to chaos, keeping systems running, and helping the businesses we work for continue functioning.

In these days of large-scale computing, explosive growth in web technology, multiple datacenter deployments, cloud providers and other virtualization technologies, the manpower needed to handle the sheer number of technologies, services and systems seems to carry a pretty high overhead cost. “You’ve got X amount of servers? Let’s hire Y amount of sysadmins!”

A lot of tech startups start out with some of the developers performing a lot of the systems tasks, and since this isn’t always their core expertise, decisions are made, scripts are written, and “it works”.  When the team and systems grow large enough to need their own handler, in walks a sysadmin-style person, who may keel over at the state of affairs.

Yes, there are many tech companies where this is not the case, and I commend them for keeping their systems lean, mean and clean.

A lot of companies have figured out that in order to make the X:Y ratio work well, automation is required.  Here’s an article that covers some numbers from earlier this year.  I find the stated ratio of 50 servers to 1 sysadmin pretty low compared with what is possible, especially given the tools that we have available to us.

One of the popular systems configuration tools I’ve been using heavily is Chef, from Opscode. They provide a hosted solution, as well as an open-source version of their software, for anyone to use.  Getting up and running with some basics is really fast, and there’s a ton of information available, as well as a really responsive community (from mailing lists, bug tracker site and IRC channel).  Once you’re working with Chef, you may wonder how you ever got anything done before you had it.  It’s really treating a large part of your infrastructure as code – something readable, executable, and repeatable.

But this isn’t about getting started with Chef. It’s about “what’s next”.

In any decent starting-out tech company, the number of servers used will typically range from 2-3 all the way to 200 – or even more.  If you’ve gone all the way to 200 without something like Chef or Puppet, I commend your efforts, and feel somewhat sorry for you.  Once you’re automating your systems creation, deployment and change, you typically want some feedback on what’s going on. Did what I asked this system to do succeed, or did it fail?

Enter Datadog.

Datadog attempts to bring many sources of information together, to help whoever is supposed to be looking at the systems make more sense of the situation. It collects metrics from systems and events from services and other sources, and presents them in a timeline and newsfeed that is very human-friendly.

Having all the data at your disposal makes it easier to find patterns and correlations between events, systems and behaviors – helping to minimize the “what just happened?” question.

The Chef model for managing systems is a centralized server (either the open-source version in your own environment or the hosted service from Opscode), which tells a server what it is meant to “be”.  Not what it is meant to “do now”, but the final state it should be in.  They call this model “idempotent” – meaning that no matter how many times you execute the same code on the same server, the result should end up the same every time.  But it doesn’t follow up very much on the results of the actions.
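To make “idempotent” a bit more concrete, here is a minimal plain-Ruby sketch. It is not Chef code: the `ensure_line` helper, the `demo_run_twice` wrapper and the file name are made up for illustration. The idea is to describe the desired end state and only act when the current state differs, so running it twice leaves things exactly as running it once would.

```ruby
require "tmpdir"

# Hypothetical helper, not part of Chef: make sure `line` is present in the
# file at `path`. Returns true if the file had to be changed, false if it was
# already in the desired state.
def ensure_line(path, line)
  existing = File.exist?(path) ? File.readlines(path, chomp: true) : []
  return false if existing.include?(line)

  File.open(path, "a") { |f| f.puts(line) }
  true
end

# Apply the same desired state twice against a scratch file and report
# what each pass did.
def demo_run_twice
  Dir.mktmpdir do |dir|
    path = File.join(dir, "motd")
    first  = ensure_line(path, "wear a coat") # line is missing, so it gets added
    second = ensure_line(path, "wear a coat") # already converged, nothing to do
    { first: first, second: second, contents: File.read(path) }
  end
end
```

Calling `demo_run_twice` shows the second pass making no change, which is the property that makes it safe to run the same instructions on a schedule.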

An analogy could be that every morning, before your kid leaves the house, your [wife|mother|husband|guardian|pet dragon] tells them “You should wear a coat today.” and then goes on their merry way, not checking whether they wore a coat or not. The next morning, they will get the same comment, and so on and so forth.

So how do we figure out what happened? Did the kid wear a coat or not? I suppose I could check by asking the kid and get the answer, but what if there are 200 of them? Do I have time to ask every kid whether or not they ended up wearing a coat? I’m going to be spending a lot of time dealing with this simple problem, I can tell you now.

Chef has built-in functionality to report on what Chef did – after it has received its instructions from the centralized server. It’s called “Exception and Report Handlers” – and this is how I tie these two technologies together.

I adapted some code started by Adam Jacob @Opscode, and extended it further into a complete RubyGem with modifications for content, functionality and some rigorous testing.

Once the gem was ready, I had to distribute it to my servers, and then have it execute every time Chef runs on each server. So, based on the chef_handler cookbook, I added a new recipe to the datadog cookbook – dd-handler.

This adds the necessary components to a Chef execution, and when placed at the beginning of a “run”, will capture all the events and report back on the important ones to the Datadog newsfeed.  It will also push some metrics, like how long the Chef execution took, how many resources were updated, etc.
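As a rough illustration of what a handler like this does at the end of a run, here is a plain-Ruby sketch. It is not the dd-handler code and it talks to neither Chef nor Datadog: `RunSummary` is a stand-in Struct whose fields loosely mirror the kind of information Chef hands to a report handler, `build_report` is a hypothetical helper, and the metric names are invented for this example.

```ruby
# Stand-in for the run information a report handler receives.
# Field names are assumptions chosen for this sketch.
RunSummary = Struct.new(:node_name, :elapsed_time, :updated_resources,
                        :total_resources, :exception, keyword_init: true) do
  def success?
    exception.nil?
  end
end

# Hypothetical helper: turn one run into a newsfeed-style event plus metrics.
def build_report(run)
  status = run.success? ? "completed" : "failed"
  event = {
    title: "Chef #{status} on #{run.node_name} in #{run.elapsed_time.round(1)}s",
    # Successes are expected, so they are filed as low priority; failures stand out.
    priority: run.success? ? "low" : "normal",
    text: run.success? ? "" : "#{run.exception.class}: #{run.exception.message}"
  }
  metrics = {
    "chef.run.elapsed"       => run.elapsed_time,
    "chef.resources.updated" => run.updated_resources,
    "chef.resources.total"   => run.total_resources
  }
  { event: event, metrics: metrics }
end

# Example inputs: one clean run and one that raised an exception.
ok_run = RunSummary.new(node_name: "web01", elapsed_time: 42.317,
                        updated_resources: 2, total_resources: 118,
                        exception: nil)
failed_run = RunSummary.new(node_name: "web02", elapsed_time: 9.5,
                            updated_resources: 0, total_resources: 118,
                            exception: RuntimeError.new("apache2 failed to start"))
```

Passing `ok_run` or `failed_run` to `build_report` returns a Hash with an `:event` and a `:metrics` entry; in a real handler those would be sent to the monitoring service rather than returned.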

The process for getting this done was really quite simple, once you boil down all the reading, hows and whys – especially if you use git to version control your chef-repo.  The `knife cookbook site install` command is a great method for keeping your git repo “safe” for future releases: it preserves your changes to the cookbook and allows new code to be merged in automatically. Read more here.

THE MOST IMPORTANT STUFF:

Here’s pretty much the process I used (under chef/knife version 0.10.x):

$ cd chef-repo
$ knife cookbook site install datadog
$ vi cookbooks/datadog/attributes/default.rb

At this point, I head over to Datadog, hit the “Setup” page, and grab my organization’s API Key, as well as create a new Application Key named “chef-handler” and copy the hash that is created.

I place these two values into the `attributes/default.rb` file, save and close.

$ knife cookbook upload datadog

This places the cookbook on my Chef server, where it is now ready to be referenced by a node or role. I use roles, as it’s much more manageable across multiple nodes.

I update the `common-node` role we have to include “recipe[datadog::dd-handler]” as one of the first recipes to execute in the run list.

The common-node role applies to all of our systems, and since they all run chef, I want them all to report on their progress.

And then let it run.

END MOST IMPORTANT STUFF

Since our chef-client runs on a 30 minute interval, and not all execute at the same time, this makes for some interesting graphs at the more recent time slices – not all the data comes in at the same time.  That’s something to get used to.

Here’s an image of a system’s dashboard with only the Chef metrics:

Single Instance dashboard
It displays a 24-hour period, and shows that this particular instance had a low variance in its execution time, and that not much was being updated during this time (a good thing, since it is consistent).

On a test machine I tossed together, I created a failure, and here’s how it gets reported back to the newsfeed:

 

Testing a failure
As you can see, the stacktrace attempts to provide me with the information I need to diagnose and repair the issue. Once I fixed it, and apache could start, the resulting event was logged in the “Low Priority” section of the feed (since successes are expected, and failures are aberrant behavior):

Test passes

All this is well and wonderful, but what about a bunch of systems? Well, I grabbed a couple snaps off the production environment for you!

These are aggregates I created with the graphing language (had never really read it before today!)

Production aggregate metrics

Being able to see the execution patterns, I noticed a bump closer to the left side of the “Resource Updated” graph. I investigated, and found that someone had deployed a new rsyslog package – so there was a temporary increase in deploying the resources, and now there are slightly more resources to manage overall.

The purple bump seen in the “Execution Time” graph led me to investigate, and I found a timeout in that system’s call to an “apt-get update” request – probably the remote repo was unavailable for a minute. Having the data available to make that correlation made investigating this problem fast, easy, and simple. More importantly, since the run has been succeeding ever since, there is no cause for alarm.

So now I have these two technologies – Chef to tell the kids (the servers) to wear coats, and Datadog to tell the parents (me) if the kids wore the coats or not, and why.

Really, just wear a coat. It’s cold out there.

———–

Tested on:

  • CentOS 5.7 (x64), Ruby 1.9.2 v180, Chef 0.10.4
  • Ubuntu 10.04 (x64), Ruby 1.8.7 v352, Chef 0.9.18
Used:

Road Tripping, Day 1

So Elyssa and I decided to go on a road trip.

More like Elyssa decided, and I agreed, but you see what I mean.

I caught a bus to meet her in NJ, but only just: I had to run out after it, as it had already pulled away from the station at Port Authority. I guess since there were 4 other people on the bus, the driver waited for me.

Arrived in NJ, got in the car, and headed inland. I ended up dozing for about 20 minutes or so, and then a huge billboard told us about Roadside America, and we decided we HAD to stop in and see it.

It was run by a couple that had to be a million years old, and was nice and quaint, and we were told to sit down at some point to experience the “presentation”. It was very much “God Bless America”, and heavy on the religion-side of things.

Afterwards, decided to sample the local fare at Blue Mountain Restaurant. I have to say, the service was nice and friendly, the food was average. And we were the youngest people there by about 200 years.

Back on the road, I asked the Book of the Faces to suggest things in Pittsburgh, as this was our first destination.

Got some good suggestions, with a lot of people telling me to sample Primanti Brothers. Maybe they forgot that I’m vegetarian, but we tried anyways. They were closed, so instead we walked around, and ultimately found something delicious – the Bigelow Grille. Organic, delicious, many choices for vegetarians.

While there, pulled out the laptop and found a hotel with Priceline – never had used that before. Found a hotel really close for a low price, and drove right in for the night.

Sleep.

Fast and Furious Monitoring

In the past few weeks, I’ve been working with a company that is using ScoutApp’s hosted monitoring service, which provides a nice interface to quickly get up and running with a lot of basic information about a system.

This SaaS solution, while a paid service, lets a team get their monitoring metrics in place with a very fast turnaround, and the cost scales at a rate of ~$10/server/month.

Getting up and running is as simple as signing up for their risk-free 30-day trial, logging in to their interface, and following some simple instructions on installing their RubyGem plugin, aptly named scout, like so:

gem install scout

Obviously, this needs Ruby installed, which is pretty common in web development these days.

Executing the scout executable will then prompt you for a GUID, provided from the web interface when “Adding a new system”, which tests connectivity to the ScoutApp service, and “checks in”.

Once the new system is added, the scout gem needs to be executed once a minute to check in with the server end. This is typically achieved by placing an entry in the crontab, and again, the instructions are provided right where you need them, on the command line, with variations for your system.

Once installed in crontab, it’s pretty much “fire-and-forget” – which is probably the best feature available in any system.

Heading back to the web interface, you’ll see the system details, and the real advantage of the ScoutApp system – the plugins.

Each system starts with a bunch of the basics – server load, memory profiling, disk space. Great! 90% of problems manifest in variations in these metrics, so getting them on the board from the get-go is great.

The Plugin Directory has plugins for a bunch of the applications commonly found in the FLOSS stacks that are so popular in web development, so you can readily add a plugin of your choice to the applicable server. Adding a monitor to check your MySQL instance for slow queries is simply a matter of choosing the plugin, and the plugin actually tells you what you need to do to make it work – like changing a config file.

Once those pieces are in place, monitoring just keeps working. Plugins typically have some default triggers and alerts, based on “what makes sense” for that plugin.

There are currently 49 public plugins, which cover a wide range of services, applications, and monitoring methodologies, like checking a JMX counter or watching a log file for a condition you specify.

Extending functionality is pretty easy, as I found out firsthand. Beyond the succinct plugin development guide, the support team is very helpful, and all of the plugins are available as open source on GitHub.

Plugins are written in Ruby – also a popular language in the tech arena these days.

Since one of the many services in our software stack is Apache Zookeeper, and there was no plugin for this service, I set out to write my own, to accomplish:

  1. Get the state of a Zookeeper instance monitored (service up, some counters/metrics)
  2. Learn some Ruby
  3. Give back

I wrote the basics of a plugin, and testing it locally on a Zookeeper instance with Scout proved to be a very fast turnaround. I was getting results within a day, and then thinking more about how I was doing it, and refactoring, and testing, and refactoring again.
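For a sense of what a check like this has to do, here is a small standalone Ruby sketch. It is not the plugin that ended up in the Plugin Directory and it does not use the Scout plugin API. It assumes a Zookeeper version that supports the four-letter `mntr` command, which replies with tab-separated key/value lines; `fetch_mntr` and `parse_mntr` are hypothetical helper names, and the host and port defaults are assumptions.

```ruby
require "socket"

# Ask a Zookeeper server for its monitoring counters via the "mntr"
# four-letter command. Host and port defaults are assumptions for this sketch.
def fetch_mntr(host = "localhost", port = 2181)
  TCPSocket.open(host, port) do |sock|
    sock.write("mntr")
    sock.read # the server closes the connection after replying
  end
end

# Parse "key<TAB>value" lines into a Hash, converting integer-looking values.
def parse_mntr(text)
  text.each_line.with_object({}) do |line, stats|
    key, value = line.chomp.split("\t", 2)
    next if key.nil? || value.nil? # skip blank or malformed lines

    stats[key] = value.match?(/\A-?\d+\z/) ? value.to_i : value
  end
end
```

With a server running, `parse_mntr(fetch_mntr)` gives a Hash of counters that a plugin could report on each check-in; keeping the parsing separate from the socket call makes it easy to test against canned output.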

I forked the ScoutApp GitHub repo, added my code, and issued a Pull Request, so they would take my code and incorporate it back into their Plugin Directory.

Lo and behold! It’s included, and anyone using both ScoutApp and Zookeeper can simply add the plugin and get instant monitoring.

Here’s a screen capture of my plugin running, collecting details, and keeping us safe:

ScoutApp: Zookeeper

I encourage you to check it out, especially if you don’t have a monitoring solution, are starting a new project and have a few servers, or are looking for something else.

Verizon Web Site fail.

Need I say more?

General Info
Chat start time  Aug 3, 2011 9:44:43 PM EST
Chat end time  Aug 3, 2011 10:37:08 PM EST
Duration (actual chatting time)  00:52:25
Operator  Amber

 

Chat Transcript
info: Please hold for a Verizon Wireless sales representative to assist you with your order.  Thank you for your patience.
info: You are now chatting with ‘Amber’
Amber: Hello. Thank you for visiting our chat service.  May I help you with your order today?
Mike: yes, please
Amber: How may I?
Mike: I am curious to know about the mifi mobile hotspot
Amber: I would love to help!
Amber: It’s just like the WiFi signal you would have at your home or at like McDonalds!
Mike: cool. 
Mike: What is the monthly charge?
Amber: Totally!
Amber: We have a couple options. Let’s get the device in your cart to see the options in your specific area!
Mike: which one would be better, the mifi or the samsung?
Amber: Their the same thing, but different makers.
Mike: I understand that the mifi can’t charge via usb and be a hotspot at the same time
Amber: It can. It can. You can use it as both USB and WiFi. You can also charge it via USB or via the home charger.
Mike: that’s contrary to what I am reading online
Amber: Where are you reading it?
Mike: comparison reviews
Mike: “Unfortunately, you can’t charge the 4510L over a USB connection to a notebook and broadcast a Wi-Fi signal at the same time like you can with the Samsung SCH-LC11 “
Mike: the review was in May of this year
Amber: Reviews aren’t always right.
Amber: Most the time their wrong.
Mike: That’s not entirely true.
Amber: That is unfortunately true. Most people who do reviews are usually the people who don’t know how to use the device or have had issues. All the devices have a 2% chance of having an issue. Tvs, microwaves and cars have more of a chance of a manufacturer’s defect.
Mike: interesting.
Mike: ok, let’s proceed.
Amber: Ok. Do you have the device in your cart?
Mike: trying to add
Amber: If you click on the name of it, you should have an add to cart button.
Mike: why would I want text messaging for this device?
Mike: I am required to select a text messaging pay as you go plan for this device
Amber: It’s a just in case. It will have a phone number.
Amber: Pay as you go would be the best option.
Mike: it’s the ONLY option.
Amber: There used to be a 5 dollar option.
Mike: and the shopping keeps crashing my browser
Amber: I’m sorry, they changed things recently. I’m sorry that it keeps crashing. What error do you get?
Mike: it kills the browser window – and a hard crash
Mike: I’m trying a different browser
Amber: Strange.
Mike: it’s almost impossible to buy this online
Mike: “The selection you made is unavailable at this time”
Amber: That’s a bad error. We’ve been having some issues today. May I have your phone number to report this issue?
Mike: XXX-XXX-XXXX
Amber: Thank you.
Mike: I even tried selecting the 10gb option – same message
Amber: I know if you keep trying it eventually works.
Mike: this is not very inspiring
Mike: I don’t feel like I’m going to “Rule the Air” – rather limp along and beg
Amber: I’m sorry! Their fixing things.
Amber: It will be better when it comes to the device. It’s the website that’s having issues and being fixed.
Mike: how do I put faith in a device provided by a network that can’t let me purchase it on their own website?
Amber: We haven’t had issues like this in a very long time. Every website has issues. It is not a reflection on our devices or our services. If say Facebook is down, and that happends a good amount of time because of the growth of it’s accounts, you’re not going to delete your Facebook account are you?
Mike: Facebook has never been down – that’s their priority, because that is what they provide.
Mike: I’m a systems engineer – my job is to make sure sites never go down.
Amber: I apologize, that is incorrect. I am a Facebook user and the site has been down.
Amber: But we’re not here to discuss websites, we’re here to order you the HotSpot.
Mike: This is what facebook provides when they have an outage: http://www.facebook.com/note.php?note_id=431441338919
Mike: Where’s the Verizon note/blog post on why 
Mike: I can’t buy a device?
Amber: And with the phone number you have provided I have reported it. It’s also known and being worked on. We have fixed the main problems and now we’re continuing to fix other issues. Earlier no one could log into MyVerizon. That was a priority. And I apologize but we do not post on blogs about the site issues because that is internal work. You can buy a device, patience is needed for this process tonight. I’m sorry!
Mike: Well, I’m trying repetitively. It’s still failing.
Amber: It will work. I know it will.
Mike: I am literally on another vendor’s site right now, to see their competitive options
Amber: When you buy from third party companies they make you sign their own terms and conditions. That can effect you from return policies, to changing plans to early termination fees. Also feature changes and what phones you’re able to get.
Mike: of course it does. that’s absolutely no different than what Verizon provides
Mike: “Third Party” – they are a vendor, just like you.
Amber: I’m sorry that is incorrect.
Mike: how are you not a vendor?
Mike: how is anyone else not a vendor?
Amber: We are a vendor. But we are direct. Not third party.
Mike: Direct? Third Party? Direct to whom? Third party of whom?
Amber: Verizon Wireless lets other companied to sell our products and services. They do have the right to change most options. If you go to a direct Verizon store or VerizonWireless.com that is Verizon selling Verizon products. If you buy from say WireFly or BestBuy they are selling our products.
Mike: ah yes – but I’m not looking at a device to connect to verizon – rather an entire other network.
Mike: a competitor, not a third party distributor of your products
Amber: Oh. I understand your wording choice now. I’m sorry. The reason their plans are less, is they have a lower coverage quality and area. We have a huge service area and we have the fastest coverage area as well with 3G. We’re working on getting 10X faster in all places with 4G soon!
Mike: I choose my words with care. while that may or may not be the case – depending on whose marketing team you choose to believe – they are still allowing me to purchase their product online.
Amber: They are not currently having issues as far as I know. And I don’t use their services whom ever they are so I wouldn’t know personally if they do have issues ever. This conversation isn’t going in any direction. And I apologize for that. Would you like to get one of our devices?
Mike: I am absolutely trying to get one of your devices. 
Mike: that’s the entire purpose of this conversation.
Mike: and it should not be this difficult.
Amber: Let’s try deleting your cookies, that may help it.
Mike: sure thing.
Amber: Thank you. Please tell me when you have finished that.
Mike: cookies are gone
Mike: signing back into myverizon
Mike: fail.
Mike: again.
Mike: ah well.
Amber: Please try.
Mike: “The selection you made is unavailable at this time Browse one of the options below to choose a different plan or phone, or call (800) 2-JOIN-IN for assistance.”
Mike: nope.
Mike: well, I guess this makes my decision pretty easy, doesn’t it.
Amber: It may be easier and less frustrating for you to call customer service. The phone number to customer service is1-800-922-0204 and hours of operation are 6AM- 11PM EST. Also you can contact them via *611 from your handset. You may also call 1-800-2 JOIN IN
Mike: I thought this was customer service.
Mike: I must be mistaken.
Amber: There is over the phone sales, 1-800-2 JOIN IN and there is customer service 1-800-922-0204 or *611 from your Verizon cell phone.
Mike: Yes, I see what their number are. But the failure to provide service via methods clearly advertised is not very inspiring.
Mike: I’m curious – how many service calls are dealing with failed purchases?
Amber: I’m not sure. Also I’m sorry for the inconvenience. Is there anything else I can do for you?
Mike: No, that’s it. I only wanted to order a mobile hotspot, commit to a two year contract and give you more of my money.
Mike: But I can’t do that.
Amber: I wish I could fix the website for you. Though I’m unable to do so. Thank you for chatting with Verizon Wireless online sales. Have a great evening!

Sit on this, and logrotate!

Since a lot of what everyone does on those pesky devices called “comp-you-tars” is becoming increasingly more business-critical, and we’ve come to a point where a web company that has “one server that we all use” is going nowhere, we have piles of lovely silicon and metal, with electric pulses flowing through them to create the world as we see it today.

Server Room

I love these machines, as they have extended our abilities far beyond a single person, they have connected us in ways that our ancestors could only have imagined and written about in fiction, and they provide a central part of our everyday lives.

Developing complex systems has given us the challenge of building and maintaining large numbers of machines, and done correctly, a single person can easily control thousands, if not tens of thousands, of machines with a high degree of stability, confidence and grace.

Back in the olden days, systems were small, resource constraints were very much a real problem, and this provided developers the incentive, nay, the requirement, of knowing about their system and how to write efficient and clean code within the constraints.

As time goes by, each resource constraint is alleviated, for a while, by hardware manufacturers …

The beat keeps moving on.

As I sit here in another airport waiting area, I again realize the futility of airport security.

The entire TSA was probably created to give people jobs and create a semblance of security.

This time, I got into the line at LAX – United that had the scary body scanner.
I stood in two blue rectangles and placed my hands above my head, not unlike a prisoner, about to be shackled and tortured. In defiance, I stuck my tongue out during this process.

After I got through, thinking I was in the clear, the TSA dude tells me that he has to pat me down anyway. WTF. Sigh.

And then he “handles” me.

I feel so secure right now. Ugh.

Thanks, but no thanks, Verizon!

I guess Verizon thinks they know what’s best for me.

Recently got a nice little Verizon USB 760 Modem from work, not a new concept for me, just something to keep in touch while on the go.

Unfortunately, the Verizon Access Manager software most decidedly does NOT install correctly on my Mac. Instead, it tells me I’m not an administrator. Feels a lot like trying to install software on Windows Vista.

See the failure, how pretty it is...


You call this security? You’ve got to be kidding me

I just got off the phone with PayPal’s customer service department.

The reason I was on the phone in the first place – because you probably know how much I absolutely love talking to customer service representatives – is that I was trying to be a good Netizen.

I received a couple of emails – not just one – originating from PayPal’s service for password resets. This is not a foreign thing, especially for someone who has his email address all over the web. It’s usually some scripting hacker trying to get access to my stuff.

The problem is that at the bottom of the email, there is this section: …