Fast and Furious Monitoring

In the past few weeks, I’ve been working with a company that is using ScoutApp’s hosted monitoring service, which provides a nice interface to quickly get up and running with a lot of basic information about a system.

This SaaS solution, while a paid service, gets a team’s monitoring metrics in place with the fastest possible turnaround, while scaling financially at a rate of ~$10/server/month.

Getting up and running is as simple as signing up for their risk-free 30-day trial, logging in to their interface, and following some simple instructions on installing their RubyGem plugin, aptly named scout, like so:

gem install scout

Obviously, this needs Ruby installed, which is pretty common in web development these days.

Running the scout executable will then prompt you for a GUID, which the web interface provides when “Adding a new system”. Scout then tests connectivity to the ScoutApp service, and “checks in”.

Once the new system is added, the scout gem needs to be executed once a minute to check in with the server end. This is typically achieved by placing an entry in the crontab, and again, the instructions are provided in the most convenient location, right on the command line, with variations for your system.

Once installed in crontab, it’s pretty much “fire-and-forget” – which is probably the best feature available in any system.

Heading back to the web interface, you’ll see the system details, and the real advantage of the ScoutApp system – the plugins.

Each system starts with a bunch of the basics – server load, memory profiling, disk space. Great! 90% of problems manifest in variations in these metrics, so getting them on the board from the get-go is great.

The Plugin Directory has a bunch of plugins for applications commonly used in the FLOSS stacks that are so popular in web development, so you can readily add a plugin of choice to the applicable server. Adding a monitor to check your MySQL instance for slow queries is simply a matter of choosing the plugin, and the plugin actually tells you what you need to do to make it work – like changing a config file.

Once those pieces are in place, monitoring just keeps working. Plugins typically have some default triggers and alerts, based on “what makes sense” for that plugin.

There are currently 49 public plugins, which cover a wide range of services, applications, and monitoring methodologies, like checking a JMX counter and watching a log file for a condition you specify.

Extending functionality is pretty easy, as I found out firsthand. Beyond having a succinct plugin development guide, the support team is very helpful, and all of the plugins are available as open source on GitHub.

Plugins are written in Ruby – also a popular language in the tech arena these days.

Since one of the many services in our software stack is Apache Zookeeper, and there was no plugin for this service, I set out to write my own, to accomplish:

  1. Get the state of a Zookeeper instance monitored (service up, some counters/metrics)
  2. Learn some Ruby
  3. Give back

I wrote the basics of a plugin, and testing it locally on a Zookeeper instance with Scout proved to be a very fast turnaround: I was getting results within a day, and then thinking more about how I was doing it, and refactoring, and testing, and refactoring again.
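For the curious, here is a rough sketch of the kind of check involved. To be clear, this is not the plugin that ended up in the Plugin Directory, and it doesn’t use Scout’s plugin API at all; it’s plain Ruby that talks to Zookeeper’s “four-letter word” commands (`ruok` and `mntr`) and pulls out a few counters. The host, port, and the particular metrics reported are my assumptions for illustration, and `mntr` is only available on Zookeeper 3.4 and later.

```ruby
require "socket"

# Send one of Zookeeper's four-letter commands (e.g. "ruok", "mntr")
# and return the raw text reply. Host and port defaults are assumptions.
def four_letter_word(command, host: "localhost", port: 2181)
  socket = TCPSocket.new(host, port)
  socket.write(command)
  socket.read # Zookeeper closes the connection after replying
ensure
  socket&.close
end

# Turn "mntr" output (one "key<TAB>value" pair per line) into a Hash,
# converting purely numeric values to Integers.
def parse_mntr(text)
  text.each_line.with_object({}) do |line, metrics|
    key, value = line.chomp.split("\t", 2)
    next if key.nil? || value.nil?

    metrics[key] = value.match?(/\A-?\d+\z/) ? value.to_i : value
  end
end

# Build a small report: is the service up, plus a few counters.
# The choice of metrics here is illustrative only.
def zookeeper_report(ruok_reply, mntr_reply)
  metrics = parse_mntr(mntr_reply)
  {
    up: ruok_reply.strip == "imok",
    avg_latency: metrics["zk_avg_latency"],
    outstanding_requests: metrics["zk_outstanding_requests"],
    mode: metrics["zk_server_state"]
  }
end
```

Against a live instance, something like `zookeeper_report(four_letter_word("ruok"), four_letter_word("mntr"))` would produce the report. Keeping the parsing separate from the socket call is what made the test-refactor-test loop quick for me, since the parser can be exercised with canned text.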

I forked the ScoutApp GitHub repo, added my code, and issued a Pull Request, so they would take my code and incorporate it back into their Plugin Directory.

Lo and behold! It’s included, and anyone running both ScoutApp and Zookeeper can simply add the plugin and get instant monitoring.

Here’s a screen capture of my plugin running, collecting details, and keeping us safe:

ScoutApp: Zookeeper

I encourage you to check it out, especially if you don’t have a monitoring solution, are starting a new project and have a few servers, or are looking for something else.

Sit on this, and logrotate!

Since a lot of what everyone does on those pesky devices called “comp-you-tars” is becoming increasingly more business-critical, and we’ve come to a point where a web company that has “one server that we all use” is going nowhere, we have piles of lovely silicon and metal, with electric pulses flowing through them to create the world as we see it today.

Server Room

I love these machines, as they have extended our abilities far beyond a single person, they have connected us in ways that our ancestors could only have imagined and written about in fiction, and they provide a central part of our everyday lives.

Developing complex systems has provided us with a challenge of building and maintaining large numbers of machines, and done correctly, a single person can easily control thousands, if not tens-of-thousands, of machines with a high degree of stability, confidence and grace.

Back in the olden days, systems were small, resource constraints were very much a real problem, and this provided developers the incentive, nay, the requirement, of knowing about their system and how to write efficient and clean code within the constraints.

As time goes by, each resource constraint is alleviated, for a while, by hardware manufacturers…

The day my Xbox died

So today I’m hanging around home, and figured I’d geek out a bit and play around with my home entertainment setup.

I have a Samsung 42″ plasma TV, great picture, connected via HDMI to my TimeWarnerCable HD-DVR box.

Also connected is my Xbox 360, via component, and I typically use that (when not playing games) to watch videos, stored on my Drobo, with the attached DroboShare running fuppes to front the files via UPnP.

And today, when I sat down to watch a film, I turned on the Xbox, and it froze. And then displayed the ominous Red Ring of Death. Damn.

Now I’ve submitted a repair request for this. Even though the console is out of warranty, M$ offers up to three years of coverage on this particular issue, and provides shipping and packaging for it all, so hopefully in a few days I’ll get their box and send my dear console back to them for repair.

This failure spurred me into wondering how I could watch my films, so I hooked my laptop’s video out and headphone out up to the TV, and saw that it works well. And then my roommate mentioned that I might want to hook up the mini-stereo system to the TV as well.

So I did. And the sound is pretty good compared to the internal speakers on the TV. They are ok, but the stereo speakers provide a much warmer sound, a fuller environment.

So now that there’s a new set of speakers involved, and given my eternal desire to not have five thousand remote controls around the house (I got a Logitech Harmony remote control a while back), I updated it to use the correct sequence and control the stereo volume.

So it’s all nicely playing together, all except the Xbox, which is dead. That led me to look into other multimedia solutions, like XBMC and Plex, both pretty good looking. So I might figure out some way to create that link sometime soon, so it’s a very pretty multimedia interface.

Keep rolling, rolling, rolling…

A while back I wrote about using Nagios as a monitoring system.

Since then, I’ve had need to have it deployed via a packaging system called RPM, and since no “stable” community editions are out there, I have the need to “roll my own” for distribution on our platforms.

I’ve never used RPM from the “packager” side before – and it’s both very cool and infuriating. It has all sorts of features and powerful macros, but debugging it isn’t a piece of cake at all.

If anyone has a great RPM tool out there that they want to recommend, let me know.

Monitor this.

A while back, we began investigating centralized monitoring tools for multiple systems, cross-platform, alerting, etc.

One contender was a package from MS, and a few others were tossed in the ring.

We did a proper match-up (or shootout, as I prefer) and tested a couple of candidates. While the all-inclusive MS offering is probably the best-functioning one, the cost is prohibitive for a monitoring tool – about $1500/host monitored.

The extensibility and ease of use are incomparable, but cost being a factor, we looked at another popular solution – Nagios.

Open source, modifiable – or should I say – Build Your Own – as it comes with some basic engine concepts, and then you pretty much have to build every single monitor you want to look at.

The result is a more targeted monitoring solution, inasmuch as it does exactly what you set it out to do – but absolutely no more.

The comparison showed this past week when I got an alert from my test MS instance about a SQL job running too long. With Nagios, I would have had to write some code, adapt it to monitor that specific job, and hope it could deal with exceptions I hadn’t thought of.

That’s a difference between a specialist in a particular field (e.g. DBA, mail admin, etc.) and the overall concept of a systems administrator – sometimes a jack-of-all-trades.

The MS offering is composed of “Management Packs” that are written by the developers of the systems that are being monitored – i.e. Exchange developers write the monitors for Exchange and so on, whereas in the Nagios monitoring world, you are expected to be able to figure out all of your own monitors/thresholds, etc.
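To give a feel for what “figure out your own monitors/thresholds” means in practice, here is a minimal sketch of the decision logic behind a long-running-job check. It follows the Nagios plugin convention of exit status 0 for OK, 1 for WARNING, 2 for CRITICAL and 3 for UNKNOWN, plus one line of text output. Everything else is an assumption on my part: the function name, the thresholds, and the idea that something else has already worked out how long the job has been running.

```ruby
# Nagios plugin convention: exit status 0 = OK, 1 = WARNING,
# 2 = CRITICAL, 3 = UNKNOWN, plus one line of text on stdout.
EXIT_CODES = { ok: 0, warning: 1, critical: 2, unknown: 3 }.freeze

# Classify how long a job has been running against two thresholds.
# runtime_minutes is nil when the runtime could not be determined
# (the "exceptions I hadn't thought of" case ends up as UNKNOWN).
# Returns [status_symbol, one_line_message].
def check_job_runtime(runtime_minutes, warn_after:, crit_after:)
  if runtime_minutes.nil?
    return [:unknown, "JOB UNKNOWN - could not determine runtime"]
  end

  status =
    if runtime_minutes >= crit_after
      :critical
    elsif runtime_minutes >= warn_after
      :warning
    else
      :ok
    end

  [status, "JOB #{status.to_s.upcase} - running for #{runtime_minutes} min"]
end
```

A real check script would fetch the runtime from the database, print the message, and finish with `exit EXIT_CODES.fetch(status)`. The thresholds (say, warn after 30 minutes, critical after 60) are exactly the part a Management Pack author decides for you and a Nagios admin has to work out alone.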

I guess it makes it a little more interesting in the long run, as building something from scratch allows you the familiarity of knowing the ins-and-outs of the systems, but it’s time consuming and the returns are not as immediately apparent.

But it’s affordable. And we’ve got the techie know-how to do it. So we do it.

If any readers have used Nagios, are interested in it, have advice, want advice, want to see what the color blue tastes like, let me know.

Who said that Granny Smith isn’t a good Apple?

Some of you may know that I don’t hold much love for Apple’s operating system.

It feels so clunky compared to my Windows-fu knowledge, and the change from one to the other is not at all simple.  I’d rather use Ubuntu, to be honest.

But here’s my current beef with Mac OSX – my machine is bound to Active Directory (in a corporate environment, they ALL should be!) and, like any good computer, looks for a Domain Controller after a reboot, to check your login credentials, apply any scripts, etc.

If it’s a mobile machine, typically you’ve set it up as a “mobile user account”, meaning that the machine is to cache your credentials, and in the absence of a DC, check the local cache and allow you to log in.

However, whenever MINE reboots, there’s about a half-hour delay before I can log in, and there’s no progress, cancel, notification, etc. as to WTF it is doing. Eventually, it might let me in. But in the meantime, time is a-wasting.

I finally got fed up enough to really research this, and it seems that there’s a way to fix it manually (in what all OSX users will vehemently insist is NOT a Registry!) by modifying the values of a few keys, to reduce the timeout wait. But you can only do that once you’ve logged on.

So I’m stuck using another machine until mine logs me in and lets me change it. What a waste of time.

Windows will time out within a minute and let you know why.

Grumble, grumble, grumble.

Battle of the OS

This is crazy, but it needs to be put out there.

I currently have:

1 HP Laptop, running Windows XP
1 MacBookPro, dual booting Vista and OSX Leopard, with Parallels to run XP and Ubuntu under OSX
1 EeePC, currently running Xandros (Eee mode), and soon to dual boot with XP and Ubuntu

Is this too many operating systems? I think it just might be.

What’s your preference, and why?

Learn from others what not to do

So a while ago, my friend David sent me a funny article (funny for us, not for the article’s subjects).

It showed what some brilliant SysAdmin had done at his company’s location, and how it backfired miserably.

Read it here: A “Priceless” Server Room: Priceless – Worse Than Failure

I hope you enjoy reading the article. I think I may have…

A little reflection

So this morning, on my way to work, I get an urgent phone call that something is wrong with the network – people can’t get to the file server, or get out to the Net, nada.

So naturally, I do what any average person would do. Whip out my Bluetooth ear piece and connect it so I can have my hands free. Then I grab my laptop – someone else was driving – and connect to the Net via a Bluetooth connection to the phone I’m talking on.

Once the net is up, I connect to my office VPN and check the status of the servers.

Servers seem ok, communicating nicely with each other and the world.

My guy on the other end of the phone is not a sysadmin, but is savvy enough to run some of the more advanced diagnostics, and it seems that the internal network computers are just not talking to the servers.

So I decide I’ll restart the server that hands out network access numbers (for the techies: DHCP server) and see if that does anything good for the universe.

No dice.

By now, I’ve pretty much arrived at the office and begin the point-to-point diagnosis of where it seems the failure is. I see that the servers, on their own switch, seem to be happily chattering away, but anything connected to that switch is not happy.

Power cycle the switch.

Nope. Not yet. So I power cycle the other switches that are connected to that switch, effectively disconnecting anyone connected to anything anywhere.

Still no effect.

Finally, I decide that I’m going to swap a cable – a simple, 12″ cable – that connects one switch to the other.

Voila!

So everything works; I power cycle the switches again, and all is well. What a great start to an even greater day.

This leads me to the following revelation: I am a hero.

Not a super-hero, just a hero. Batman was a hero thanks to his gadgets and capability to use them. I have some of that under my (non-utility) belt.

So there.