The Null Terminator

Ethan Ram’s geeky blog on the seam of technology and product management.

OPs / Production services review: Nagios, Kiwi Syslog and Limelight CDN

A Review of Services I’ve used in GameGround – Part IV 

This is the 4th part in a series of blog posts reviewing several 3rd pty products and services I’ve used in GameGround and my take on them. The basic approach I’m taking here is the applicability of the product for a lean-startup that wants to move fast. In the last post I wrote about Analytics and BI Reporting tools for the marketing team. This post is about Monitoring the health of the system server – for the OPS team. Next in the series – development infrastructure.

Nagios

Nagios probably is “The Industry Standard in IT Infrastructure Monitoring” as their slogan says. It’s very popular among IT stuff and can be configured to monitor and alerts about up to 40-50 servers. So even a medium size company can use it. It’s free server software – basically a scheduler that executes service checks against installed agents and tests against network devices, reports back the results and raises alerts above predefined thresholds. There’s also a comprehensive list of extensions or plugins written by the community that can be utilized to monitor about anything you’ll ever want.

It’s easy to setup Nagios to watch for server disk-space, CPU and the existence of certain services. The difficult part is to create checks that would alert you if internal parts of the software behave irrational and users are not seeing what they should. E.g. certain transactions do not end in time, server response time for certain requests is going up, users suddenly cannot see their friend’s list etc. These are much harder to watch. To monitor these you’ll need to write code both on your back-end servers – special functions (REST/WSDL)  that would do some internal testing and return true/false accordingly. Nagios is able to call such functions periodically and alert if they failed. It’s an evolving process: You’ll see your systm fail without Nagios alerting about it and then add more of those checks till it functions well.

So- It’s wiser to add some testing functionality on design time: plan your server modules to have Nagios testing APIs. You’ll also need to watch that some of your 3rd pty providers are working right: If the A/B testing API you are using is down then your site is probably down too. If your Content Delivery Network is down ppl are not getting to see your website, although everything is functioning on your side.

Nagios – the CONS:

  1. Nagios was started in 1999 and is written in Perl. Although new versions have been released it fills like an old product and things seem harder to achieve than what we’re used to these days. Most of the checks have to be configured annually in config files including thresholds for alerts, the amount of times a failure does not raise an alert etc.
  2. To achieve the functional testing mentioned above and to integrate with all the plugins for the different OS types and monitoring you’d want to do you’ll need to code some scripts on Nagios side (or at least edit existing scripts). That’s Perl coding. Although Perl knowledge is still quite common it’s fast diminishing from the planet. It’s already hard to find IT managers who can code it and younger developers haven’t even heard of it… We ended up getting outside help to create the basic setup. That much about free software…
  3. The learning curve is long. Expect the system to text-message you false alarms at 2am, telling you the system is down, for a few months, until you get the thresholds right. Expect your CEO to call you at 2am to tell you the system is down but there was no alert… Lots of things went wrong in a live environment of (only) 4 servers we had in production in GameGround – it took us about 3-4 months to get to a relatively solid Nagios setup that actually alerted us on most of the real problems.
  4. Some of the things to watch for ened-up being certain errors written into the different servers’ log files. These may be critical bugs and exceptions thrown from bad things happening  down in your code stacks. So you can set up Nagios to grep the log files for those strings. This is very heavy on your servers and on the traffic. Better have a proper central log server with alerts (see below). But then this actually means that you’re going to have 2 monitoring systems – one is the Nagios and one in the central logging server.
  5. Nagios is good in giving you a green or red sign next to your servers/services. But in reality managers want to know ahead of time that things are going in the wrong direction: queues are not emptying fast enough, response time on some requests are mounting. Nagios is no good for those tasks. You cannot use it to create graphs and its dashboard is not flexible.
  6. You have to manually define each server and each service you want to monitor. This does not work for cloud-based environment where adding a server instance is  done in a click, or even automatically.

I don’t know of a good alternative. But I would like to see something that combines system health alerts with Syslog analysis and a real-time configurable dashboard. Any ideas?

Kiwi Syslog

If you want to have a good insight into what’s actually happening in your servers you must check the different servers’ logs. Getting all the logs from all the servers into one place and automating the search for errors, exceptions and irregularities is key to having a healthy working production environment. First product we checked following warm recommendations from friends was Splunk. It has excellent easy-to-use web-interface and the setup is very easy (assuming that your servers are written and configured to upload syslog/log4 to a central server…). But Splunk is VERY expensive, even for a small server setup like ours they asked for something like $6000/year. The free version is only good for internal testing and running on-top of QA systems. For production you’ll need the enterprise version. It does not make sense to pay that much in a startup… So we checked Kiwi Syslog.

Kiwi Syslog is a relatively small piece of software made by a NZ company. Their main interface is based on a Windows installed client. But they now also have a web-based dashboard that gives you the most important features. It’s easy to setup and work with. It’s cool. And it costs like 2% of Splunk’s cost. Go Kiwi Syslog! Go!

Limelight CDN

Working with a Content Delivery Network is an important factor in speeding your pages loading time. When we tested before-and-after we saw a dramatic decrease of first-time page load from 3-4 seconds to 2-2.5 seconds for US-based users. With later widgets and pages the load time was about 30% faster. This is a lot! The other reason you’d like to have a CDN is that it’s going to take a large percentage of the traffic from your servers – so you’ll end up having less servers and pay less on traffic.

The basic service a CDN offers is the speeding up of static content (Imgs, CSS, JS files) delivery. The advanced services CDNs offer are media streaming and something called Whole Site Delivery – out of scope for this blog post. For the small site/service you’re going to pay $1000-$2000/month for the basic CDN – it may not be too bad considering the reduced costs on servers and traffic.

If you know you’re going to use a CDN you can write your code and delivery procedures in a way that starting to use a CDN would just be a flip of a config file entry. If you already have a website/service functioning without a CDN you’ll probably need to do some work to separate and version the static files correctly and add proper configuration everywhere. So, with the right design you should be able to integrate with a CDN, change CDN or stop working with a CDN in a matter of minutes.

So the story goes like this: We decided we had to have a CDN because every millisecond of page load time is critical. This was before launching our initial service. We went shopping and were surprised – it seems that most of the bigger CDNs were not willing to work with us at this stage at all. Even the local rep of the local Cotendo (a startup sharing a VC with GameGround) never returned a phone call… Luckily the local rep of Limelight was willing to take the deal and after a couple of weeks on negotiations we switched NO the config and it was working well (we did have a couple of config issues – minor faults on our side)

Q: Should a small lean-startup deploy a CDN as part of their initial release?

A: NO NO NO. It’s expensive and the signing up with the local representative of a CDN will consume too much of your time.

Q: Should a lean-startup write their code with a CDN in mind?

A: Yes! Sure! This will allow you to speed up your site and offload traffic if and when your site/service is showing some signs of success. Coding with a CDN in mind won’t make it slower anyway.

Q: Can you give some hints on how to design it right to work with a CDN?

A: I promise to have a post about it later on…  << but if you have a specific Q – ask it in a comment below

Q: Are there no free/cheap alternatives?

A: There are! Check out this post about using Google Apps Engine as a free static data CDN. Also – this post about using DropBox as a free CDN solution. Note that if the delivery of the resources from those unofficial-CDNs is not faster than delivering them from your own site then adding a CDN configuration might actually slow down your site. Be ware!

Advertisements

4 responses to “OPs / Production services review: Nagios, Kiwi Syslog and Limelight CDN

  1. Liza 2012 Sep 12 at 14:59

    I find Nagios hard to understand, maybe because I’m not really a programmer. But I see that many people are using it. So I think it will be beneficial for me to start learning it.

  2. http://yahoo.com 2013 Feb 11 at 12:45

    I Think that posting, “OPs / Production services review: Nagios,
    Kiwi Syslog and Limelight CDN The Null Terminator” ended up being spot on!
    I personallycouldn’t see eye to eye along with u more! Finally seems like I reallystumbled upon a internet site definitely worth reading. I appreciate it, Ute

  3. reputation management 2013 May 2 at 19:27

    This is really interesting, You are a very skilled blogger.
    I’ve joined your rss feed and look forward to seeking more of your excellent post. Also, I have shared your website in my social networks!

  4. Useful Link 2013 May 3 at 08:25

    I am sure this post has touched all the internet users, its really really fastidious paragraph on building
    up new blog.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: