The Null Terminator
Ethan Ram’s geeky blog on the seam of technology and product management.
Monthly Archives: September 2011
OPs / Production services review: Nagios, Kiwi Syslog and Limelight CDN
2011 Sep 30
A Review of Services I’ve used in GameGround – Part IV
This is the 4th part in a series of blog posts reviewing several third-party products and services I’ve used in GameGround, and my take on them. The basic approach I’m taking here is the applicability of the product for a lean startup that wants to move fast. In the last post I wrote about analytics and BI reporting tools for the marketing team. This post is about monitoring the health of the servers, for the OPS team. Next in the series: development infrastructure.
Nagios probably is “The Industry Standard in IT Infrastructure Monitoring”, as their slogan says. It’s very popular among IT staff and can be configured to monitor and alert on up to 40-50 servers, so even a medium-size company can use it. It’s free server software: basically a scheduler that executes service checks against installed agents and tests against network devices, reports back the results and raises alerts when predefined thresholds are crossed. There’s also a comprehensive list of extensions and plugins written by the community that can be used to monitor just about anything you’ll ever want.
It’s easy to set up Nagios to watch for server disk space, CPU and the existence of certain services. The difficult part is to create checks that alert you when internal parts of the software behave irrationally and users are not seeing what they should: certain transactions do not end in time, server response time for certain requests is going up, users suddenly cannot see their friends list, etc. These are much harder to watch. To monitor them you’ll need to write code on your back-end servers: special functions (REST/WSDL) that do some internal testing and return true/false accordingly. Nagios can call such functions periodically and alert if they fail. It’s an evolving process: you’ll see your system fail without Nagios alerting about it, and then add more of those checks until it functions well.
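As a rough illustration of the "Nagios calls a function and alerts on failure" idea, here is a minimal sketch of a check script in Python. The exit codes 0/1/2/3 (OK/WARNING/CRITICAL/UNKNOWN) are the standard Nagios plugin convention; everything else is an assumption made for the example: the self-test URL, the JSON reply with an `ok` field, and the 2 and 5 second thresholds.

```python
#!/usr/bin/env python3
"""Illustrative Nagios-style check for an application self-test URL."""
import json
import sys
import time
import urllib.request

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(passed, elapsed, warn_secs=2.0, crit_secs=5.0):
    """Map a self-test result and its response time to (exit_code, message)."""
    if not passed:
        return CRITICAL, "CRITICAL - self-test reported failure"
    if elapsed >= crit_secs:
        return CRITICAL, f"CRITICAL - self-test took {elapsed:.2f}s"
    if elapsed >= warn_secs:
        return WARNING, f"WARNING - self-test took {elapsed:.2f}s"
    return OK, f"OK - self-test passed in {elapsed:.2f}s"


def probe(url, timeout=10.0):
    """Call the (hypothetical) self-test URL; return (passed, elapsed_seconds)."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    # Assumed reply shape: {"ok": true} or {"ok": false}.
    passed = isinstance(body, dict) and body.get("ok") is True
    return passed, time.monotonic() - start


def main(url):
    try:
        code, message = evaluate(*probe(url))
    except ValueError as exc:   # reply was not valid JSON
        code, message = UNKNOWN, f"UNKNOWN - unreadable self-test reply: {exc}"
    except OSError as exc:      # connection refused, timeout, HTTP error
        code, message = CRITICAL, f"CRITICAL - self-test unreachable: {exc}"
    print(message)              # Nagios shows the first line of output
    return code


# Only runs when a URL is passed on the command line.
if __name__ == "__main__" and len(sys.argv) == 2:
    sys.exit(main(sys.argv[1]))
```

Folding the response time into the same check is a design choice: it lets one script cover both "the test failed" and "the test is getting slow", which are the two symptoms described above.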
So it’s wiser to add some testing functionality at design time: plan your server modules to have Nagios testing APIs. You’ll also need to watch that your third-party providers are working right: if the A/B testing API you are using is down, then your site is probably down too. If your Content Delivery Network is down, people are not getting to see your website, although everything is functioning on your side.
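One way to plan for this at design time is to give each server module a small registry of named checks and expose their combined result through one testing API. The sketch below is only an illustration in Python: the check names and the `{"ok": ..., "failed": [...]}` result shape are made up for the example, and the lambdas stand in for real probes of your own code and of third-party providers.

```python
def run_self_test(checks):
    """Run each named check; a check passes only if it returns True without raising."""
    failed = []
    for name, check in checks.items():
        try:
            passed = check() is True
        except Exception:
            passed = False      # a crashing check counts as a failure
        if not passed:
            failed.append(name)
    return {"ok": not failed, "failed": failed}


# Hypothetical checks; replace the lambdas with real probes.
example_checks = {
    "friends_list_loads": lambda: True,
    "transactions_finish_in_time": lambda: True,
    "ab_testing_api_reachable": lambda: True,
    "cdn_serves_static_files": lambda: True,
}

if __name__ == "__main__":
    print(run_self_test(example_checks))
```

Returning the names of the failed checks, and not just true/false, makes the 2am alert a little more useful.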
Nagios – the CONS:
- Nagios was started in 1999. Its core is written in C, but much of the plugin and scripting work around it is Perl. Although new versions have been released, it feels like an old product and things seem harder to achieve than what we’re used to these days. Most of the checks have to be configured manually in config files, including the thresholds for alerts, the number of failures tolerated before an alert is raised, etc.
- To achieve the functional testing mentioned above, and to integrate with all the plugins for the different OS types and kinds of monitoring you’d want to do, you’ll need to code some scripts on the Nagios side (or at least edit existing scripts). That’s Perl coding. Although Perl knowledge is still quite common, it’s fast disappearing from the planet. It’s already hard to find IT managers who can code in it, and younger developers haven’t even heard of it… We ended up getting outside help to create the basic setup. So much for free software…
- The learning curve is long. Expect the system to text-message you false alarms at 2am, telling you the system is down, for a few months, until you get the thresholds right. Expect your CEO to call you at 2am to tell you the system is down but there was no alert… Lots of things went wrong in a live environment of (only) 4 servers we had in production in GameGround – it took us about 3-4 months to get to a relatively solid Nagios setup that actually alerted us on most of the real problems.
- Some of the things to watch for ended up being certain errors written into the different servers’ log files. These may be critical bugs and exceptions thrown from bad things happening down in your code stacks. So you can set up Nagios to grep the log files for those strings. This is very heavy on your servers and on the traffic. Better to have a proper central log server with alerts (see below). But then this actually means that you’re going to have 2 monitoring systems: one is Nagios and one is the central logging server.
- Nagios is good at giving you a green or red sign next to your servers/services. But in reality managers want to know ahead of time that things are going in the wrong direction: queues are not emptying fast enough, response times on some requests are mounting. Nagios is no good for those tasks. You cannot use it to create graphs and its dashboard is not flexible.
- You have to manually define each server and each service you want to monitor. This does not work for a cloud-based environment where adding a server instance is done in a click, or even automatically.
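One common workaround for the manual-definition problem is to generate the Nagios object definitions from whatever list of servers you already keep. Below is a minimal sketch in Python. The directives (`use`, `host_name`, `address`, `max_check_attempts`, `service_description`, `check_command`) are standard Nagios object directives, but the `generic-host` and `generic-service` templates and the `check_nrpe!...` commands are assumed to exist in your installation, and the host list is hypothetical.

```python
def _block(kind, directives):
    """Render one Nagios object definition block."""
    lines = [f"define {kind} {{"]
    lines += [f"    {key:<22}{value}" for key, value in directives]
    lines.append("}")
    return "\n".join(lines)


def render_host(host_name, address, services, max_check_attempts=3):
    """Render a host plus its services as Nagios config text.

    max_check_attempts is how many consecutive failed checks Nagios
    tolerates before it treats the problem as real and notifies.
    """
    blocks = [_block("host", [
        ("use", "generic-host"),            # assumed template name
        ("host_name", host_name),
        ("address", address),
        ("max_check_attempts", max_check_attempts),
    ])]
    for description, command in services:
        blocks.append(_block("service", [
            ("use", "generic-service"),     # assumed template name
            ("host_name", host_name),
            ("service_description", description),
            ("check_command", command),
        ]))
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    # Hypothetical inventory; in a cloud setup this would come from the provider's API.
    inventory = [("web1", "10.0.0.1"), ("web2", "10.0.0.2")]
    services = [("Disk Space", "check_nrpe!check_disk"),
                ("CPU Load", "check_nrpe!check_load")]
    for name, address in inventory:
        print(render_host(name, address, services))
```

The generated text would still need to be written to a config file and Nagios reloaded, which is outside this sketch.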
I don’t know of a good alternative. But I would like to see something that combines system health alerts with Syslog analysis and a real-time configurable dashboard. Any ideas?
If you want good insight into what’s actually happening in your servers, you must check the different servers’ logs. Getting all the logs from all the servers into one place and automating the search for errors, exceptions and irregularities is key to having a healthy working production environment. The first product we checked, following warm recommendations from friends, was Splunk. It has an excellent, easy-to-use web interface and the setup is very easy (assuming that your servers are written and configured to upload syslog/log4 to a central server…). But Splunk is VERY expensive: even for a small server setup like ours they asked for something like $6000/year. The free version is only good for internal testing and running on top of QA systems. For production you’ll need the enterprise version. It does not make sense to pay that much in a startup… So we checked Kiwi Syslog.
Kiwi Syslog is a relatively small piece of software made by a NZ company. Their main interface is based on a Windows installed client. But they now also have a web-based dashboard that gives you the most important features. It’s easy to setup and work with. It’s cool. And it costs like 2% of Splunk’s cost. Go Kiwi Syslog! Go!
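To make "automating the search for errors, exceptions and irregularities" a bit more concrete, here is a minimal sketch in Python of the kind of scan a central log server would run over collected lines. It is not how Splunk or Kiwi Syslog work internally; the patterns and thresholds are hypothetical and would need tuning to what your own servers log.

```python
import re
from collections import Counter

# Hypothetical patterns; adjust to your own log formats.
DEFAULT_PATTERNS = {
    "exception": re.compile(r"\b\w+(Exception|Error)\b"),
    "critical": re.compile(r"\b(CRITICAL|FATAL)\b"),
    "timeout": re.compile(r"timed? ?out", re.IGNORECASE),
}


def scan_lines(lines, patterns=DEFAULT_PATTERNS):
    """Count how many lines match each named pattern."""
    hits = Counter()
    for line in lines:
        for name, pattern in patterns.items():
            if pattern.search(line):
                hits[name] += 1
    return hits


def alerts(hits, thresholds):
    """Return the pattern names whose hit count reached its threshold."""
    return sorted(name for name, limit in thresholds.items() if hits[name] >= limit)


if __name__ == "__main__":
    import sys
    counts = scan_lines(sys.stdin)          # e.g. pipe a collected log file in
    print(dict(counts), alerts(counts, {"critical": 1, "exception": 5}))
```

Running the scan once on the central server, instead of grepping every production server, is what keeps the load off the machines that serve users.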
Working with a Content Delivery Network is an important factor in speeding up your page load time. When we tested before-and-after we saw a dramatic decrease in first-time page load, from 3-4 seconds to 2-2.5 seconds for US-based users. With later widgets and pages the load time was about 30% faster. This is a lot! The other reason you’d like to have a CDN is that it’s going to take a large percentage of the traffic off your servers, so you’ll end up having fewer servers and paying less for traffic.
The basic service a CDN offers is the speeding up of static content (Imgs, CSS, JS files) delivery. The advanced services CDNs offer are media streaming and something called Whole Site Delivery – out of scope for this blog post. For the small site/service you’re going to pay $1000-$2000/month for the basic CDN – it may not be too bad considering the reduced costs on servers and traffic.
If you know you’re going to use a CDN you can write your code and delivery procedures in a way that starting to use a CDN would just be a flip of a config file entry. If you already have a website/service functioning without a CDN you’ll probably need to do some work to separate and version the static files correctly and add proper configuration everywhere. So, with the right design you should be able to integrate with a CDN, change CDN or stop working with a CDN in a matter of minutes.
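A minimal sketch of the "flip of a config file entry" idea, in Python: all static URLs are built by one helper that reads a config value, so turning the CDN on, off, or pointing it at another provider changes nothing else in the code. The `StaticConfig` fields, the `/static/v<version>/` path layout and the example host name are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StaticConfig:
    cdn_host: str = ""   # empty string means "serve static files from our own site"
    version: str = "1"   # bumped on each release so cached copies are not stale


def static_url(path, config):
    """Build the URL for a static file (image, CSS, JS) from the current config."""
    path = "/" + path.lstrip("/")
    versioned = f"/static/v{config.version}{path}"
    if config.cdn_host:
        return f"https://{config.cdn_host}{versioned}"
    return versioned


if __name__ == "__main__":
    print(static_url("css/site.css", StaticConfig()))
    print(static_url("css/site.css", StaticConfig(cdn_host="cdn.example.com", version="42")))
```

Putting the version in the path is one way to handle the "separate and version the static files" part: a new release gets new URLs, so neither the CDN nor browsers serve outdated files.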
So the story goes like this: We decided we had to have a CDN because every millisecond of page load time is critical. This was before launching our initial service. We went shopping and were surprised: it seems that most of the bigger CDNs were not willing to work with us at this stage at all. Even the local rep of the local Cotendo (a startup sharing a VC with GameGround) never returned a phone call… Luckily the local rep of Limelight was willing to take the deal, and after a couple of weeks of negotiations we switched ON the config and it was working well (we did have a couple of config issues, minor faults on our side).
Q: Should a small lean-startup deploy a CDN as part of their initial release?
A: NO NO NO. It’s expensive and the signing up with the local representative of a CDN will consume too much of your time.
Q: Should a lean-startup write their code with a CDN in mind?
A: Yes! Sure! This will allow you to speed up your site and offload traffic if and when your site/service is showing some signs of success. Coding with a CDN in mind won’t make it slower anyway.
Q: Can you give some hints on how to design it right to work with a CDN?
A: I promise to have a post about it later on… << but if you have a specific Q – ask it in a comment below
Q: Are there no free/cheap alternatives?
A: There are! Check out this post about using Google App Engine as a free static data CDN. Also – this post about using DropBox as a free CDN solution. Note that if the delivery of the resources from those unofficial CDNs is not faster than delivering them from your own site, then adding a CDN configuration might actually slow down your site. Beware!
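Whether a free or cheap CDN is actually faster is something you can check with numbers before switching. The sketch below, in Python, compares the median of load times you measured from your own site against the median measured through the CDN; the function name and the 10% minimum gain are assumptions, and collecting the timings themselves is left to whatever measurement tool you use.

```python
from statistics import median


def cdn_is_worth_it(origin_secs, cdn_secs, min_gain=0.10):
    """True if the CDN's median load time beats the origin's by at least min_gain (a fraction)."""
    origin = median(origin_secs)
    cdn = median(cdn_secs)
    return (origin - cdn) / origin >= min_gain


if __name__ == "__main__":
    # Timings in seconds, in the range reported earlier in this post.
    print(cdn_is_worth_it([3.0, 3.5, 4.0], [2.0, 2.25, 2.5]))
```

The median is used so that one unusually slow request does not decide the outcome.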
Going Agile in a B2B Company
2011 Sep 10
On why Agile is the right development methodology for non-internet software companies too
Of the 14 years I’ve been developing software, 10 were with companies doing B2B software (intended to be sold to another business, as opposed to B2C – software that is directed at customers online etc.). In recent years the Agile development methodology has been growing strong, and a recent Forrester study shows that over 40% of development teams in the US are now using some sort of Agile development methodology. I’ve heard of Agile projects in some of the larger companies and had a chance of “upgrading” my own development department to work in an Agile environment (we took Kanban as our preferred Agile approach). Now, this blog post is not going to be about my experience with Agile. Instead I’m going to tell you about a talk I had with a friend who told me Agile was not for his (awesome) B2B software company, and my response to that. These are the reasons why he thought Agile was not for him:
- The company’s product release cycle is 3-6 months – They have a larger release and a couple of smaller releases, or Feature Packs, yearly. This is what their sales and channel guys are used to and can handle. They will not accept a daily/weekly/monthly release anyway – this is not how they work.
- Every other year they have at least one of those larger development projects where they tear apart large parts of the system and re-architect them. This takes several months to develop and runs in parallel to other projects. Those projects will not fit into an Agile development environment.
- They cannot afford the time it would take to go Agile. They are too tight with schedule to allow their devs to go back and add unit testing to existing code and development infrastructure for automation – and test automation is key to any kind of Agile-ness. Right?
- Their QA manager is happy with the code he gets and the released product quality is fair.
- The company is doing well in general – why change?!
At first this all seemed logical to me, as I knew that the real power of Agile development lies in the quick release cycles (“give something small to your customers often”) and in cases where the software quality sucks. Anyway, on second thought, these are the questions I asked:
- Are you, the product manager (or company manager in a small operation), happy with the time it takes from when you identify a clear customer need to when you have a new feature you can sell to that customer? If I told you that you could be in a position to define customer-specific releases that would have the new feature out in 3 weeks, how much would that be worth to you? ((AND your R&D manager is not going to freak out about this last-minute change because his yearly timeline is getting f**ked and his people don’t like working Saturdays.))
- How much time do you spend before every major release reviewing product requirements and development estimations, trying to fit 30 new features into development time for 10? Prioritizing again and again, only to find out 4 weeks before release that 20% of the features you were guaranteed to have will not make it on time. Or the deadline must be pushed.
- How many times did you want something added to the upcoming release, as a large customer deal is pending on it, and your R&D managers said there’s no time and you’ll have to take something out of the release – – or the deadline must be pushed…
- Why is it taking the QA teams 2 full weeks to test a new version release? Does it have to be like this: they first get a version that just doesn’t work 4 weeks before deadline. Then this ping-pong between R&D and QA till the version stabilizes: the open bug count starts dropping to the point of reaching the ultimate go/no-go meeting, where the QA manager signs off the version at the end of an extremely hectic and sleepless month. Is there no better way to meet the same goal?
- It happens that the more lines of code we have (and we have more every day), the more QA personnel we need. We often get to the point that the QA manager tells me “…we haven’t had the time to complete the testing of this version…”, “…we need more QA engineers just to complete the STP in time…” – We often release without completing the QA cycle, and indeed we often need to patch the version a week or two after release because it has some major bugs. WTF???
- The product manager often complains that the developers missed some of his instructions in the PRD and that a feature is f**ked as a result. Then the QA manager complains that the version has some features missing and some new features he did not know of, and one of the team leaders told him that some things were changed and showed him an email thread from 2 months ago where those new features were detailed. Well, sort of detailed.
- Your development manager is asking for 8 weeks of developer time to conduct a code merge of the new feature that is already tested into the main code branch. The Mac OS-X product was written on a branch that was never merged back to the main branch and devs keep complaining that it’s time we move to a better/modern/faster source control server because those XML files are never merged right and there’s so much manual work to do with every code commit. There must be a better way.
- And the competition is there. They are fast. How come they are faster to react than we are?
Well – you’re expecting this – go Agile. 🙂
I’m not saying it’s magic. You’ll have to invest time to make it happen. You’ll have to give it a chance and believe it can greatly improve your performance. Why “believe”??? We are engineers (or sales guys) and we have targets and methodologies to work with. Why do we need to believe? Because you’ll have to change the way you work. People don’t like to change. People like to stick with processes they know. They mostly don’t see the flaws. They find it hard to believe that it can be so much better.
This is why I think that the goals for an Agile project must be set by a high ranking manager. The R&D manager is going to be involved for sure, but also marketing/product manager, and sales exec for sure as one of the main goals for an Agility project would ultimately be to improve the sales cycle and time to market. So, it seems the division head or CEO is probably the person that should set the goals and allocate the resources for an Agility project.
OK. So you are saying “the above are exactly the problems I see in my company but I’m not the CEO. I’m merely a team leader in R&D…” – Now what? Well, send this post to your CEO and ask for a meeting to discuss the topic. Come prepared. Bring along an Agile coach/consultant. They are used to talking high management into investing in Agile.
Next up – some guidelines on how to go Agile without missing your yearly quota / deadlines.
Qs and comments are welcome as always.