Avoiding the ‘Meltdown Scenario’

November 30, 2009 by Paul M. Filed under Performance modelling.

One of the most confronting experiences for a senior IT or business manager must be the failure of a major business computing system for which they are responsible.  This is even more intense if the system is a live service for thousands of customers, like an online banking or trading system or, in the case of a government agency, a service that thousands of tax-paying citizens need to use.

We call this the ‘meltdown scenario’.

With its nuclear catastrophe overtones this might sound overly dramatic, but if you’re the one who ‘owns’ the operational performance of this system and you’ve got senior management, angry customers, and the press all banging down your front door, then you might be just as comfortable trying to clean up Chernobyl.

In years past, outages of such services might have been an inconvenience affecting a minority of progressive, internet-savvy users.  But it is increasingly critical that such systems stay up 24/7/365, and outages are very bad news.  Literally: high-profile sites going down can make for embarrassing headlines.  Google has experienced this a few times in the last 12 months, with its Gmail service going offline without warning for hours at a time.

The senior business stakeholders for these systems have to rely on their IT operations people, who in turn have had to place trust in the architects, developers and testers of the original system.  When the system was designed, built and tested, these folks would probably all have done their best in good faith, with the highest quality they could provide under their project constraints.  But performance failures continue to occur, and stakeholders continue to be nervous about new service deployments.  This is because the design and even the testing of systems, especially the really complex ones, is still not an exact science.  It’s a combination of experience, gut-feel, conservative hardware over-provisioning, and occasionally a little bit of capacity planning and analysis.

In response to needs seen originally in the government sector, the team at NICTA has developed an approach that brings more rigour to the architecture, design and testing of large, complex systems.  We can model the system architecture and its components, then simulate the behaviour of the system when it is subjected to normal or unusual patterns of load, e.g., what happens if 1M users were to log on within a one-hour period?  We can generally tell whether the system will withstand a given load scenario and, if not, at what load it will reach breaking point.
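To make the idea of simulating a load scenario concrete, here is a minimal sketch in Python.  It is not NICTA's technology, just a toy queueing simulation under stated assumptions: logins arrive at random times within a window, a pool of identical workers handles them first-come-first-served, and each login takes an exponentially distributed time to process.  The function names, the 50 ms mean service time and the worker counts are all hypothetical values chosen for illustration.

```python
import heapq
import random


def offered_load(users, window_s, mean_service_s):
    """Average number of workers kept busy: arrival rate x mean service time."""
    return users / window_s * mean_service_s


def simulate_logins(users, window_s, servers, mean_service_s, seed=0):
    """Toy first-come-first-served simulation of a login burst.

    Assumptions (illustrative only): arrival times are uniformly random
    over the window, all workers are identical, and service times are
    exponentially distributed. Returns (mean_wait_s, max_wait_s).
    """
    rng = random.Random(seed)
    arrivals = sorted(rng.uniform(0, window_s) for _ in range(users))

    # Min-heap holding the time at which each worker next becomes free.
    free_at = [0.0] * servers
    heapq.heapify(free_at)

    total_wait = 0.0
    max_wait = 0.0
    for t in arrivals:
        earliest = heapq.heappop(free_at)   # next worker to become free
        start = max(t, earliest)            # queue if nobody is free yet
        wait = start - t
        total_wait += wait
        max_wait = max(max_wait, wait)
        service = rng.expovariate(1.0 / mean_service_s)
        heapq.heappush(free_at, start + service)

    return total_wait / users, max_wait


if __name__ == "__main__":
    # 1M logins in one hour is roughly 278 logins per second.
    print("offered load:", offered_load(1_000_000, 3600, 0.05))
    # Scaled-down run at the same arrival rate, to keep the demo quick.
    for servers in (10, 20):
        mean_wait, max_wait = simulate_logins(10_000, 36, servers, 0.05)
        print(servers, "workers: mean wait", mean_wait, "max wait", max_wait)
```

Under these assumptions the offered load is about 14 busy workers, so a pool of 10 falls further behind for as long as the burst lasts, while a pool of 20 keeps up.  A real performance model would describe the actual architecture, its components and their interactions rather than a single queue.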

Because this is such a well-recognised and important problem, our team has worked hard on refining this technology, to the point where we now have a good solution that has been validated a number of times in real-world field trials.  Of course, we’re now looking for more customers with the types of challenges described.  But we’d also like to hear about other scenarios and examples where this technology could have been applied.  If you have any horror stories that sound like the meltdown scenario, please share them here!


Commentary

  1. [...] is a piece of software that sits (non-intrusively) within a service-oriented IT environment and helps avoid the meltdown scenario, complementing our performance modeling and simulation technology.  It can sense ‘trouble’ – [...]
