Last night, two years’ worth of work and planning for the first online Census night collapsed in a screaming heap. Households around Australia were greeted with a lovely message like:
And, as of 11am (AEST) this morning, it’s still down. So, how could it all have gone so horribly wrong? Why have two years’ worth of planning and 10 million dollars failed so catastrophically?
This article is part educated guess and part analysis drawn from other industry experts who were commenting on the #CensusFail Twitter stream. While it may not be conclusive, the fact that the site has been unable to collect Census data for many hours clearly shows they missed something.
1. Outsourcing risks
The project to collect the Census data, and to make the system scale, had been outsourced to IBM. Somehow the government still hasn’t learnt its lesson about entrusting work to companies with a history of catastrophic failures. Firstly, we have the Queensland Health debacle. This project went so badly that a Commission of Inquiry had to be established, and IBM were subsequently banned from all future Queensland government tenders. It was late and massively over budget, and when it finally went live it caused tens of thousands of Queensland Health employees to be paid incorrectly for up to three years.
Secondly, IBM were the ones who couldn’t scale Myer’s website for the Boxing Day sales. Not only did it fail on the day, but after a week of revamping the site and infrastructure their solution was to place potential buyers in a queue and make them wait before browsing the site. This was a terrible outcome and, speaking as a professional looking in, it’s quite embarrassing that a large company could get it so very wrong.
2. Underestimating the peak numbers
The Australian Bureau of Statistics had previously boasted that they could handle 1 million submissions an hour, double what they expected. However, even a rudimentary analysis by an ordinary citizen shows that this was never going to be good enough.
An estimated 6 million households were going to complete the Census online. The majority of these households have working adults, which means the form would be completed once the usual routines of cooking dinner and herding children were done. Most people would have attempted it between 7pm and 9pm, which puts the peak at over 2 million submissions per hour, with possible surges much higher.
For a government department that specialises in statistics, expecting a peak of only half a million an hour seems like very flawed thinking. If they thought the peak surges would be too difficult to handle, they could quite easily have staggered the Census over a number of nights and achieved a much better outcome.
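The back-of-envelope estimate above can be sketched in a few lines of Python. The 6 million households and the 1 million per hour capacity come from the figures quoted in this article; the share of households submitting in the 7pm to 9pm window (70%) is purely my assumption for illustration, as is the idea that staggering would split the load evenly across nights.

```python
import math


def peak_hourly_rate(households: int, share_in_window: float, window_hours: float) -> float:
    """Average submissions per hour inside the busy evening window."""
    return households * share_in_window / window_hours


def nights_to_stagger(peak_rate: float, capacity_per_hour: float) -> int:
    """Nights needed so each night's peak fits under capacity.

    Assumes (hypothetically) the load splits evenly across nights.
    """
    return math.ceil(peak_rate / capacity_per_hour)


ONLINE_HOUSEHOLDS = 6_000_000  # estimate quoted in the article
STATED_CAPACITY = 1_000_000    # submissions per hour the ABS said it could handle
SHARE_IN_WINDOW = 0.7          # ASSUMPTION: 70% of households submit between 7pm and 9pm
WINDOW_HOURS = 2.0             # 7pm to 9pm

peak = peak_hourly_rate(ONLINE_HOUSEHOLDS, SHARE_IN_WINDOW, WINDOW_HOURS)
nights = nights_to_stagger(peak, STATED_CAPACITY)

print(f"Estimated evening rate: {peak:,.0f} submissions per hour")
print(f"Multiple of stated capacity: {peak / STATED_CAPACITY:.1f}x")
print(f"Nights needed to stay under capacity: {nights}")
```

Under these assumptions the evening rate comes out at roughly 2.1 million submissions per hour, about double the stated capacity and four times the expected peak. Change the share or the window and the numbers move, but it takes a very generous spread to get anywhere near half a million.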
3. Poor infrastructure design
Initial analysis suggests that some of the infrastructure design for the Census was positively woeful. There were only 11 servers to handle all of the Census processing. What’s more, the certificate has hard-coded entries, which means there’s no ability to spin up additional servers to process more traffic.
There were a number of route changes as the issues started to occur, suggesting that capacity limits were hit. These may have been caused by denial of service attacks or simply by the inrush of everyone hitting the site at once; either way, the outcome is the same. The system is hosted directly with IBM rather than on SoftLayer, IBM’s own cloud platform, which has auto-scaling built in. In this day and age, properly designing a system to scale is a well-trodden path for many companies.
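To show why a fixed pool of servers is a problem, here is a minimal sketch of the kind of rule an auto-scaling setup applies. It is not how the Census system or SoftLayer actually works; the function name, the per-server capacity and the 25% headroom are all hypothetical values chosen for illustration.

```python
import math


def servers_needed(requests_per_hour: float,
                   per_server_capacity: float,
                   headroom: float = 0.25) -> int:
    """Servers required to carry the load with some spare capacity.

    per_server_capacity and headroom are hypothetical illustration values,
    not measurements from the Census infrastructure.
    """
    target = requests_per_hour * (1 + headroom)
    return max(1, math.ceil(target / per_server_capacity))


PER_SERVER = 100_000  # ASSUMPTION: submissions per hour one server can handle

# Re-evaluate the pool size as the observed load changes through the evening.
for load in (250_000, 1_000_000, 2_000_000):
    count = servers_needed(load, PER_SERVER)
    print(f"{load:>9,} per hour -> {count} servers")
```

An auto-scaling platform runs a check like this continuously and adds or removes servers to match. With a fixed pool, the answer is whatever was provisioned up front, no matter what the load does.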
4. Privacy concerns
This is the one which really stands out. The government increased the time in which they hold your name and address details associated with your Census from 18 months to 4 years without any explanation as to why. Rightly so, many were very concerned about the impacts of this.
Many Australian citizens and even Senators were outraged at the notion. Nobody likes the government spying on them or keeping personal information without any sort of justification. This would have directly motivated hackers to take action, and we’ve no doubt seen the results.
5. Underestimating hackers
Just to clear it up, what we’re seeing so far suggests an attack, not a hack. There’s a clear distinction here: no data has been lost (yet). The worst thing you can do is make yourself a big target without proper protection in place. The moment you trivialise the power that hackers have, or think you’ve thwarted it, they’ll simply prove you wrong.
Hackers have access to enormous amounts of server resources and also immense talent. Regardless of whether you think their work is unethical or illegal, you simply cannot take the moral high ground and expect to win. The privacy issues no doubt gave them additional incentive to bring the site down.
Even without compromising the data, they’ve shown that they can counteract the government’s protection and planning. This isn’t a good sign at all. If the government can’t stop a denial of service attack, can you trust them to protect the data itself?
Going forward
It’s hard to make suggestions without knowing the exact cause of the failure, but one thing that can certainly be done is to rebuild the trust of Australian citizens. The current level of distrust only causes further concern and again fuels hackers to take action. As the recent election shows, Australian citizens already deeply distrust the government, even without the Census debacle.
Let’s hope that there are actual lessons learnt from this, rather than repeating it all over again in five years’ time.
Update: For another great perspective on the matter, check out the Risky Business blog. The lack of planning and expertise is worse than I thought, especially for the money and size of the companies involved.