Hi everyone,
Just a quick update about yesterday's outage on our network. I have written three versions below so you can choose how much detail you want. This in no way means we are taking it lightly. Please be assured we are working hard on what caused this to ensure it cannot happen again.
Sometimes really bad stuff happens, and we deal with it as fast as we can. Staff worked well into the wee hours of this morning to put it right. All we can do is apologize and let you know we are working on making sure it can’t happen again.
This was a fairly catastrophic outage. The OSPF database on several of our routers became corrupted, and the bad data then propagated out to all routers in the network. We had to climb many levels of vendor support to get a resolution, while not impacting the 10,000 or so customers that were still connected.
If you are in an IT / enterprise environment, we will be producing a full incident report once staff have debriefed and we get more information back from our vendor about a permanent fix. We are awaiting a full root cause analysis.
Please contact Inspire if you require a copy of this incident report for your management / records.
Around lunchtime on 15th August, we started to see customers disconnecting from our network in large numbers, and diagnosed an issue with one of our core routers.
What is a router
The internet is run by routers. They pretty much ‘route’ your internet traffic to and from where it needs to go.
Around lunchtime yesterday, one of our core routers got a corrupt route and didn’t know how to get to some places. Unfortunately it then told every other large router in our entire network the bad directions, which ended up cutting off approximately 5000 customers. In normal times all those customers would be ‘re-routed’ to our second core router in our backup data center. However, as part of what happened, the main core router passed the bad route on to the backup router as well.
We paid an insane amount of money for these boxes, and we pay a similarly insane amount for support on them from our supplier / vendor, so we logged a call about what was happening and engaged their engineers for assistance. Due to the complexity of the fault, we had to climb through many levels of engineer to get to a resolution.
It turns out we had a unique problem that they had not seen before, and it took over 12 hours for that resolution, with us getting all services back online around 12:53am this morning.
At this stage we are working with the vendor of the equipment on a ‘root cause analysis’ (i.e. what really happened) and on a permanent fix for the problem. If it does happen again before then, we have a fix in place for a quick recovery (minutes, not hours). Once we have a permanent repair, which will most likely involve us upgrading the software on the core routers, we will test and then implement it.
If you still have no service, please reboot your home router. If that doesn’t work, please call the helpdesk. Everyone should be working by now, so if you are not, it’s likely a separate issue with your connection and not related to this outage.
I apologize again for any inconvenience this caused, and I understand that for a lot of our wireless customers, we are your only source of connectivity. These events are very rare on our network. The current count is 3 outages of this scale over the last 25 years. We strive to have a very robust and resilient network, as we understand the importance of connectivity in the modern world, and we work very hard to keep that standard high.
Cheers,
James Watts, CEO
Inspire Net Limited