And the beat goes off
In the first week of June, our server mysteriously went offline and came back a few days later, but it was a bad omen… Around the second week the server went offline again, this time for much longer.
So, what happened? Our hoster bricked tons of servers at the same time during a firmware upgrade. Oops.
They said they would replace the faulty hardware with spares, but days passed without our server coming back to life. And well, it looks like the hoster could not remember where they had stored their communication skills…
Life support
Since we had no news from the hoster and no estimated time for the restoration of service, we put up a landing page at a temporary location explaining what was happening and pointed our DNS to it. We also managed to spawn a master server instance somewhere else. We announced what was happening on Twitter, Facebook and Google+, and answered people’s questions on IRC, Discord and GitHub.
Since our downloads are hosted on GitHub and the master server had been restored elsewhere in the meantime, people were still able to download the game and join a server during the downtime. It was also an opportunity for us to discover that we were not very resilient, because of some legacies in our code. We’ll talk more about that later.
Defibrillation
The server is now back: website, wiki, forum, master server, CDN and other services were restored in the last week of June.
It looks like the hoster just didn’t have enough spare hardware, and we got tired of waiting, so we decided to upgrade the server, which was the quickest way to restore services.
Wiring the resilience of Unvanquished
So, we discovered some issues that make Unvanquished less resilient to this kind of problem. Most of them are inherited from our id Tech 3 legacy.
For example, the server code only resolves the master server at start-up time and never resolves it again later. This behaviour was intentional and dates back to the Windows 95 era, when resolving a domain name at run time caused hitches (true story)! If you find it hard to believe that our modern engine still inherits from an engine from 1999, here is the proof.
The issue is that when we started a new master server on the same domain name but on another IP address, we had to ask server owners to restart their servers to get them listed again. We can probably ditch this kind of workaround safely today.
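To illustrate the idea of resolving the master server again at run time, here is a minimal sketch in Python. It is not the engine's code: the `MasterAddress` class, the `ttl` parameter and the injectable `resolver` and `clock` arguments are all hypothetical names chosen for this example. It simply caches the resolved address and looks it up again once the cached value is older than `ttl` seconds, keeping the last known address if a lookup fails.

```python
import socket
import time
from typing import Callable, Optional


def resolve_ipv4(hostname: str) -> str:
    """Resolve a hostname to one IPv4 address with the system resolver."""
    return socket.gethostbyname(hostname)


class MasterAddress:
    """Hypothetical helper: caches a resolved address, refreshes it after `ttl` seconds."""

    def __init__(
        self,
        hostname: str,
        ttl: float = 300.0,
        resolver: Callable[[str], str] = resolve_ipv4,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._hostname = hostname
        self._ttl = ttl
        self._resolver = resolver
        self._clock = clock
        self._address: Optional[str] = None
        self._resolved_at: Optional[float] = None

    def get(self) -> Optional[str]:
        """Return the current address, resolving again if the cache is stale."""
        now = self._clock()
        stale = self._resolved_at is None or now - self._resolved_at >= self._ttl
        if stale:
            try:
                self._address = self._resolver(self._hostname)
            except OSError:
                # Lookup failed: keep the last known address (possibly None).
                pass
            # Record the attempt either way, so a failing lookup is retried
            # after `ttl` seconds instead of on every call.
            self._resolved_at = now
        return self._address
```

With something like this, a game server that calls `get()` before each heartbeat would follow the master server to a new IP address without needing a restart.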
Also, we discovered that our support for multiple master servers was not robust enough. So we not only started a secondary master server, we are also fixing bugs in that multiple master server support.
So, the next version will come with multiple master servers, fixes to properly support master server roaming, and proper master server fallback.
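As a rough illustration of what master server fallback means, here is a small Python sketch, again not the engine's actual implementation. The function `fetch_server_list`, the exception `AllMastersFailed` and the `fetch` callback are hypothetical names for this example. It tries each configured master in order and returns the first list it gets, only giving up when every master has failed.

```python
from typing import Callable, Iterable, List


class AllMastersFailed(Exception):
    """Raised when no configured master server could be reached."""


def fetch_server_list(
    masters: Iterable[str],
    fetch: Callable[[str], List[str]],
) -> List[str]:
    """Ask each master in turn and return the first successful answer.

    `fetch` is a hypothetical callback that queries one master and raises
    OSError (or a subclass such as ConnectionError) when it is unreachable.
    """
    errors = []
    for master in masters:
        try:
            return fetch(master)
        except OSError as exc:
            # This master is down: remember why and fall back to the next one.
            errors.append(f"{master}: {exc}")
    raise AllMastersFailed("; ".join(errors))
```

The same loop shape would apply on the game server side, for example by sending heartbeats to every configured master so that the loss of one does not delist the server.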