Since we launched one of our biggest in-game events yet and reached an all-time high in concurrent players, our team has been working around the clock to resolve our (unfortunately) frequent server issues.
When we launched the event, we had increased server capacity, but we were blown away by the actual number of players who joined the game. As servers have been under heavy load, we want to take some time to talk about what happened, what we're doing now and the next steps towards healthier servers.
We have tackled several different issues, and some are still being investigated. Most of them came from the backend and the game servers. To give players an idea, the backend handles your authentication/login, player inventory and unlocks, the in-game store where you buy items and hunters, as well as matchmaking. Game servers, on the other hand, are where your mission takes place, from spawn to extraction (or death!). As we aim to be as transparent as possible, the following info will be tech-heavy.
What has been happening in the last couple of weeks:
We experienced various types of issues in the last 30 days, mostly since the start of the event on March 24th.
We had about three occurrences of a bottleneck that built up in the backend, causing it to grind to a halt after about three days. We were able to release a hotfix for this issue on April 8th, and it has not been observed since.
Then we had issues that occurred at the same time every day for three days in a row. Initially, we thought we were dealing with network connection issues, as the symptoms are very similar. It turned out that some internal processes were stalling the backend at exactly 00:00 UTC. They were triggering high disk usage on the physical machine, which did not leave enough resources for the backend processes to run. We have made some configuration changes to prevent that from happening. For now, the issue appears to be solved, but we will keep an eye on it over the next couple of weeks, as only time will tell.
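For the technically curious, here is a rough Python sketch of how a "same time every day" pattern can be spotted in a list of incident timestamps. It is an illustration only: the function name `recurring_minutes`, the `min_days` threshold and the sample timestamps are all made up for this example and are not taken from our logs or tooling.

```python
from datetime import datetime, timezone


def recurring_minutes(incidents, min_days=3):
    """Return the (hour, minute) UTC slots hit by incidents on at least min_days distinct days."""
    days_per_slot = {}
    for ts in incidents:
        ts = ts.astimezone(timezone.utc)  # compare everything in UTC
        slot = (ts.hour, ts.minute)
        days_per_slot.setdefault(slot, set()).add(ts.date())
    return sorted(slot for slot, days in days_per_slot.items() if len(days) >= min_days)


# Hypothetical incident timestamps, for illustration only.
incidents = [
    datetime(2021, 4, 1, 0, 0, tzinfo=timezone.utc),
    datetime(2021, 4, 2, 0, 0, tzinfo=timezone.utc),
    datetime(2021, 4, 2, 14, 37, tzinfo=timezone.utc),
    datetime(2021, 4, 3, 0, 0, tzinfo=timezone.utc),
]

print(recurring_minutes(incidents))  # [(0, 0)]
```

A slot that shows up on several distinct days points towards something scheduled (such as a daily job) rather than a random network problem, which is the distinction that mattered here.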
Afterwards, we had network interruptions in the datacenter we use for the backend. These caused either permanent issues, which required a full restart, or temporary issues in one specific region. We do not have control over these interruptions, as they are caused by real-world issues.
And finally, we had an occurrence that appears to have been human error: a process was triggered that was not needed, or not yet needed. This was our mistake, but it has been identified and resolved.
Here's what we're currently investigating:
We still observe a slow performance degradation, which may cause backend servers to go down in a similar fashion to the bottleneck issues. Very often, these problems do not show up during internal testing, as our test environments are not under the same extreme load. We are currently looking into ways to investigate them directly on the live environment without causing any interruption. Once we understand what is causing the memory consumption, we will be able to resolve the issue. At the moment, we can only perform regular full restarts to recover enough memory for the backend to run.
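To give an idea of what "slow degradation" looks like in numbers, the sketch below fits a straight line through memory samples and estimates how long a process could keep running before reaching a limit. This is a simplified illustration in Python, not our monitoring code: the function names, the sample values and the 8000 MB limit are assumptions chosen for the example.

```python
def memory_trend(samples):
    """Least-squares slope of (hours_since_restart, used_mb) samples, in MB per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        raise ValueError("samples must span more than one point in time")
    return cov / var


def hours_until_limit(samples, limit_mb):
    """Estimate the hours left before usage reaches limit_mb, or None if usage is not growing."""
    slope = memory_trend(samples)
    if slope <= 0:
        return None
    _, latest_mb = samples[-1]
    return max(0.0, (limit_mb - latest_mb) / slope)


# Hypothetical samples: (hours since restart, memory used in MB).
samples = [(0, 4000), (1, 4100), (2, 4200), (3, 4300)]

print(memory_trend(samples))             # 100.0
print(hours_until_limit(samples, 8000))  # 37.0
```

An estimate like this only tells us when a restart will be needed. It says nothing about why the memory is growing, which is the part we are still investigating.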
What are the next steps?
We are currently working on a more sophisticated stress test system that will allow us to reproduce the behavior of thousands of players in our testing environment. It will allow us to catch issues before they reach the live servers. Our current target for this system is June 2021.
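For readers wondering what simulating players means in practice, here is a heavily simplified Python sketch of the general idea. It is not our stress test system: `FakeBackend` is a hypothetical stand-in with a fixed number of request slots, the three steps per player are loosely based on the backend responsibilities described earlier, and all timings and capacities are arbitrary values picked for the example.

```python
import asyncio
import random
import time


class FakeBackend:
    """Hypothetical stand-in for a backend with a fixed number of concurrent request slots."""

    def __init__(self, capacity, service_time=0.001):
        self._slots = asyncio.Semaphore(capacity)
        self._service_time = service_time
        self.handled = 0

    async def request(self, name):
        # Requests queue up here once all slots are busy.
        async with self._slots:
            await asyncio.sleep(self._service_time)
            self.handled += 1


async def simulated_player(backend, rng, latencies):
    """One fake player: log in, load the inventory, then queue for a match."""
    for step in ("login", "load_inventory", "matchmaking"):
        await asyncio.sleep(rng.uniform(0, 0.01))  # random "think time" between actions
        start = time.perf_counter()
        await backend.request(step)
        latencies.append(time.perf_counter() - start)


async def run_stress_test(players, capacity, seed=0):
    """Run all simulated players concurrently and report request count and p95 latency."""
    rng = random.Random(seed)
    backend = FakeBackend(capacity)
    latencies = []
    await asyncio.gather(
        *(simulated_player(backend, rng, latencies) for _ in range(players))
    )
    latencies.sort()
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    return {"requests": backend.handled, "p95_seconds": p95}


print(asyncio.run(run_stress_test(players=2000, capacity=50)))
```

Lowering `capacity` or raising `players` makes requests wait longer for a free slot, so the reported latency climbs. Watching that kind of curve in a test environment, rather than on live servers, is the purpose of such a system.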
The latest release and the event were a great success for Hunt, as we doubled our number of concurrent players. We underestimated the community's engagement. To ensure the best possible performance, we run our servers on physical machines. This has proven effective, but it is challenging to scale rapidly, as we depend on the datacenter's capacity. We will be looking into support for rapidly scaling game servers (cloud), but we have faced challenges, as we want to ensure good playing conditions on weaker machines. Integrating this hybrid environment is challenging because the two environments sit in separate datacenters, but the latest success showed us that scalability has become one of our top priorities.
About the event:
We understand that server downtimes during peak hours, combined with the live event, can be very disappointing. We have decided to extend the duration of the in-game event by two days. The event will end on April 21st at 16:00 CEST.
We would like to thank the community for making this event the biggest one to date. Thank you for being there with us, and see you in the Bayou!
The Hunt Team