Reduced downtime and increased stability are vital goals for every online service. Happily, the latest server-side updates at Fire And Ice deliver both. The grid now detects and fixes crashes and freezes automatically, so manual intervention is needed less often. Fewer manual server interventions mean more time to spend inworld with our customers.
Reduced Downtime and Increased Stability – Backup improvements
We have previously blogged about our robust backup system. However, each backup required a short interruption to the service: while it ran, users could not log into the grid or teleport in from the hypergrid. Although the disruption lasted only fifteen minutes, it still affected users. We are delighted to announce that this is no longer the case. We still carry out the same high level of backup, but now seamlessly, with no interruption to the user experience.
Freeze detection and recovery – Increased Stability
Freeze detection and recovery are now running on all of the Fire And Ice Grid's region simulators. When a program stops functioning correctly, it typically either crashes and closes completely or stops responding (freezes). We test every part of the Fire And Ice grid frequently. If something has crashed, our system restarts it automatically. Similarly, every running service is tested for responsiveness, and the system restarts any element that does not respond.
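The crash half of that check can be sketched in a few lines of bash. This is only an illustration, not the grid's actual keep-alive script: the function names, the PID-file convention and the restart command are all assumptions made for the example.

```shell
#!/usr/bin/env bash
# Sketch only: is_running, ensure_running and the PID-file layout are
# hypothetical names chosen for this example.

# Succeeds if the PID stored in the given file belongs to a live process.
is_running() {
    local pid_file=$1
    local pid
    [ -r "$pid_file" ] || return 1
    pid=$(cat "$pid_file")
    # kill -0 delivers no signal; it only checks that the process exists
    # and can be signalled by the current user.
    [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null
}

# Leaves a healthy process alone; otherwise runs the restart command
# given as the remaining arguments.
ensure_running() {
    local pid_file=$1
    shift
    if is_running "$pid_file"; then
        echo "running"
    else
        "$@" && echo "restarted"
    fi
}
```

A scheduler could then call something like ensure_running /path/to/region.pid followed by whatever command starts that region. Both the path and the command are placeholders here.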
Reduced Downtime and Increased Stability – Technical Brief
Details of the keep-alive script, which checks whether a process is running and responsive, are available at Crash Detection And Recovery – Fire And Ice Grid.
Backup Improvements reduce downtime
We have stopped using mysqldump and Plump to back up the Robust databases. Instead, Fire And Ice are now using Percona XtraBackup. A full update to the post about our backup procedures is also coming soon.
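For readers unfamiliar with XtraBackup, a hot backup is usually a two-step job: copy the data files while the database server keeps running, then prepare the copy so that it is consistent. The sketch below shows that general pattern only. The function name, the XTRABACKUP_BIN override and the target directory are assumptions for illustration, not the options Fire And Ice actually use.

```shell
#!/usr/bin/env bash
# Sketch only: hot_backup and XTRABACKUP_BIN are hypothetical names.
# Connection credentials are assumed to come from the MySQL client
# configuration rather than the command line.

hot_backup() {
    local target_dir=$1   # should be a new or empty directory
    local bin="${XTRABACKUP_BIN:-xtrabackup}"

    # Step 1: copy the data files while the server stays online.
    "$bin" --backup --target-dir="$target_dir" || return 1

    # Step 2: apply the changes logged during the copy so the backup
    # is consistent and ready to restore.
    "$bin" --prepare --target-dir="$target_dir"
}
```

Because the first step does not take the database offline, a backup along these lines has no need to block logins or teleports while it runs.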
Responsiveness Monitoring improves stability
The responsiveness test uses a bash script that performs an HTTP test. It is a modification of the script already in use to detect crashes; details of that script are available in our crash detection and recovery post. A description of monitoring OpenSimulator is available on the OpenSimulator wiki.
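As a rough illustration of an HTTP responsiveness test, the sketch below asks a service for a page with a time limit and treats a timeout or an error as a freeze. It is not the production script: the function names, the HTTP_PROBE override, the ten-second default and the restart command are assumptions, and the URL to probe depends on the service being monitored.

```shell
#!/usr/bin/env bash
# Sketch only: is_responsive, check_and_recover and HTTP_PROBE are
# hypothetical names chosen for this example.

# Succeeds if the URL answers successfully within the time limit.
is_responsive() {
    local url=$1
    local timeout_s=${2:-10}
    # --fail turns HTTP error responses into a non-zero exit status;
    # --max-time stops waiting on a process that accepts the connection
    # but never answers, which is how a freeze usually looks.
    "${HTTP_PROBE:-curl}" --silent --fail --max-time "$timeout_s" \
        --output /dev/null "$url"
}

# Runs the restart command (the remaining arguments) only when the
# service fails the HTTP test.
check_and_recover() {
    local url=$1
    shift
    if is_responsive "$url"; then
        echo "responsive"
    else
        "$@" && echo "restarted"
    fi
}
```

A process check alone would miss a frozen simulator, because the process is still present. Adding the HTTP request is what lets the same keep-alive loop catch freezes as well as crashes.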