Reducing service downtimes due to human error
An important benefit of the Red Hat subscription for customers is the support. As Technical Account Managers (TAMs), we try to understand patterns behind the issues we are investigating together with our customers and partners. One of the recurring questions is: how can I reduce downtimes due to mistakes by the system operators?
There is no fairy dust..
..at least not as part of the current Red Hat subscription. Servers and the services running on them do not configure and maintain themselves, so for now humans perform these actions. These actions can lead to unintended outcomes: rebooting the wrong system, misconfiguring an agent of an HA cluster so that the service goes down, or mistyping a command and zeroing a partition table.
You probably guessed it: nothing prevents unwanted outcomes 100% of the time. Systems run complex software stacks, which have to be configured and maintained. What can be done to reduce the likelihood of downtime and problems?
How can policies, regulations and best practices help?
Most importantly: education. It will pay off to give the sysadmins time and resources for training.
Permissions should be restricted to the required minimum: someone who administers a database on a system might need fewer permissions than the sysadmins of the system itself. An rm -rf / executed with lower permissions is not as bad as one executed as the root user.
Introducing rules to have admins log in with personalized accounts and only afterwards become root ensures that you can see who was on a system with root permissions when an issue occurred. Even better than a full root shell is having the personalized users execute single commands with elevated privileges via sudo.
Ever executed “reboot” on the wrong system? Several things can help to reduce such cases. The motd of a system can be modified, but it is displayed only once, after login. Changing the PS1 variable so that the prompt looks like [user@host PROD]: and also includes color codes can help, for example:
PS1="\e[41;4;33m[\u@\h production]$\e[40;0;33m "
Also, commands like reboot could require the user to type in the hostname of the system to be rebooted before they run. For graphical logins as the root user, if these are still used anywhere, red backgrounds could be used.
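The hostname check before a reboot could be sketched as a small shell wrapper. This is only an illustration: the function names confirm_host and safe_reboot are made up for this post, and the comparison uses the short hostname as reported by uname -n.

```shell
# Hypothetical helper: succeeds only if the hostname read from stdin
# matches the short hostname of the system this shell is running on.
confirm_host() {
    expected_host=$(uname -n)
    expected_host=${expected_host%%.*}   # strip a domain part, if any
    # Show the prompt only when a human is typing on a terminal.
    if [ -t 0 ]; then
        printf 'Type the hostname of this system to continue: ' >&2
    fi
    read -r typed_host || return 1
    [ "$typed_host" = "$expected_host" ]
}

# Hypothetical wrapper: runs the real reboot only after confirmation.
safe_reboot() {
    if confirm_host; then
        command reboot "$@"
    else
        echo "Hostname does not match, not rebooting." >&2
        return 1
    fi
}
```

Admins could then alias reboot to safe_reboot in their interactive shells. A wrapper like this does not stop anyone who calls the reboot binary directly; it only adds a moment to notice that the terminal belongs to the wrong system.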
As for best practices, reducing the complexity of setups also helps to keep the potential for mistakes down. Additional layers like virtualization, microservices and cluster software increase complexity, but of course they also provide benefits. Benefits and complexity should therefore be weighed against each other.
A further approach is to separate tasks into planning and performing:
In the first step, a person plans an action to be performed on a system, down to single commands, and documents them.
In the second step, a different person executes these steps.
With this, four eyes are looking at the changes, instead of just two. In a less formal manner, ‘pair admining’ (from ‘pair programming’) can be done, where two people together decide the commands to be executed on a system.
How can technology help us here?
Virtualization opens up a collection of special options. Customers running services in guests can snapshot their guests, for example once a day, or right before a user logs in. That way, the partition table that the sysadmin just removed can be restored from the snapshot. This approach has limits if multiple guests and storage are involved in the application: ideally, all of them then need to be snapshotted. As an alternative to using the snapshot for recovery, one could also perform the following before a sysadmin logs into the production system:
Do a snapshot of the production system.
Start that snapshot as a test system.
Have the sysadmin try out the intended action on the test system first.
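Taking the snapshot, and rolling back to it after a mistake, could look roughly like this on a host where the guests are managed with libvirt. That is an assumption, as are the guest name and the snapshot naming scheme. Starting the snapshot as a separate test system depends heavily on the storage setup and is left out of this sketch.

```shell
# Assumption: guests are managed with libvirt, so virsh is available.

# Build a snapshot name such as "pre-login-20240131-0930".
snapshot_name() {
    echo "pre-login-$(date +%Y%m%d-%H%M)"
}

# Take a snapshot of the guest given as $1.
snapshot_guest() {
    virsh snapshot-create-as "$1" "$(snapshot_name)"
}

# Roll the guest $1 back to the snapshot named $2.
revert_guest() {
    virsh snapshot-revert "$1" "$2"
}
```

For a hypothetical guest called prod-db, one would call snapshot_guest prod-db before the login, and revert_guest prod-db with the snapshot name if the change went wrong.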
Backups are important, yet we still often see that no backups are implemented. They are essential for restoring mistakenly removed files. RHEL includes Relax and Recover (ReaR) as an image backup solution. One can also run a script that mounts a separate hard disk daily and uses rsync to copy all files to it, to mitigate situations like hard disk failure. As an alternative to dedicated backup servers, we have also seen environments where a group’s systems regularly sync their data to two other systems nearby. In many setups, there is enough free disk space for this.
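A minimal sketch of such a daily rsync script is shown below. The device /dev/sdb1 and the mount point /mnt/backup are placeholders, and the script assumes it runs as root, for example from cron.

```shell
# Hypothetical device and mount point, adjust to the local setup.
BACKUP_DEV=${BACKUP_DEV:-/dev/sdb1}
BACKUP_MNT=${BACKUP_MNT:-/mnt/backup}

daily_backup() {
    # Without the backup disk there is nothing to do.
    mount "$BACKUP_DEV" "$BACKUP_MNT" || return 1
    # -a archive mode, -A ACLs, -X extended attributes,
    # -x stay on the root file system (skips /proc, /sys and the backup disk).
    # No --delete: files removed by mistake stay available in the copy.
    rsync -aAXx / "$BACKUP_MNT/"
    rc=$?
    # Unmount again, so the copy is out of reach for stray commands.
    umount "$BACKUP_MNT"
    return "$rc"
}
```

Keeping the disk unmounted between runs is a deliberate choice here: an accidental rm -rf / then cannot reach the copy.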
Config management can help here. The ideal is that you no longer log on to several systems individually, but instead:
Write instructions for a config management system like Ansible or Puppet.
Then apply these instructions to your systems.
This can pay off quickly: it allows you to apply the rules not just to one system, but to many. Formerly error-prone actions (“let me cut’n’paste these 30 commands into a shell”) can be developed into config management instructions. These instructions can first be applied to test systems and, if the outcome is as desired, then to the production systems. When config management is used for the complete administration, straight from system deployment, you can also just deploy additional systems that will look the same, because the same config management instructions were used.
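With Ansible, the “test systems first, then production” order could be wrapped in a small shell function. This is a sketch under assumptions: the playbook name site.yml, the inventory files inventory/test and inventory/production, and the function name staged_rollout are all invented for this example.

```shell
# The command can be overridden, for example for trying out the function.
ANSIBLE_PLAYBOOK=${ANSIBLE_PLAYBOOK:-ansible-playbook}

staged_rollout() {
    playbook=${1:-site.yml}
    # Apply to the test systems first; stop if anything fails there.
    "$ANSIBLE_PLAYBOOK" -i inventory/test "$playbook" || return 1
    # Preview the changes on production without applying them.
    "$ANSIBLE_PLAYBOOK" -i inventory/production "$playbook" --check --diff || return 1
    # Only now apply the instructions to production.
    "$ANSIBLE_PLAYBOOK" -i inventory/production "$playbook"
}
```

A successful run on the test systems only shows that the playbook completed there. Whether the outcome is really as desired still has to be verified before production is touched, so in practice a human check would sit between the first and the second step.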
One could set up policies such as “all changes have to be done via config management,” maybe with multiple approvals required before rules can be applied. Of course, such strict constructs prevent quick logins to the system to debug an issue in the event of an emergency. A customer’s policy will probably be somewhere in the middle, with “direct login only in emergencies, normally all done via config management–after verification in test environments.”
Log and error tracking systems can help. One option is to collect system logs on a central system and make them searchable, for example with Elasticsearch. One could also record all shell sessions on systems, including the commands which are typed in and the output which was obtained. With this, a sysadmin hitting an error will be able to look up whether the same message has appeared before.
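Recording shell sessions can be tried out with the script utility from util-linux. The following sketch assumes a log directory /var/log/shell-sessions that exists and is writable by the users; the directory, the variable SESSION_RECORDED and the function names are made up for this example. It is not tamper-proof, since users can edit files they are allowed to write.

```shell
# Hypothetical log directory.
SESSION_LOG_DIR=${SESSION_LOG_DIR:-/var/log/shell-sessions}

# Build a log file name from user name ($1) and timestamp ($2).
session_log_path() {
    echo "$SESSION_LOG_DIR/$1-$2.log"
}

# Start recording, unless this shell already runs inside a recording.
record_session() {
    [ -n "${SESSION_RECORDED:-}" ] && return 0
    SESSION_RECORDED=1
    export SESSION_RECORDED
    # -q: no start/end messages, -f: flush output after each write.
    exec script -q -f "$(session_log_path "$(id -un)" "$(date +%Y%m%d-%H%M%S)")"
}
```

Calling record_session from a login profile would replace the login shell with a recorded one. The exported SESSION_RECORDED variable keeps the shell started by script from starting yet another recording.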
An internal knowledge database will help to capture knowledge about environments and recipes for single tasks such as exchanging software RAID disks. Open source wikis are likely to fulfill most requirements, and impose no additional costs. A word of warning: having commands ready for cut’n’paste is good, but understanding the background of the commands is also important. Be mindful that website code can play evil tricks, so that the pasted code differs from what you intended to copy from the website. Putting a ‘#’ in front of the pasted line, so that the shell treats it as a comment instead of executing it, is important, in addition to inspecting the command.
If I had to name just two important factors to reduce unintended downtimes, then these would be:
Provide appropriate education to everybody with access to the systems. Your TAM should have a good overview of the existing skills and how they match the problems experienced, and can recommend training or other activities to fill the gaps.
Give admins an environment where they have enough time to try out the production software on test systems. RHEL in KVM guests is enough for certain scenarios; for others, real hardware is required. For example, when purchasing expensive HA cluster software, licenses for training the administrators should also be priced in.
Thanks a lot to the Red Hat TAM team; many TAMs contributed to this post.