Key points for a monitoring center pandemic action plan
On-site monitoring centers come under stress when it's necessary for most workers to telecommute. Here are key points to include in a crisis plan to continue service availability.
During unpredictable times, it is essential for monitoring centers to maintain a high level of operational effectiveness. Business areas face monumental challenges when moving to remote work, and it is important for their applications to be there when they need them.
But monitoring centers face the same circumstances as the business units they support.
When everyone has to work from home, monitoring centers face several problems of their own, and it's important to address them. Here are some key factors to include in a monitoring center pandemic action plan.
Impact of communication interruptions
The most important component of a high-quality monitoring center is staff communications. Monitoring information from supported systems is sent to the center for analysis and resolution. Monitoring center systems interpret the incoming information and identify warning and critical statuses based on predefined events and thresholds.
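As an illustration of how predefined thresholds turn an incoming reading into a status, here is a minimal Python sketch. The metric names, the threshold values and the `classify` function are hypothetical and are not taken from any particular monitoring product; real tools usually let each platform define its own events and limits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    warning: float   # reading at or above this value is a warning
    critical: float  # reading at or above this value is critical


# Hypothetical thresholds, expressed as percent utilization.
THRESHOLDS = {
    "disk": Threshold(warning=80.0, critical=95.0),
    "cpu": Threshold(warning=85.0, critical=95.0),
    "memory": Threshold(warning=85.0, critical=95.0),
}


def classify(metric: str, value: float) -> str:
    """Return 'ok', 'warning' or 'critical' for one incoming reading."""
    threshold = THRESHOLDS[metric]
    if value >= threshold.critical:
        return "critical"
    if value >= threshold.warning:
        return "warning"
    return "ok"
```

For example, under these assumed limits `classify("disk", 96.0)` returns `"critical"`, while a CPU reading of 50% is classified as `"ok"`.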
The statuses provide personnel with information pertaining to the availability, performance and recoverability of the monitored platforms. Monitoring tools provide event queues that display incoming notifications. Monitoring centers use predefined procedures to assign notifications to individual staff members.
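One simple assignment procedure is to hand each new notification to the team member who currently has the fewest assigned. The sketch below assumes that rule purely for illustration; the function name and the data shapes are hypothetical, and an actual center may route by skill set, customer or platform instead.

```python
import heapq


def assign_alerts(alerts, staff):
    """Assign each alert to the staff member with the fewest alerts so far.

    Ties are broken alphabetically by name, because the heap compares
    (load, name) tuples.
    """
    heap = [(0, name) for name in staff]
    heapq.heapify(heap)
    assignments = {name: [] for name in staff}

    for alert in alerts:
        load, name = heapq.heappop(heap)      # least-loaded team member
        assignments[name].append(alert)
        heapq.heappush(heap, (load + 1, name))

    return assignments
```

A shift lead could use the resulting per-person lists as a starting point and then rebalance by hand when someone reports they have bandwidth or need help.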
Monitoring centers rely on the close proximity of personnel and high-quality interpersonal communications to quickly solve issues and perform workload balancing. Shift leads are the monitoring center's air traffic controllers, ensuring that their team is addressing the constant flow of issues as quickly and efficiently as possible.
When unforeseen events disrupt monitoring center technicians' ability to communicate with each other, the impact can be catastrophic. Issue resolution backlogs occur, and the number of alerts dropped or mishandled begins to escalate. Monitoring centers that aren't set up for multiple team members working from home need to quickly rearchitect their communication mechanisms to accommodate off-site staff.
Remote connectivity to monitoring platforms and their alert queues is fairly easy to establish and should be the first step in a monitoring center pandemic action plan.
The challenge for monitoring centers is to preserve a high level of communications to maintain high-quality service. Fortunately, there are numerous IM products that facilitate remote discussions, from Google Chat to Microsoft Teams.
Monitoring centers should establish the following chat threads to foster high-quality communications:
- Shift lead and all team members. Team leads can ask who needs help and who has bandwidth. Team members can be encouraged to proactively ask for help or state that they have bandwidth to assist others.
- Team members. This thread should be designed for traditional monitoring center discussions, where team members can ask typical questions, such as "Has anyone seen this issue before?" or "Is this the best way to correct this problem?"
- Complex issue resolution discussions. This channel or thread should be for multiple team members working together to solve challenging problems or issues that have widespread impact.
The monitoring center can also use conference bridge calls to facilitate group communications. I've found that messaging works better for shift lead-to-team member and team member-to-team member communications. Although group problem-solving discussions work fairly well in chat, conference calls seem to work best for solving problems remotely.
Employee availability solutions
Protecting the health of all team members is the monitoring center's top priority. A key component of a monitoring center pandemic action plan is for managers to identify the minimum number of employees who need to be on-site to ensure operational efficiency.
Managers should follow Centers for Disease Control and Prevention guidance, industry standards and commonsense best practices to safeguard essential on-site staff:
- Maintain constant communication with all remote and on-site employees. Employee health checks, personal safety best practice reminders and stay-at-home guidelines should be sent daily.
- Develop a recording mechanism to track absenteeism and facilitate staff workload balancing.
- Determine if company guidelines on absenteeism need to be modified to reduce the chance of sick employees coming on-site.
- Maintain social distancing guidelines.
- Provide protective gear, such as face masks and gloves.
- Maintain an adequate supply of disinfectant products.
- Make each shift lead responsible for disinfecting the operations center on a predefined timetable. Display a checklist in a prominent area that records the date and time each item was cleaned.
Operational solutions -- it's time to triage
Issue resolution backlogs occur for many different reasons. It's the nature of the service, and it happens more frequently than any monitoring center would like.
From internet outages that affect multiple customers to the loss of a client data center, there are numerous reasons backlogs occur. Once the underlying issue is corrected, the monitoring center receives a flood of alert notifications. Reduced staff availability is just one of many reasons for issue resolution backlogs.
Backlogs quickly identify procedural and organizational weaknesses that prevent staff members from handling increased workloads. Monitoring center managers then implement staffing, procedural and system changes to better accommodate abnormally high workload volumes. The staff also gains important experience in backlog handling.
Most monitoring centers run three shifts to provide 24/7 service. Staff makeup is critical. The staffing for each shift should consist of a balance of experienced and less seasoned personnel. Shift leads and more experienced personnel should assume an ownership role during backlogs and triage alerts according to their importance.
System unavailability alerts take top priority, followed by critical system resource utilization issues, which include disk, CPU and memory. The exhaustion of those critical system resources can jeopardize the availability of all applications that rely on that platform. Application-specific error messages should be handled next, with backup errors and system warnings addressed later.
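The triage order described above can be sketched as a simple ranking. The category labels and the `triage` function below are hypothetical names chosen for illustration. Because the text does not say whether backup errors or system warnings come first, the sketch gives them the same rank and keeps them in arrival order.

```python
# Lower rank is handled first. Category names are illustrative assumptions.
TRIAGE_RANK = {
    "system_unavailable": 0,   # system unavailability alerts
    "resource_critical": 1,    # critical disk, CPU or memory utilization
    "application_error": 2,    # application-specific error messages
    "backup_error": 3,         # addressed later
    "system_warning": 3,       # addressed later (same rank: order unspecified)
}


def triage(alerts):
    """Order (category, description) alerts by triage rank.

    sorted() is stable, so alerts with the same rank keep arrival order.
    """
    return sorted(alerts, key=lambda alert: TRIAGE_RANK[alert[0]])
```

Given a backlog that arrives as a system warning, an application error, an outage, a backup error and a memory exhaustion alert, `triage` puts the outage first, then the memory alert, then the application error, with the warning and the backup error last.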