Now that most enterprises have settled into a pattern of remote work forced by the COVID-19 pandemic, they're preparing for the shift to be permanent and applying organizational techniques that reflect the DevOps model to ease that transition.
The pandemic and accompanying shift to remote work revealed a stark difference between companies that had embarked on digital transformation projects such as Agile development, DevOps deployment and cloud computing, and those that had not. At least anecdotally, the more companies had embraced digital transformation before the pandemic, the better off they were.
Having a DevOps model instilled good IT collaboration habits and automated workflows that lent themselves to use by a remote workforce, and using cloud computing services meant employees could access resources from outside their brick-and-mortar offices more easily.
Moreover, the DevOps model and associated Agile development philosophy address organizational issues that go beyond tech, such as empathy, collaboration between IT and business stakeholders, team flexibility and reducing repetitive tasks with automation, which also proved useful to workers as they navigated the COVID-19 crisis. Such attitudes will also be crucial to weather the long-term effects of the pandemic and establish a new way of working in the so-called "new normal."
Before enterprises could revisit long-term strategies, however, they had to address the immediate crisis as the first wave of COVID-19 reached U.S. shores. At companies such as Rack Room Shoes, the ability to quickly shift tech priorities according to business needs under the DevOps model proved instrumental in that short-term crisis response -- in fact, the company's cloud-based e-commerce project is now credited with keeping the business afloat as the pandemic battered retail businesses.
The chain of shoe stores, founded in 1920 and headquartered in Charlotte, N.C., was a latecomer to e-commerce before the pandemic, having launched its digital division six years ago -- 10 to 15 years after most of its competitors, said Kevin McNall, director of digital products at the company.
"We had some advantages, frankly, because we didn't have to make some of the same mistakes [competitors] did," McNall said. "But we also had some growing pains -- in fact, last Christmas before the pandemic, we had a sizable outage that caused us this year to spend more resources towards stability."
Among the changes was applying a DevOps monitoring tool, Dynatrace, both to the legacy on-premises systems that backed cloud applications and to the test environments used early in the software deployment process. The resulting app changes reduced the number of times applications connected to back-end databases, improving the stability of the e-commerce infrastructure.
This outage and its aftermath, while painful, ultimately meant Rack Room's e-commerce systems were able to withstand the sudden surge in online business that came as COVID-19 forced the closure of its 500 brick-and-mortar stores.
"Our e-commerce business, which was already growing 25% year over year, was suddenly up 150% year over year," McNall said. "On some days, we were running 300% to 400%."
Cloud-based systems and a flexible software delivery pipeline also supported the company's internal shift to a remote workforce and allowed it to quickly reprioritize marketing efforts. Employees at store locations worked to fulfill online orders, minimizing job cuts. Once stores began to reopen this summer, e-commerce maintained a higher share of the company's revenues than ever before, and it appears the company's shift to a primarily e-commerce business will be permanent, McNall said.
DevOps model shores up remote workforce efficiency
Even at cloud-native companies where DevOps collaboration and remote work were common before the pandemic, moving to a majority or entirely remote workforce prompted more systematized, software-driven practices -- some developed in less than a day -- where employees had previously bridged gaps with informal in-person discussions.
"We were trying to figure out how to best support things that used to be in person … and how to still deliver the same quality of service but do it through ticketing," said Jason Bergado, senior director of program management and end user services at cloud file sharing company Box, describing the early days of the pandemic. "We had to get better on our own hygiene on the back end."
Bergado's IT service desk team used a Jira Service Desk add-on from Atlassian app vendor Refined to create a centralized portal for all 2,500 Box employees and contract staff to access information related to the pandemic, including company announcements and FAQs, health benefits information and office status. The team also set up an automated employee onboarding system to support new hires and relieve pressure on the service desk team.
As employees used the COVID-19 information portal, Bergado's team applied another DevOps principle, continuous feedback, to optimize the system. They sent data from a tracking mechanism in Confluence, which served documents through the Service Desk portal, into a Tableau data visualization tool to get a sense of what resources employees used most.
"Week by week, we were just toggling the different modules to make it more useful," Bergado said. "And what we saw, interestingly enough, was the traffic to the COVID portal started to increase."
Like Rack Room's McNall, Bergado said he believes it's unlikely his company will return to the pre-COVID-19 status quo.
"There have been optimizations and efficiencies that have been gained from being forced into this new normal," Bergado said. "Talking to my peers in the industry, they're already reconsidering their real estate plans, pulling back from opening new offices and reducing the footprint that they have in buildings."
Remote workforce struggles with new projects, burnout
COVID-19's fallout generated good news for tech, but it also brought plenty of bad news. The shift to a remote workforce didn't hinder short-term projects at first, but not having in-person contact has made it hard for some DevOps teams to brainstorm about what's next.
At online credit reporting company Credit Karma, teams easily repurposed a data analytics system based on Google Kubernetes Engine infrastructure and the BigQuery cloud data warehouse in response to customers' inquiries during the initial COVID-19-induced financial crisis. Credit Karma used that system to create a product called Relief Roadmap that helped users sort through available financial assistance services. Internally, Credit Karma IT teams went fully remote, and, like teams at Box, improved the way they documented work instead of relying on informal conversations.
"At first, it seemed like we weren't missing a beat -- productivity actually went up a little bit," said Credit Karma co-founder and CTO Ryan Graciano. "My read on that looking back is it's not too hard to continue work remotely. But it's difficult once you get into planning and starting new things."
In the past, Credit Karma staff planned new projects during in-person, off-site meetings, which have been difficult to replicate with a remote workforce, Graciano said.
"Any Black Swan event changes the way that we operate," he said when asked whether the switch to a remote workforce will be permanent for his company. "But I don't think we figured out a true replacement for getting in a room around a whiteboard."
In fact, it has been easier for most digitally transformed companies to shift technical resources than human ones. The strain from quarantine isolation and blurred boundaries between work and home life has taken a toll on employee performance throughout the tech industry, among traditional and cutting-edge companies alike.
When anonymous online professional network TeamBlind asked more than 9,000 professionals in an August survey, "Is WFH [Work From Home] hurting your mental health?" 66% of respondents said yes. The highest numbers of "yes" responses came from workers at large tech companies such as Amazon, Microsoft and Google.
Microsoft analyzed data on U.S.-based teams of remote employees from March through June, with sobering results about a breakdown in work-life balance.
"Working in pockets helped but sometimes we found that job demands rushed in to fill spaces previously reserved for personal downtime," wrote a trio of Microsoft employees in a published report about the data analysis last month. For example, "The 10% of employees who previously had the least weekend collaboration -- less than 10 minutes -- saw that amount triple within a month."
COVID-19 recovery calls for technical, human incident response
Even at tech giants such as video streaming service Netflix, which has done a booming business during COVID-19 quarantine, the effects of the pandemic crisis on IT staff were clear.
"Everybody's sense of time and the stories that were relevant to their lives were extremely compressed," said J. Paul Reed, senior applied resilience engineer at Netflix, in a presentation about COVID-19 incident response at an "SRE from Home" virtual event July 23. "And that's very painful. It's cognitively painful."
Under the DevOps model, many enterprises shift day-to-day application management tasks to developers, and assign IT ops the role of site reliability engineer (SRE). SREs handle IT incident response, and apply lessons learned from incidents to make IT systems more reliable and efficient overall.
At Netflix, SREs treated the surge in traffic prompted by the pandemic as an ongoing IT incident that would last longer than most for the company. To keep it manageable, the SRE team set clearly defined roles for the different IT teams involved, and specific exit criteria for closing the incident.
"We recognized early on that this wasn't going to be a two-week or three-week sort of issue," said Tim Heckman, senior SRE at Netflix. "It is an incident, but it's not something where [things are] on fire, and we're not going to have somebody in the Slack channel 24/7 reacting to things. Most of the time we're just going to be waiting, observing, and seeing how the system changes around us."
The team also had to adjust the system monitoring metrics it used to understand when this type of long-running incident was subsiding, according to Reed.
"We didn't actually say, 'Well, when this one metric gets to this one level and stays at this one level, that's what we'll look at,'" Reed said. "We took the derivative of some of those metrics and said, 'When we see the metric not bouncing wildly around or not on this really high trajectory, that's what we'll look at to see that the rate of change has stabilized.'"
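The idea Reed describes, watching a metric's rate of change rather than its absolute level, can be illustrated with a minimal Python sketch. This is not Netflix's tooling: the function names, the five-sample window and the 2% tolerance are all hypothetical choices made for illustration, and a real monitoring system would likely smooth the data first.

```python
from statistics import mean


def rate_of_change(samples):
    """Discrete first derivative: the difference between consecutive samples."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]


def has_stabilized(samples, window=5, tolerance=0.02):
    """Return True when each of the last `window` changes is small
    relative to the metric's recent average level.

    `window` and `tolerance` are illustrative assumptions, not values
    taken from the article.
    """
    deltas = rate_of_change(samples)
    if len(deltas) < window:
        return False  # not enough history to judge
    recent_deltas = deltas[-window:]
    baseline = mean(samples[-(window + 1):])
    if baseline == 0:
        return all(delta == 0 for delta in recent_deltas)
    return all(abs(delta) / abs(baseline) <= tolerance for delta in recent_deltas)


# Hypothetical traffic readings: a steep climb, then a plateau at a new high.
surging = [100, 150, 220, 310, 420, 560]
settled = surging + [600, 603, 601, 604, 602, 603]

print(has_stabilized(surging))
print(has_stabilized(settled))
```

In this sketch the plateau counts as stable even though traffic sits far above its starting point, which mirrors the distinction Reed draws between a metric reaching a particular level and its trajectory flattening out.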
Most importantly, the humans managing systems ultimately determined the company's capacity to adapt to the pandemic.
"We recognized that … the systems may need to scale, and we may need to make changes to meet a new global demand, [but] that is much different than how our peers, the people we care about and work with, are going to be impacted by this," Heckman said.
Thus, the SRE team's role was not just to watch systems and shore up their reliability, but also to manage communications with other employees, Heckman said, "not only so they had the current context of what we were thinking, suggesting and where we were headed, but also to give them some confidence that the system around them would be fine."
Similar principles must be applied to manage the human impact of a longer-term shift to remote work, said Jaime Woo, co-founder of Incident Labs, an SRE training and consulting firm, in a separate presentation at the SRE from Home event.
"The answer is not 'just be stronger,'" for humans during such a transition, any more than it is for individual components in a distributed computing system under unusual traffic load, Woo said.
For both technical and human resources, companies must use monitoring techniques to anticipate issues and focus on changing system-wide, organizational factors that improve resilience, he said.
"There's actually a recipe for stress," Woo said. The main ingredients, according to research by Woo and behavioral scientists, are novelty, unpredictability, threats to the ego and a lack of control. These situational elements are measurable, and employee feedback, like IT system metrics, should be closely monitored for them.
IT managers can then take an SRE-like approach to minimize those factors for individual employees. For example, if an incident feels new to workers, introducing the novelty stressor, company game days that feature simulated incidents could reduce some of that stress, Woo said.
"Whenever I talk about wellness, someone will always ask, 'Well, if you can't handle the stress of the job, should you do this job?'" Woo said. "To me, that's kind of being like, 'Well, you know what, if you want to be a firefighter, don't have a suit. Just go and see how long you can stand by the fire, and whoever can stand the longest should be a firefighter, but everyone else, it's not for you.'"
Instead, Woo said, "Of course, we build a better suit."
The most important way organizations can optimize the shift to a remote workforce is to establish and maintain clear boundaries around work time that otherwise slip as employees juggle new schedules and commitments, according to other speakers at the SRE from Home event.
"Working at a sustainable pace is paramount," said Holly Allen, head of reliability at Slack, in a virtual Q&A after a session at the event. "Leaders have to model this and be vocal about taking time off when we need it."
A DevOps model for catching up in digital transformation
Companies already invested in digital transformation were better able to weather the pandemic, but those that were already behind on tech modernization are struggling even more in the wake of the crisis.
"We're in the very beginning stages of it all," said Milan May'r, a software engineer at a Fortune 500 company he requested not be named, in a virtual discussion during the SRE from Home event. "Right now, two of us are using our spare time to create uptime metrics for a vital internal website -- it seems nobody can even measure total uptime for a vital portal."
Instead, May'r said in a separate interview, IT ops pros at his company are focused on traditional tasks such as responding to individual incidents, rather than improving system resilience. Resilience work has been hindered further by the difficulty of getting people to collaborate remotely, whether that means staying connected via chat apps or turning on cameras for video conferences, he said.
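A basic uptime metric of the kind May'r mentions can be approximated from periodic health checks. The Python sketch below assumes a hypothetical list of pass/fail check results collected at even intervals; it does not reflect how May'r's team actually gathers its data.

```python
def uptime_percent(check_results):
    """Percentage of health checks that succeeded.

    `check_results` is a hypothetical list of booleans, one per
    evenly spaced health check (True means the site responded).
    """
    if not check_results:
        raise ValueError("at least one health check is required")
    successes = sum(1 for ok in check_results if ok)
    return 100.0 * successes / len(check_results)


# Hypothetical day of checks: 99 successes and 1 failure.
checks = [True] * 99 + [False]
print(uptime_percent(checks))
```

Because each check stands in for an equal slice of time, the result is only as precise as the check interval; a shorter interval gives a closer estimate of true uptime.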
Companies that weren't fully on board with a DevOps model and digital transformation are now caught in a vicious cycle, said Sanjeev Sharma, co-founder and principal analyst at Accelerated Strategies Group, who consulted with enterprises during the COVID-19 crisis.
"Companies realize they need to step up their transformation," Sharma said. "But most ops teams have been too overwhelmed by the shift to remote work to step back and implement changes to processes and workflows."
Sharma's recommendation to clients has been to dedicate a team to transformation tasks while the rest of the IT staff handles day-to-day troubleshooting. This is easier said than done amid an ongoing IT skills shortage, but it isn't impossible, Sharma said.
Most companies are familiar with DevOps principles that represent radical change, such as frequent application updates and high-velocity deployment, known in DevOps and Lean manufacturing circles as kaikaku. But organizations struggling with digital transformation can also turn to a less-emphasized concept of incremental change, or kaizen, said Dawn Parzych, developer advocate at LaunchDarkly, in an SRE from Home presentation.
"Kaizen is where you need to slowly improve things," Parzych said, whether it's juggling parenting and distance learning for children while working from home or developing software. "DevOps is founded on the three ways of flow, feedback and learning ... you want feedback in short, amplified cycles, and then you take that feedback and incorporate it."