Impact Of Great Resignation On High Availability & Disaster Recovery

Guest: Cassius Rhue
Company: SIOS Technology
Show: Let’s Talk

Every industry has been affected by the great resignation (or shuffle), and remote work has opened up brand new opportunities for workers. But when it comes to technology, such a mass exodus has immediate and lasting ramifications. We invited Cassius Rhue, VP of Customer Experience at SIOS Technology, to talk about the company’s experience, how the great resignation has affected high availability and disaster recovery, and what can be done about it.

Key highlights of the discussion:

In 2021, we saw what many call the big shuffle or massive resignations across industries. What impact is it going to have on the high availability space and specifically when it is about keeping critical applications up and running?

“Every industry has been impacted by the great resignation or great shuffle or some of this awakening to new opportunities that are now available because the whole world went into this kind of remote work for many different industries. It created these opportunities for people to begin to consider work in places they hadn’t before and that has had a huge impact on high availability (HA) and disaster recovery (DR). Mainly as teams and people change roles and jobs and leave roles within an IT team responsible for critical HA infrastructure that leaves a lot of those teams scrambling to cover that loss of a person.”

Are there any trends that worry or surprise you?

“One of the trends we saw was people leaving to take other jobs. We also saw a trend where companies trying to protect themselves from that, or to recover from the loss of key staff and key knowledge, are choosing to go more the route of professional services or contracting, right? So they are losing resources and, rather than replacing cloud expertise in-house, choosing to augment it with cloud experts and consultants, hiring professional services to help them build out their HA architecture and IT administration.”

What impact has it had on day-to-day operations of businesses?

“A lot of what we’ve had to work on with those businesses in their day-to-day operations is helping them understand the HA space. They’ve lost some critical resource, so we are helping them understand their architecture and the products that they have deployed, and then, being HA experts in the space, reminding them of the critical tasks and operations that are a part of making sure applications are highly available.”

And if we just narrowed it to high availability and disaster recovery, can you also talk about what kind of impact it is having on this segment?

“One real impact we’ve seen in the last weeks is that, in the short term, you’ve lost a critical resource and your team is now short-staffed. We’re also seeing people working longer hours. You have people trying to make sure that they mitigate the risk of having lost team members in HA. It’s not a business where you can afford downtime, so you’re looking to make sure that even at the loss of knowledge, even at the loss of resources and staffing, you have plans in place, and that your partners, your contractors, and your consultants are all on board to make sure your applications stay available.”

I think that’s where the challenge of tribal knowledge also comes into play: avoiding knowledge going away with that team and creating a lot of technical debt in-house. Can you talk about this issue?

“You start assessing your team proactively. You start looking at where things are documented and what procedures you have in place. You proactively look at whether you can onboard new resources, and start thinking about staff augmentation or growing your staff to de-risk those critical pieces and parts. Before the great resignation took place, I had read some experts in our field talking about doing a sort of mock testing and chaos testing to simulate disasters, and then having your team walk through handling that disaster using their existing playbooks to figure out where the gaps are. So that’s an option for de-risking or managing your team so that you close some of these gaps that could occur as a result of the great resignation.”

Can you also talk about the importance of chaos testing for high availability and disaster recovery, so that it becomes part of their strategy?

“We worked with a customer that introduced sort of the same chaos testing, and then realized somewhere midway through that they did not have the person with the right administrative permissions in the exercise. And so that’s a discovery that they made that yes, you had admins on there, but they didn’t have the level of permissions needed to perform certain operations on the storage. That’s something you want to find out when everyone’s working their normal hours, not something you want to find out at 2:00 AM or 3:00 AM when there is a real disaster going on.”

The summary of the show is written by Jack Wallen

[expander_maker]

Swapnil Bhartiya: Hi, this is your host Swapnil Bhartiya and welcome to an episode of Let’s Talk. In 2021, we saw what many call the big shuffle or massive resignations across industries. It had a major impact on industries related to hospitality, but the waves were felt across all industries. To talk about its impact on the job market and on high availability and disaster recovery now in 2022, we have with us once again Cassius Rhue, VP of Customer Experience at SIOS Technology. Cassius, first of all, Happy New Year and welcome to 2022.

Cassius Rhue: Yeah. Happy New Year to you as well, and definitely Happy New Year and welcome to 2022. Looking forward to a great year and hoping that you are as well.

Swapnil Bhartiya: Excellent. I was talking earlier about the big shuffle of 2021; let’s hope that it will not become a trend in 2022 as well. But 2022 will face the impact of it. So from your perspective, if we just narrow down on the IT space, especially because we live in a cloud-centric, service-centric world, what impact is it going to have in that industry, specifically when it comes to keeping critical applications up and running?

Cassius Rhue: Right. That’s a good question. So as you stated, every industry has been impacted by the great resignation or great shuffle or some of this awakening to new opportunities that are now available because the whole world went into this kind of remote work for many different industries. It created these opportunities for people to begin to consider work in places they hadn’t before, and that has had a huge impact on high availability and disaster recovery. Mainly, as people change roles and jobs and leave roles within an IT team responsible for critical HA infrastructure, that leaves a lot of those teams scrambling to cover that loss of a person.

And that loss can include both technical and non-technical issues, right? For example, a single person may have been covering multiple roles that were critical to your IT infrastructure, and now you have a team trying to figure out how they’re going to identify what those roles were and then make a plan for covering them. The great resignation or shuffle also has an impact on knowledge within our HA and disaster recovery space, right?

As you mentioned earlier, maybe there were older adults who were always considering retiring and then the pandemic and the shift in the way the world operated gave them that last push so to speak, to go ahead and leave and what you’re seeing is 20 and 30 plus year experts exiting the field and taking with them a lot of knowledge about the infrastructure, the architecture, the applications, the databases, and that’s a lot of information that’s hard to replace in a short span.

So companies face these critical challenges of, how do we deal with these staffing changes? How do we deal with knowledge loss and now knowledge gaps? And of course, how do we deal with the roles that are exiting our teams? Now those are from the technical perspective, right? So a group of skills walks out the door and joins another company because they’ve chosen to go to a remote opportunity that’s now available to them. How do you replace those skills? Is it training? Is it staff augmentation with contractors?

And then of course you also think about the folks that are left on the team, and how that impacts them from a quality-of-life and also an emotional standpoint. So it has a huge impact on high availability infrastructure, not just from the personnel standpoint, but also from the risk factors that occur when you start thinking about shorter staffing, loss of roles, and loss of knowledge.

Swapnil Bhartiya: When we talked about the big shuffle or great resignation, the reports when you read, they were specific to specific industries, but there was nothing specific to the area that we are talking about. Were there any trends that you saw because you are in the [inaudible 00:04:54] and thinkers of you are actually, you do know what customers are feeling. Was there anything that you noticed? Any trend that worried or surprised you most?

Cassius Rhue: In our industry, we weren’t immune to it, so one of the trends we saw was people leaving to take other jobs. We also saw a trend where companies trying to protect themselves from that, or to recover from the loss of key staff and key knowledge, are choosing to go more the route of professional services or contracting, right? So they are losing resources and, rather than replacing cloud expertise in-house, choosing to augment it with cloud experts and consultants, hiring professional services to help them build out their HA architecture and IT administration.

So we’re starting to see that happen a lot more, especially as you think about how the pandemic necessitated some companies going to the cloud, and then of course now you’re losing those resources and still needing to augment them, so that’s one trend that I’ve seen. And as one of my predictions for 2022, I expect that trend to continue: companies will choose services and consultants to help them with cloud migration and cloud management of resources, especially as a result of the great shuffle.

Swapnil Bhartiya: Of course, nothing was expected or planned. What kind of impact is this sudden movement, this mass resignation movement, having on IT personnel and on actual businesses? Operations are something different, but what impact does it have on the day-to-day operations of businesses?

Cassius Rhue: Yeah. That’s a great question. So we work with several teams that are experiencing the results of the big shuffle, the great resignation. And a lot of what we’ve had to work on with those businesses in their day-to-day operations is helping them understand the HA space. They’ve lost some critical resource, so we are helping them understand their architecture and the products that they have deployed, and then, being HA experts in the space, reminding them of the critical tasks and operations that are a part of making sure applications are highly available.

We’ve had to do a lot more training as a result of the great resignation and shuffling of resources, and that’s a change you’re seeing in day-to-day operations. When you have people now taking over multiple roles, or you’re outsourcing work, or you’re working with knowledge consultants around cloud, the day-to-day work we do with our customers is reminding them of what their critical applications are, the importance of documenting their infrastructure and defining the architecture, and then some of the best practices around backups, checking monitoring and alerts, dealing with maintenance and patching, and having a regular cadence for testing and things like that.

Swapnil Bhartiya: And if we just narrowed it down for one second, specifically to high availability and disaster recovery, can you also talk about what kind of impact it is having on this segment?

Cassius Rhue: Yeah. And without naming any names or customers, one very real impact we’ve seen in the last weeks is on teams in the short term, right? You’ve lost a critical resource, your team is now short-staffed, and the impact we’re seeing is that you have people working longer hours, and you have people trying to make sure that they mitigate the risk of having lost team members in HA, right? It’s not a business where you can afford downtime. So you’re looking to make sure that even at the loss of knowledge, even at the loss of resources and staffing, you have plans in place, and that your partners, your contractors, and your consultants are all on board to make sure your applications stay available.

I think the biggest impact that we are seeing and will see is that companies are hyper-focused on making sure, with the prevalence of disasters, that they don’t experience one, or that the great resignation doesn’t trigger a great disaster in its wake. One of the problem areas that I’ve seen repeatedly in high availability spaces, and this is even before the great resignation, but the great resignation has made it more apparent, is keeping architecture, documentation, and design up to date, and keeping systems up to date with patches and maintenance. That was always important. Having test systems and the ability to deploy into test before going into production was always important too.

One of the things that it’s triggered in the great resignation is emphasis on how important it is to have those runbooks up to date, to have documentation up to date about your architecture, to know who your vendors are, to have a test system where you can go through your disaster scenarios, because of course, you’ve got people that are no longer with your team that may have been responsible, had the knowledge about where things were, and who to call when a disaster hit, what are the first steps to do when an outage occurs, if there’s an outage, where are the maintenance procedures defined?

So we are seeing that as something that I would like to highlight for companies and businesses: make sure you have that down if you have not yet experienced any fallout from the great resignation, and if you have, get those things in place. Something that’s become really critical is understanding the state of your infrastructure, where it’s documented, and who knows it. Right? And then staffing for that. So that’s one area I would highlight as something that was important even before and is critical now, because as you’re losing knowledge share for onboarding, you’ve got to have that information available for the next person as well.

Swapnil Bhartiya: Right. I think that’s where the challenge of tribal knowledge also comes into play: how do you make sure that knowledge does not go away with that team, creating a lot of technical debt in-house so that you are stuck? Can you talk about that also, how to avoid it and why it is important?

Cassius Rhue: Yeah. So let’s start with the importance of it. You’re absolutely right, tribal knowledge. We worked with a customer a few weeks ago and that was their exact problem. They were actually having to do a short-term contract with a former employee to bring him back in to help them, because there was just this large amount of tribal, undocumented knowledge about their infrastructure, about their applications, and about the role that the individual himself played. So it’s super critical that tribal knowledge be put into some sort of process document, or what a lot of companies call runbooks, so that they can take that runbook information and simply hand it to the next person, who can then execute it.

Without that, companies are at greater risk if that knowledge walks out the door. If that knowledge chooses to take another opportunity, then, to use the phrase we used to use, you are behind the eight ball, because you really have a lot of work to do to de-risk your environment, right? It’s not something that you can replace in a single hour if a 20-year expert in your field, in your applications, and in your business walks out and chooses another opportunity, and there’s not a whole lot documented there. So that’s on the importance side. And then what can companies do proactively, right?

You start assessing your team proactively, you start looking at where are things documented? What are your procedures that you have in place? Proactively looking at can you onboard new resources and start thinking about staff augmentation or growing your staff to de-risk those critical pieces and parts. I had read before the great resignation took place, some experts in our field talking about doing sort of mock testing and chaos testing to simulate disasters, and then trying to have your team walk through handling that disaster, using their existing playbooks and figure out where the gaps are. So that’s another option for kind of de-risking or managing your team so that you close some of these gaps that could occur as a result of the great resignation.

The other thing you want to do is just look at where your skills gaps are, right? So if you’ve lost employees to change or shuffle, what were the skills that they brought to the team? Whether that’s cloud, or database administration, or IT, or networking, or some other form of infrastructure or security, you are looking for those gaps, and then either hiring or training to fill them. So those would be some recommendations I would make there for just the team. For your HA in particular, you want to look for risks, and, like you hit on and we talked about earlier, you want to look at the day-to-day operations.

What are the day-to-day operations that are critical to your business? Is it backups, security scans, vulnerability testing? What are the applications that you have protected with HA? Look at your architecture, and look at the ones that you currently have in flux, the ones you don’t have HA protection for, where maybe you’re using dashboards or manual scripting at the moment. You want to understand what those are, and then you want to put mitigations in place for them. The last thing you want is for a person to have left without you realizing what they were doing, so that you suddenly fall behind on your day-to-day operations and realize that, for example, backups aren’t running, or replication jobs in your high availability solution have been blocked or turned off.
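As an illustration of the kind of check described above, and not something taken from the interview, here is a minimal Python sketch that flags recurring jobs which have been turned off or have not succeeded recently. The `JobStatus` record, the job names, owners, and timestamps are all hypothetical; a real team would populate this from its own backup, replication, and scanning tools.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class JobStatus:
    """Last known state of one recurring operational job (hypothetical record)."""
    name: str
    owner: str                # person or team currently responsible
    last_success: datetime    # when the job last completed successfully
    max_age: timedelta        # how old a success may be before it is a concern
    enabled: bool = True      # False if someone paused or turned the job off


def find_neglected_jobs(jobs, now):
    """Return (job name, reason) pairs for jobs that are disabled or overdue."""
    problems = []
    for job in jobs:
        if not job.enabled:
            problems.append((job.name, "disabled"))
        elif now - job.last_success > job.max_age:
            problems.append((job.name, "overdue"))
    return problems


# Illustrative data only: names, owners, and times are made up.
now = datetime(2022, 1, 10, 9, 0)
jobs = [
    JobStatus("nightly-backup", "former-admin",
              datetime(2022, 1, 6, 2, 0), timedelta(days=1)),
    JobStatus("replication", "storage-team",
              datetime(2022, 1, 10, 8, 55), timedelta(minutes=15), enabled=False),
    JobStatus("security-scan", "secops",
              datetime(2022, 1, 9, 23, 0), timedelta(days=1)),
]

for name, reason in find_neglected_jobs(jobs, now):
    print(f"{name}: {reason}")
```

Keeping an owner next to each job is a deliberate choice in this sketch: when someone leaves, the jobs still listed under their name are the first ones to review.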

Swapnil Bhartiya: One thing that is interesting when a company is hit with any kind of disaster is that that’s when they realize that it’s a big organization, not individual teams. And that’s when all the teams start getting together and scrambling to find it. And I think sometimes chaos [inaudible 00:16:13] does do that because they bring all these teams together to kind of not break things in line, but to test. So can you also talk about the importance of chaos [inaudible 00:16:22] for high availability and disaster recovery, so that it becomes part of their strategy?

Cassius Rhue: Yeah. Absolutely. So our director of support, Sandi Hamilton, is always talking about testing and the importance of having a test cluster, and that’s where we often see, if you have that available, right? Some hesitancy that companies, or businesses, or customers have is that they don’t want to do that chaos testing on a production cluster. So harking back to our director of support’s advice, if you have a test cluster available that mimics a lot of your production environment, you can really benefit greatly by doing chaos testing, because you obviously can try multiple things.

As you mentioned, you get other teams involved, right? So we work with teams that have different groups that are responsible for storage, a group responsible for networking, a different group responsible for the database, and another one for the applications. And when they’re doing their individual parts, of course things work smoothly. But if you can generate a disaster that forces each of them to think on the fly and understand how their part coordinates with or affects the other parts of the solution before there’s an actual disaster, that improves your disaster response, right?

Just as a real example, we worked with a customer that introduced the same sort of chaos testing, and then realized somewhere midway through that they did not have the person with the right administrative permissions in the exercise. Right? And so that’s a discovery that they made: yes, they had admins on there, but they didn’t have the level of permissions needed to perform certain operations on the storage. That’s something you want to find out when everyone’s working their normal hours, not something you want to find out at 2:00 AM or 3:00 AM when there is a real disaster going on.

The other thing that it provides is seeing the system operate under different scenarios, right? Disk failures, network failures, what happens when you have an unstable network in an HA environment, right? Testing out your fencing strategies. We worked with a customer that discovered during their chaos testing that, in their test environment, they had actually failed to deploy a fencing strategy. So prior to going into production, just by doing that level of random testing involving all of the different teams, they were able to discover some things that they needed to change. The same thing happens with parameters, application tuning, et cetera.
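The permissions gap described in that customer example can be checked before an exercise starts. The following is a minimal Python sketch under stated assumptions, not a description of any SIOS tooling: the permission names (such as `storage:failover`) and the participants are hypothetical placeholders for whatever a team’s own access system reports.

```python
def missing_permissions(required, participants):
    """Return {area: permissions that no participant in the exercise holds}."""
    # Union of every permission held by anyone taking part in the exercise.
    held = set()
    for perms in participants.values():
        held |= set(perms)

    gaps = {}
    for area, needed in required.items():
        missing = set(needed) - held
        if missing:
            gaps[area] = missing
    return gaps


# Hypothetical permission names and participants, for illustration only.
required = {
    "storage": {"storage:failover"},
    "network": {"network:reroute"},
    "database": {"db:promote-replica"},
}
participants = {
    "alice": {"storage:read", "network:reroute"},
    "bob": {"db:promote-replica"},
}

gaps = missing_permissions(required, participants)
for area, missing in sorted(gaps.items()):
    print(f"{area}: nobody in the exercise holds {sorted(missing)}")
```

In this made-up roster there is a storage administrator present, but only with read access, which mirrors the situation in the example: admins were in the room, yet none could perform the storage operation the scenario required.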

Swapnil Bhartiya: Cassius, thank you so much for, of course, talking about this big shuffle and how companies can kind of weave their path through this without having any real impact, but they do have to strategize and plan as you suggested. So thanks for your documentation there. And I look forward to our next discussions, but thanks for your time today.

Cassius Rhue: Thank you. And appreciate the time and look forward to the next discussion as well.

[/expander_maker]
