Brett Barwick, Principal Software Engineer at SIOS Technology, provides the engineering team with in-depth expertise in SAP. He is also an AWS Certified Solutions Architect — Associate, well versed in the cloud. In this episode of Let’s Talk, Barwick discusses minimizing downtime during planned maintenance for SAP HANA with SIOS Protection Suite for Linux.
The topics we covered include:
- Talk about what the typical maintenance/upgrade process looks like for a highly-available SAP HANA database.
- Since the downtime during the maintenance process is caused by switching the database over between cluster nodes, it sounds like this is the best place to focus in order to minimize downtime. What options does a database administrator have when performing this switchover?
- How does the “Takeover with Handshake” feature work? What advantages does it provide over traditional HSR takeover?
- Why is it important to always perform the HSR takeover through the company’s HA products?
- Sounds like maintenance and regular testing are very important in an HA environment. Tips to help administrators manage their highly-available SAP HANA clusters.
- Importance of documentation to avoid the pitfalls of tribal knowledge.
Swapnil Bhartiya: Brett Barwick, Principal Software Engineer at SIOS Technology. Brett, it’s great to have you back on the show.
Brett Barwick: Yes, thanks for having me back on.
Swapnil Bhartiya: And today’s topic is kind of my interest as well, which is more or less about minimizing downtime during planned maintenance for SAP HANA with the SIOS Protection Suite for Linux. But before we go deeper into that, can you talk a bit about what the typical maintenance/upgrade process looks like for a highly available SAP HANA database?
Brett Barwick: Generally when people are protecting their database in a high availability cluster environment, and they want to do some maintenance, they’re generally going to be using a rolling upgrade process. At a high level, what this means is, you have your database running on the primary server, so clients can still connect to it. You’re going to go ahead and upgrade, apply patches, do all your maintenance on your standby node. Then you can switch over the database so that it’ll be running on the standby node now. Once clients can reconnect to that, you can then go and upgrade the original primary node. And then you can switch back.
At that point, both servers are upgraded and you’re back in the original configuration that you had. I think the critical things there are those two switchovers. When you switch from the primary to the secondary and you switch back, those are the parts where, as an administrator, you’re sitting there thinking, okay, is it switched over yet? Because that’s the time when users can’t connect to the database. So that’s where you want everything to be very reliable, very automated. You want to be able to have confidence that the switchover’s going to take place quickly and that all the correct steps are going to be followed.
So in SIOS Protection Suite for Linux, we have these things called Application Recovery Kits, which you can basically think of as plugins for different types of applications and databases. We have them for all different kinds of things, but we do have one for HANA databases too, and that’s part of the job of the Recovery Kit: it gives you a quick and easy way to switch your resource from the primary to the secondary and back. It does it in an automated way, it follows all the right steps, all the best practices that SAP recommends, and just takes a lot of those manual steps away from you as an administrator so that you can focus really just on the maintenance tasks that you need to perform on the servers.
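The rolling upgrade described above can be sketched in a few lines of Python. This is only an illustrative model of the ordering of steps, not SIOS or SAP code: the `Node` and `Cluster` classes, the node names, and the version labels are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str      # hypothetical node name
    version: str   # hypothetical software version label


@dataclass
class Cluster:
    primary: Node
    standby: Node
    switchovers: int = 0
    log: list = field(default_factory=list)

    def upgrade_standby(self, new_version: str) -> None:
        # Maintenance only ever touches the node that is not serving clients.
        self.standby.version = new_version
        self.log.append(f"upgraded {self.standby.name}")

    def switchover(self) -> None:
        # The only step during which clients cannot connect.
        self.primary, self.standby = self.standby, self.primary
        self.switchovers += 1
        self.log.append(f"switched to {self.primary.name}")


def rolling_upgrade(cluster: Cluster, new_version: str) -> None:
    cluster.upgrade_standby(new_version)  # 1. patch the standby
    cluster.switchover()                  # 2. move the database onto it
    cluster.upgrade_standby(new_version)  # 3. patch the original primary
    cluster.switchover()                  # 4. switch back


cluster = Cluster(Node("node-a", "old"), Node("node-b", "old"))
rolling_upgrade(cluster, "new")
print(cluster.log)
```

The model makes the point from the interview visible: both nodes end up upgraded and in their original roles, and the only client-facing interruptions are the two switchovers.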
Swapnil Bhartiya: Excellent. So, if I’m not wrong, the downtime happens mostly when switching the database between cluster nodes. And you mentioned the Application Recovery Kit. So from an administrator’s point of view, which you also touched on a bit, what are the best practices, or what options do they have? Of course, you mentioned the ARK there, but let’s go a bit deeper and look at how they can minimize that downtime.
Brett Barwick: If you drill down into what actually happens in the switchover process, there’s a couple ways that the switchover can happen on the database side. So the database … by the way, I’m talking about databases that are using what’s called HANA System Replication, and this is SAP’s native mechanism that it designed to replicate data from one HANA database to another. So it turns out when you want to switch over the database, part of the process is going to be that you’re going to have to take your running secondary database and promote it to become the new primary, so clients can write to it.
And it turns out there’s two ways essentially to do that. So the first I call traditional system replication takeover, and this has been around since system replication has been around. And the idea there is, in order to make sure that everything’s safe, before you promote the secondary database to primary, you need to completely stop the original primary database. Make sure nobody’s writing to it, make sure everything’s in sync, then you can promote the secondary database, then you can reregister the original primary as the new secondary, and then restart that database.
There’s also a newer version of HSR takeover, which is called takeover with handshake. This was introduced by SAP in the HANA 2.0 SPS 04 release, around April 2019. And the big difference there is that instead of completely stopping the database, their idea was that it’s good enough to just put it in a frozen, suspended state, which is a little bit faster, and that allows you to then promote the secondary database more quickly, and get users back accessing the database more quickly.
So in terms of kind of under the covers, what’s going on, you have those two different takeover types. In terms of how you would actually perform the switchover or the takeover, there are a few ways. So if you happen to not be using HA software, you could use one of SAP’s administrative tools, either a graphical tool like HANA Studio or HANA Cockpit, or if you like the command line, you could use the hdbnsutil utility. But since we’re talking about HA environments, I do want to emphasize that you want to make sure that you’re using your HA software to do these switchovers. The basic idea is that you want to make sure that you’re telling the HA software, hey, I want to move this resource from one node to the other, so that if it moves underneath the covers, the HA software isn’t surprised by that and it doesn’t try to restart the database on the node where it just stopped. So again, if you’ve got HA software in place, make sure you’re using that to switch over the database.
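To make the difference between the two takeover types concrete, here is a small Python sketch that lists the steps on the critical path before clients can write to the new primary, as described in the interview. The enum, function names, and step wording are illustrative assumptions; the sketch does not call any SAP tool, and it only covers the follow-up steps the interview spells out for the traditional takeover.

```python
from enum import Enum


class TakeoverType(Enum):
    TRADITIONAL = "traditional"
    WITH_HANDSHAKE = "with_handshake"


def steps_until_new_primary(kind: TakeoverType) -> list:
    """Steps that must finish before the secondary can accept writes."""
    if kind is TakeoverType.TRADITIONAL:
        quiesce = "stop the original primary completely"
    else:
        # Takeover with handshake only suspends the primary.
        quiesce = "suspend the original primary"
    return [quiesce, "promote the secondary to primary"]


# Follow-up steps described in the interview for the traditional takeover.
TRADITIONAL_FOLLOW_UP = [
    "re-register the original primary as the new secondary",
    "restart that database",
]

for kind in TakeoverType:
    print(kind.value, "->", steps_until_new_primary(kind))
```

Both paths end with the same promotion step; the only difference on the critical path is a full stop versus a suspend, which is where the time saving comes from.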
Swapnil Bhartiya: How important is it to perform the HSR takeover through the HA product, and what is the reason for that?
Brett Barwick: Yes, that’s a good question. So, it’s very important and this applies just generally to any HA software, not just SIOS Protection Suite for Linux. But the basic way that HA software works is you tell it your expected cluster state. So you say, okay, I expect the database to be running as primary on this cluster node. I want it to be running, but as secondary, on this other cluster node. So if anything changes in that expected state, say, as an administrator, you go beneath the HA layer to the actual database layer and stop the database where the HA software expects it to be running, well, your HA software is going to react to that. That’s its job. So it’s probably going to try to restart the database because it doesn’t know that you actually want to switch it over to the other node. So it’s just important to essentially keep the HA software in the loop. Say, hey, I really do mean that I want the database to become primary on this other node, that way it doesn’t take any unexpected actions.
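The "expected cluster state" idea can be illustrated with a toy reconciliation check in Python. This is a generic sketch of how HA software might compare expected and observed state, not the actual logic of SIOS Protection Suite or any other product; the `HAMonitor` class, the role strings, and the node names are hypothetical.

```python
class HAMonitor:
    def __init__(self, expected: dict):
        # Maps node name -> role the HA software expects ("primary" or "secondary").
        self.expected = dict(expected)

    def request_switchover(self) -> None:
        # Going through the HA layer updates the expected state first.
        primary = next(n for n, r in self.expected.items() if r == "primary")
        secondary = next(n for n, r in self.expected.items() if r == "secondary")
        self.expected[primary] = "secondary"
        self.expected[secondary] = "primary"

    def reconcile(self, observed: dict) -> list:
        # Any mismatch between expected and observed state triggers a recovery action.
        return [
            f"restore {role} on {node}"
            for node, role in self.expected.items()
            if observed.get(node) != role
        ]


monitor = HAMonitor({"node-a": "primary", "node-b": "secondary"})

# Admin stops the database directly, beneath the HA layer.
surprise = monitor.reconcile({"node-a": "stopped", "node-b": "secondary"})
print(surprise)

# Admin asks the HA layer to switch over, then the database roles change.
monitor.request_switchover()
planned = monitor.reconcile({"node-a": "secondary", "node-b": "primary"})
print(planned)
```

In the first case the monitor tries to restore the primary on the node where it was stopped; in the second, expected and observed state agree, so no recovery action is taken.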
Swapnil Bhartiya: These days we talk a lot about SREs and chaos engineering, where there’s a lot of stress on testing your systems against a lot of things. So can you talk about the importance of regular testing in an HA environment? Why should people do that?
Brett Barwick: Yes, absolutely. So it’s extremely important and I know customers nowadays, there’s a lot of pressure to go live, a lot of times they want to go live yesterday. So there can be some temptation to maybe cut steps here and there, but I do think it’s very important. I’ve got a few tips around this. So number one, always do really rigorous pre-production testing. You want to make sure that when you hit that go live day, that everything’s going to go smooth. You don’t want that to be the first time that you’ve run through a failure scenario with your HA software and then on that day, you discover a misconfiguration, or you discover some behavior that you didn’t expect. So, number one, it’s really important to put the software through its paces and also your configuration through its paces.
Second thing I would recommend is making sure that, as an IT team, you maintain a runbook, meaning a step-by-step guide for common failures or maintenance scenarios that you might expect to come up, one that has the exact steps laid out from your hardware level, your OS level, all the way up in the stack to your HA product, exactly what needs to be done to recover. Not only that, but also who’s responsible for doing those things. You don’t want something to happen in the middle of the night in one time zone when the person who’s responsible lives in another time zone. You want to make sure you know exactly who’s doing each step.
Third thing I would recommend is maintaining a QA or a test cluster that’s as close as possible to your production cluster. The idea there being that if you need to perform some maintenance, you want to make sure that you can go on to your test cluster and do essentially a dry run, make sure that all of your processes work the right way, make sure that your patches apply successfully, that way when you get on the production server and you have that scheduled maintenance window, you’re just more confident that things are going to run smoothly in the upgrade process.
And the last couple of tips apply to HANA specifically. So number one, just keep in mind that HANA is an in-memory database. So part of the reason why it’s appealing and why it’s so fast is that it’s storing records in memory. So, as a database admin, if you’re aware that certain tables or certain columns are frequently accessed, go ahead and configure the secondary system to preload those in memory. Just keep them loaded. That way, as soon as you have the switchover complete, as soon as you get a query, if it’s one of these commonly accessed tables or columns, you can get the data immediately. You don’t have to wait for HANA to load a terabyte of data into memory to respond to a query. So that’s the first one.
The second one is, if your version of HANA supports it, so again, this would be HANA 2.0 SPS 04 or later, and your HA software supports it, so if you’re using, say, SIOS Protection Suite for Linux, this would be version 9.5.2 or later, consider using the takeover with handshake switchover type, because remember this can help you reduce downtime: you’re not completely stopping the primary database, you’re just suspending it, and that’s going to allow clients to connect much more quickly after the switchover.
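The version requirements mentioned here (HANA 2.0 SPS 04 or later, and SIOS Protection Suite for Linux 9.5.2 or later) can be captured as a simple pre-check. The Python sketch below is an assumption-laden illustration: the function name and the way versions are passed in (a HANA major version, an SPS number, and a dotted product version string) are hypothetical and do not reflect how either product reports its version.

```python
MIN_HANA = (2, 4)        # HANA 2.0 SPS 04, per the interview
MIN_SPS_L = (9, 5, 2)    # SIOS Protection Suite for Linux 9.5.2, per the interview


def can_use_takeover_with_handshake(hana_major: int, hana_sps: int,
                                    sps_l_version: str) -> bool:
    """Return True if both version requirements from the interview are met."""
    # Compare numerically, so that "9.10.0" sorts after "9.5.2".
    sps_l = tuple(int(part) for part in sps_l_version.split("."))
    return (hana_major, hana_sps) >= MIN_HANA and sps_l >= MIN_SPS_L


print(can_use_takeover_with_handshake(2, 4, "9.5.2"))
print(can_use_takeover_with_handshake(2, 3, "9.6.0"))
```

Parsing the product version into a tuple of integers avoids the classic string-comparison mistake where "9.10.0" would sort before "9.5.2".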
Swapnil Bhartiya: Excellent. One additional question: an early point you made was to have guidelines and guidance in documentation. Something I’ve been hearing a lot about is what’s called tribal knowledge within companies. So how do you ensure that the practices companies are building move from team to team, irrespective of whether people move? Do you have any tips so companies can avoid tribal knowledge? I just want to stress the point that documentation should be in place so that what one person knows how to do does not go away with that person.
Brett Barwick: Right. Yes. Exactly. So essentially having your teams be siloed, I think is the terminology. I would think it … It’s very important that you’re sharing this knowledge between the database administrator teams, between your security teams, everybody should be aware on all of your teams to prevent this siloed knowledge that you’re talking about. Make sure everybody’s on the same page because when there’s a failure, when things are on fire, folks are panicking, that’s not the time when you want to be figuring out what to do and that’s not the first time you want to be on a certain team hearing that it’s your responsibility to do something certainly. So yes, I would agree. It’s very important to make sure you share the knowledge, share this runbook out with the team, make sure everybody is on the same page.