Prevent WHM Back-Up from Crashing Servers / Using All Disk Space

Tcalp shared this idea 2 years ago
Open Discussion

Hi Everyone,

I had been in touch with cPanel support regarding this issue and they don't feel it's a "bug", but a "design oversight" that should be addressed in "feature requests".

The new cPanel back-up system is not properly aware of disk space. cPanel support has stated that it does check for free space at the beginning of a task run, though I don't know what value it assumes to be "safe", as it regularly runs itself into nothing, leaving my disk array with 0 bytes free and thus crashing MySQL, user sessions, etc.

Use Scenario:

Dual Quad/Hex Core Xeon

4x 240GB SSD in Hardware RAID10 (~420GB primary partition with ~250-280GB used).

60TB NFS file share mapped as an additional local folder to move back-ups to (~120GB of compressed back-up data on a daily basis). NFS link: 1Gbps (shared)

Not keeping a local copy of back-ups

Creating daily back-ups Sunday - Saturday with 14-day retention

Creating weekly back-up with 4 copies retained

Creating monthly back-up with 2 copies retained

ISSUE 1: The WHM back-up processor is not aware of disk space as it cycles through accounts creating back-ups; it will run the disk array to 0 bytes free if given the opportunity, in turn crashing the server. The processor should check available space before running the next account back-up.
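The kind of per-account check I'm describing could be as simple as this sketch. To be clear, this is hypothetical, not cPanel's actual code; the 20GB floor, the /backup path, and the account list are all my own assumptions for illustration:

```shell
#!/bin/sh
# Hypothetical sketch: check free space on the backup partition before
# each account, and halt instead of running the array to 0 bytes free.
# The 20GB floor and the placeholder account list are assumptions.

BACKUP_DIR=/backup
[ -d "$BACKUP_DIR" ] || BACKUP_DIR=/   # fall back so the sketch runs anywhere
MIN_FREE_KB=$((20 * 1024 * 1024))      # 20GB safety floor, in 1K blocks

free_kb() {
    # available 1K blocks on the filesystem holding $1 (POSIX df, column 4)
    df -Pk "$1" | awk 'NR==2 {print $4}'
}

for user in alice bob carol; do        # placeholder accounts
    if [ "$(free_kb "$BACKUP_DIR")" -lt "$MIN_FREE_KB" ]; then
        echo "Low space on $BACKUP_DIR; halting before backing up $user" >&2
        break
    fi
    echo "would run the account back-up for $user here"
done
```

Even a crude threshold like this would turn "server crashes at 3am" into "backup run pauses and logs a warning", which is the whole point.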

ISSUE 2: As stated in the use scenario above, we are using NFS, but I imagine that with other additional destinations like Amazon AWS the overall transit speed could be fairly restrictive. With the disk array being pure SSD, the back-ups are knocked out in quick fashion. This leads to roughly 100GB of back-up data temporarily stored on the server (increasing the disk space requirement heavily), pending transfer to the 'additional destination'. It would be nice if the back-up packager were aware of how many back-ups were pending transfer and entered a wait state for the pending transfers to catch up, thus reducing the overall space requirement. While in our scenario the temporary storage isn't that large, I can only imagine how quickly this would tip the scale on larger disk arrays with a lot of accounts/data stored. That said, I can certainly understand that some would want all back-ups to complete in a timely manner (i.e. not space restricted), so maybe this should be a controllable option/feature in the back-up configuration.
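The proposed wait state could look something like the following sketch. The queue directory, tarball naming, and pending limit are illustrative assumptions on my part, not how cPanel actually stages transfers:

```shell
#!/bin/sh
# Sketch of the proposed "wait state": pause packaging while too many
# finished back-ups are still waiting to ship to the remote destination.
# QUEUE_DIR, the *.tar.gz naming, and MAX_PENDING are assumptions.

QUEUE_DIR=${QUEUE_DIR:-/backup/pending}
MAX_PENDING=${MAX_PENDING:-3}

pending_count() {
    # count completed tarballs not yet transferred off the server
    ls "$QUEUE_DIR"/*.tar.gz 2>/dev/null | wc -l
}

wait_for_transfers() {
    # block until the transfer backlog drops below the limit
    while [ "$(pending_count)" -ge "$MAX_PENDING" ]; do
        sleep 30    # let the NFS/remote uploads catch up
    done
}

wait_for_transfers
echo "queue has room; safe to package the next account"
```

Capping the backlog at N tarballs bounds the temporary disk requirement to roughly N times the largest account, instead of the whole day's run.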

ISSUE 3: I am not exactly sure what the root cause is, and while it doesn't happen often, we will at times end up with failed back-up packages where there is uncompressed user data/folder structure in the /backup/[day]/[user] folder alongside a partial tarball. This isn't happening every day, but at least once every few days, and it doesn't clean itself up. As a result, I can end up with 40-100GB of 'bad/temp' back-up data that needs to be micro-managed. Multiply that by 20-30 or a few hundred servers, and it quickly becomes an issue. To that end, it would be nice if the back-up processor reviewed local/temp data from previous days and auto-cleaned/erased it appropriately.
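The auto-clean pass I have in mind is roughly this. It assumes the /backup/[day]/[user] layout described above and treats anything at that depth older than a day as a stale leftover; the one-day cutoff is my own assumption, and you'd obviously want it longer than any legitimate transfer window:

```shell
#!/bin/sh
# Sketch of the suggested auto-clean: before a new run, remove leftover
# per-user work directories and partial tarballs from earlier days.
# Assumes the /backup/[day]/[user] layout; the 24h cutoff is arbitrary.

clean_stale() {
    # anything two levels deep (/backup/<day>/<user>) and older than
    # 24 hours is presumed to be a failed/abandoned package
    find "$1" -mindepth 2 -maxdepth 2 -mtime +0 -exec rm -rf {} + 2>/dev/null
}

clean_stale /backup
```

Run at the start of each backup window, this would keep those 40-100GB piles of bad data from accumulating without anyone micro-managing them.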

Feel free to add your thoughts / suggestions guys.

Comments (10)


I agree with some kind of "stop" on backups filling up 100% of space and therefore crashing the server. We've had this a lot recently, which is my main reason for implementing sshfs backups to a remote folder. (This also avoids thrashing the local SSDs and eating into their write endurance.)

If using local storage, it should check the available space before writing the next backup / creating the compressed file. If it checked, and waited for other files to transfer or be deleted, the problems you have wouldn't occur.

However, as you say, it doesn't check, it goes on writing files until it runs out of space then crashes itself.

Some people on SATA may have 20GB available, and as 40GB of backups write and transfer they can get away with it; once on SSD, however, writes are so much faster that the transfers can't keep up! (Unless you upgrade to 1GB.)


I agree 100% -- same exact situation here. SSD drives with plenty of room for production, but when backups can potentially double the utilized space, it definitely has the potential to cause problems.

I've also run into the partial-compression issue you mention, though come to think of it, it's been a while now (a month?) since I last saw it.

Originally I was thinking it should compress and start transferring the first account while the second account starts compressing, and then wait on the third compression until the first is done.

I like your idea better though -- accomplish as much as possible while still leaving a usable/safe buffer, and then wait if necessary if transfers are holding things up.

Overall, this is a very real issue. Keeping storage utilization under 50% just because of backups is nuts, and the only alternatives at the moment are disabling backups (not happening) or potentially temporarily crashing the server every night when backups run and space runs out.


Our sshfs setup has been successful at avoiding this issue. It's easy to install, and we now set backups to go to /mnt/backups, which is a mounted drive on a high-storage VPS elsewhere.

We then select "keep backups in the backup folder" and turn off that destination in remote destinations.

Effectively, what it's now doing is generating the backups directly onto the remote destination, not locally.

You can also still have a second backup destination; just set it as normal in Remote Destinations.
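For anyone wanting to try the setup described above, the mount itself is roughly as follows. The host, user, and remote path here are examples, and sshfs needs to be installed on the server (e.g. the fuse-sshfs package on CentOS):

```shell
# Example sshfs mount for the workaround above; hostnames and paths
# are placeholders, not a recommendation of specific settings.

mkdir -p /mnt/backups
sshfs backupuser@storage.example.com:/srv/backups /mnt/backups \
    -o reconnect,allow_other

# Then, in WHM's Backup Configuration: point the backup directory at
# /mnt/backups, enable retaining backups in the default backup
# directory, and disable the separate remote destination, so archives
# are written straight onto the remote mount.
```

The trade-off is that packaging speed is now bounded by the network link, which conveniently also solves the "SSD writes outrun the transfers" problem.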


This is actually becoming a bigger problem every day... is there any chance of this being looked at officially any time soon, or should I be looking at hacking together my own workaround?


I wouldn't hold your breath. I am a cPanel partner and originally raised the issue as a cPanel bug (as, in my opinion, a lack of 'logic' in their coding that causes server crashes should be considered a critical-level bug). Sadly, the tech team didn't agree. I took the issue up with my account rep and was later passed on to a meeting with one of the development managers; while the development manager heavily agreed with my sentiment, I haven't heard anything since.

In the end, this is where this has ended up.


Hey all! We've made some adjustments to how backup retention can be configured in version 64, and I think these changes at least somewhat help with the problems in this request. There's documentation that you can review, but we also have a YouTube video that communicates the information in a visual way. Can you take a look and let me know how you feel about the retention options in version 64?


Hi Benny,

Not really. It is really hard to say what percentage of disk space is needed for the backups. The only real solution is to check disk space before starting the backup of each account. Personally, I think this is more a bug than a feature request. An automated process like backups (especially one that depends on external services) should not be able to crash a server so easily. There need to be checks in place. In my opinion, this is an urgent need.


I don't want to set a "space free" that is calculated -before- backup starts... that's not much use. If the backup script kept an eye on free space as it progressed, between each account -- now that would be much more useful.


I agree with brt. Equally, what is described in the video in terms of back-up retention in failure scenarios seems like a huge waste of resources. Keeping successful copies is great, but hanging onto bad dailies as well creates a critical scenario.

Also, as far as retention goes, it would be nice to have additional options for monthly retention (monthly / bi-monthly / quarterly / semi-annually).


Thanks everyone for the feedback! One of our feature teams is looking at the backup system as part of the work they're doing in version 66. I'm going to see where we're at specifically and ask them if this is something they could act on.