I have been in touch with cPanel support regarding this issue. They don't consider it a "bug", but a "design oversight" that should be addressed through "feature requests".
The new cPanel back-up system is not properly aware of disk space. cPanel support has stated that it does check for free space at the beginning of the task run, though I don't know what value it assumes to be "safe", as it regularly runs the disk down to nothing, leaving my disk array with 0 bytes free and crashing MySQL, user sessions, etc.
Dual Quad/Hex Core Xeon
4x 240GB SSD in Hardware RAID10 (~420GB primary partition with ~250-280GB used).
60TB NFS File Share mapped as an additional local folder to move back-ups to (~120GB of compressed back-up data on a daily basis). NFS Link : 1GBPS (shared)
Not keeping a local copy of back-ups
Creating daily back-ups Sunday - Saturday with 14-day retention
Creating weekly back-up with 4 copies retained
Creating monthly back-up with 2 copies retained
ISSUE 1: The WHM back-up is not aware of disk space as it cycles through accounts creating back-ups. It will run the disk array to 0 bytes free if given the opportunity, in turn crashing the server. The processor should check free space before it starts the next account back-up.
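To make the request in ISSUE 1 concrete, here is a minimal sketch of a per-account free-space check. It is not how cPanel works internally; the names run_backups, backup_one and reserve_bytes are made up for illustration, and the per-account size estimate is assumed to come from somewhere else (for example the account's current disk usage).

```python
import shutil


def free_bytes(path):
    """Free space, in bytes, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free


def run_backups(accounts, staging_path, reserve_bytes, backup_one,
                get_free=free_bytes):
    """Back up accounts one at a time, checking free space before each one.

    accounts      -- list of (user, estimated_backup_bytes) pairs (hypothetical input)
    reserve_bytes -- free space that must remain after the back-up is written
    backup_one    -- callable that packages a single account (hypothetical hook)
    """
    done, skipped = [], []
    for user, estimated_bytes in accounts:
        # Re-check before every account, not just once at the start of the run.
        if get_free(staging_path) - estimated_bytes < reserve_bytes:
            skipped.append(user)  # a real processor might wait and retry instead
            continue
        backup_one(user)
        done.append(user)
    return done, skipped
```

The point is only that the check happens inside the loop, so one large account cannot push the array to 0 bytes free.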
ISSUE 2: As stated in the use scenario above, we are using an NFS share, but I imagine that with other additional locations like Amazon AWS the overall transit speed could be fairly restrictive. Because the disk array is pure SSD, the back-ups are packaged much faster than they can be transferred. This leaves roughly 100GB of back-up data temporarily stored on the server (increasing the disk space requirement heavily), pending transfer to the 'additional location'. It would be nice if the back-up packager kept track of how many back-ups were pending transfer and entered a wait state until the pending transfers caught up, thus reducing the overall space requirement. While in our scenario the overall temp storage isn't that large, I can only imagine how quickly this would tip the scale on larger disk arrays with a lot of accounts/data stored. That said, I can certainly understand that some would want all back-ups to complete in a timely manner (i.e. not space restricted), so maybe this should be a controllable option/feature in the back-up configuration.
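As a rough illustration of the wait state described in ISSUE 2, the sketch below caps how many bytes of finished back-ups may sit on local disk waiting for transfer. The class name TransferBudget and its parameters are hypothetical and not part of cPanel; passing None as the limit stands in for the "not space restricted" option.

```python
import threading


class TransferBudget:
    """Caps the bytes of packaged back-ups waiting locally for transfer."""

    def __init__(self, max_pending_bytes=None):
        # None means "no limit", i.e. package everything as fast as possible.
        self.max_pending_bytes = max_pending_bytes
        self.pending_bytes = 0
        self._cond = threading.Condition()

    def _fits(self, size):
        if self.max_pending_bytes is None:
            return True
        # Always let one back-up through when nothing is pending, so an
        # account larger than the whole budget cannot stall the run forever.
        return (self.pending_bytes == 0
                or self.pending_bytes + size <= self.max_pending_bytes)

    def reserve(self, size, timeout=None):
        """Called by the packager before it starts an account.

        Blocks until `size` bytes fit in the budget; returns False on timeout.
        """
        with self._cond:
            ok = self._cond.wait_for(lambda: self._fits(size), timeout)
            if ok:
                self.pending_bytes += size
            return ok

    def release(self, size):
        """Called by the transfer worker once a back-up has left local disk."""
        with self._cond:
            self.pending_bytes -= size
            self._cond.notify_all()
```

The packager would call reserve() before each account and the transfer side would call release() after each upload, so fast SSD packaging never gets more than the configured amount ahead of a slow link.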
ISSUE 3: I am not exactly sure what the root cause is, and while it doesn't happen often, we will at times end up with failed back-up packages where there is uncompressed user data/folder structure in the /backup/[day]/[user] folder alongside a partial tarball. This isn't happening every day, but at least once every few days, and it doesn't seem to clean itself up. As a result I can end up with 40-100GB of 'bad/temp' back-up data that needs to be micro-managed. Multiply that by 20-30 or a few hundred servers, and it quickly becomes an issue. To that end, it would be nice if the back-up processor reviewed local/temp data from previous days and auto-cleaned/erased it appropriately.
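For ISSUE 3, this is roughly the kind of clean-up pass I have in mind, sketched under some assumptions: the layout is /backup/[day]/[user] as described above, a finished back-up is a single archive file, and therefore any directory left under a day folder past a cut-off age is treated as a leftover. The function names are made up, and it only reports by default; it does not try to identify partial tarballs since I don't know how those are named.

```python
import os
import shutil
import time


def find_stale_dirs(backup_root, max_age_seconds, now=None):
    """List uncompressed per-user directories older than the cut-off.

    Assumes backup_root/<day>/<user> directories are leftovers from failed
    runs, because completed back-ups are expected to be plain archive files.
    """
    now = time.time() if now is None else now
    stale = []
    for day in sorted(os.listdir(backup_root)):
        day_path = os.path.join(backup_root, day)
        if not os.path.isdir(day_path):
            continue
        for entry in sorted(os.listdir(day_path)):
            path = os.path.join(day_path, entry)
            # Only directories count; archive files are left alone.
            if os.path.isdir(path) and now - os.path.getmtime(path) > max_age_seconds:
                stale.append(path)
    return stale


def clean_stale_dirs(backup_root, max_age_seconds, delete=False):
    """Report stale directories, and remove them only when delete=True."""
    stale = find_stale_dirs(backup_root, max_age_seconds)
    if delete:
        for path in stale:
            shutil.rmtree(path)
    return stale
```

Running it with delete=False first gives a list to review before anything is erased.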
Feel free to add your thoughts / suggestions guys.