With vSphere 6, snapshot consolidation issues are a thing of the past!

Snapshot consolidation (or commit) operations in VMware vSphere have always been a problem, especially for large and really active virtual machines. But vSphere 6 has introduced some changes that are probably going to make commits a problem of the past!

A new consolidation method

Before vSphere 6.0, the consolidation and commit phases of any VM snapshot has always followed the same procedure: an additional helper snapshot was created to “freeze” not just the base virtual disk, but also the snapshot disk, and once the changes stored in the snapshot disk have been merged into the base disk, the helper snapshot was also committed, and at some point the I/O was not directed anymore to the snapshot, but back to the original disk.

This process sounds simple, but once you start looking at the details are not that easy, and in some production environments this behaviour have caused problems. If the virtual machine has a decent I/O write activity, also the helper snapshot could grow in size, and in order to commit the changed data back into the base disk, an additional helper snapshot may be created. Cormac Hogan wrote a great post to explain the situation in vSphere 5.5, and also how it has been improved in vSphere 6.0. You can read his great post Snapshot Consolidation changes in vSphere 6.0, here I’m only reporting a part of it:

What we did previously is used a helper snapshot, and redirected all the new I/Os to this helper snapshot while we consolidated the original chain. Once the original chain is consolidated, we then did a calculation to see how long it would take to consolidate the helper snapshot. It could be that this helper snapshot has grown considerably during the consolidate operation. If the time to consolidate the helper is within a certain time-frame (12 seconds), we stunned the VM and consolidated the helper snapshot into the base disk. If it was outside the acceptable time-frame, then we repeated the process (new helper snapshot while we consolidated original helper snapshot) until the helper could be committed to the base disk within the acceptable time-frame.

We retried this process for a defined number of iterations, but we have had situations where, due to the amount of I/O in flight, we could never successfully consolidate the snapshot chain and helpers.

In vSphere 6.0, this consolidation process has been dramatically changed:

In vSphere 6.0 the snapshot consolidation process also uses the mirror driver [as in Storage vMotion]. With the mirror driver mechanism, changes to the VM are written to the active VMDK and the base disk (while protecting write order) during consolidation. One should now hopefully see snapshot consolidations completing in 1 pass (with minimal or indeed no helper disks) and with hopefully a dramatically shorter stun time, and a much small chance of consolidation failure.

Testing the new consolidation process

This sounded too good to be true, so my colleague Tom Sightler, Solutions Architect in Veeam, decided it was time to run some tests to prove this change. Tom shared with me his findings, so thanks Tom for the reports!

The test involved two servers, one with ESXi 5.5 (5.5 Update 3) and one with ESXi 6.0 (6.0 Update 1b), while the rest of the environment is exactly the same: the same exact server model and characteristics, same network, and also the same NFS datastore to run the test virtual machine. This VM is a CentOS 7 minimal install (text only). It has 12 virtual disks, 16GB each, all stored in the NFS Datastore – 1 disk for the Operating System, while the other 11 disks are just attached, but not even formatted or mounted. Why this setup? Because the biggest problems for snapshot consolidation is observed with customers that have lots of VMDKs because vSphere consolidates multiple disks and their snapshots of the same VM sequentially. Even with a completely empty VM (the 11 disks are not even formatted, so they will be completely idle) you usually end up with a stun of around 1second per disk, and since VMkernel goes through each disk and consolidates the last helper, with 11 disks you will almost always have more than 10 seconds stun.

And remember, this is a completely idle VM, there is absolutely no I/O activity. Think about a really active VM like database machines for example, where setup with multiple disks are pretty common: the problem here can we way worse. It’s not the hardware in use, there’s no backup software involved, it’s all about the snapshot technology.

For each test, two parameters have been measured: a network ping executed each second, and the effective stun time read from VMware logs. The first gives you the result has it can be seen from another application or user over the network connecting to the VM, while the second gives you the exact time spend by VMware engine to manage snapshot creation and consolidation.

ESXi 5.5 U3

In order to verify the effects of snapshot creation and removal, different tests have been conducted. First, we took a snapshot, because even the snapshot creation requires at some point an I/O redirection from the base disk to the snapshot disk, and even if this time is minimal, it’s not zero.

2016-02-05T22:52:22.064Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 1960925 us

Reply from 192.168.60.159: bytes=32 time=1ms TTL=62
Reply from 192.168.60.159: bytes=32 time=2ms TTL=62
Reply from 192.168.60.159: bytes=32 time=1ms TTL=62
Request timed out.
Request timed out.
Reply from 192.168.60.159: bytes=32 time=3ms TTL=62
Reply from 192.168.60.159: bytes=32 time=22ms TTL=62
Reply from 192.168.60.159: bytes=32 time=2ms TTL=62
Reply from 192.168.60.159: bytes=32 time=2ms TTL=62

Snapshot creation required 1,96 seconds, which is almost 2 seconds, and in fact we lost 2 pings (remember, they are executed once every second).

After the snapshot has been created, we proceeded to remove it.

2016-02-05T22:57:02.783Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 843578 us
2016-02-05T22:57:04.633Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 1224454 us
2016-02-05T22:57:06.457Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 1180217 us
2016-02-05T22:57:08.193Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 1125963 us
2016-02-05T22:57:09.870Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 1083360 us
2016-02-05T22:57:11.517Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 1061894 us
2016-02-05T22:57:13.124Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 1014733 us
2016-02-05T22:57:14.623Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 967550 us
2016-02-05T22:57:16.131Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 1004541 us
2016-02-05T22:57:17.600Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 961571 us
2016-02-05T22:57:18.980Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 907337 us
2016-02-05T22:57:20.338Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 890165 us
2016-02-05T22:57:21.162Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 383551 us

You start to see the issue arising already. There have been multiple stun operations, and the sum of them all is 12648914 microseconds, or 12,6 seconds. As said before, around 1 second per virtual disk. As a result of this, multiple pings have been lost:

Reply from 192.168.60.159: bytes=32 time=1ms TTL=62
Reply from 192.168.60.159: bytes=32 time=2ms TTL=62
Request timed out.
Reply from 192.168.60.159: bytes=32 time=1ms TTL=62
Reply from 192.168.60.159: bytes=32 time=1ms TTL=62
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Reply from 192.168.60.159: bytes=32 time=1ms TTL=62
Reply from 192.168.60.159: bytes=32 time=2ms TTL=62
Reply from 192.168.60.159: bytes=32 time=2ms TTL=62

This is the well known situation where the virtual machine becomes unresponsive.

ESXi 6.0 U1b

Then, we repeated the same tests using the latest available version of vSphere 6. First, again, we took a snapshot:

2016-02-05T22:06:11.606Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 649684 us

Reply from 192.168.60.159: bytes=32 time=1ms TTL=62
Reply from 192.168.60.159: bytes=32 time=1ms TTL=62
Reply from 192.168.60.159: bytes=32 time=1ms TTL=62
Reply from 192.168.60.159: bytes=32 time=2ms TTL=62
Reply from 192.168.60.159: bytes=32 time=2ms TTL=62
Reply from 192.168.60.159: bytes=32 time=1ms TTL=62

The stun lasted for 0.6 seconds, and no ping has been lost.
Then, we removed the snapshot:

2016-02-05T22:10:08.276Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 5326 us
2016-02-05T22:10:08.493Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 69898 us
2016-02-05T22:10:08.714Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 63095 us
2016-02-05T22:10:08.942Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 67560 us
2016-02-05T22:10:09.166Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 65347 us
2016-02-05T22:10:09.388Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 64597 us
2016-02-05T22:10:09.610Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 63449 us
2016-02-05T22:10:09.827Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 65439 us
2016-02-05T22:10:10.054Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 69483 us
2016-02-05T22:10:10.277Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 65957 us
2016-02-05T22:10:10.501Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 64563 us
2016-02-05T22:10:10.759Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 95898 us
2016-02-05T22:10:10.983Z| vcpu-0| I120: Checkpoint_Unstun: vm stopped for 67014 us

As we can notice immediately, the single stun time per disk has been dramatically reduced from around 1 second to 65 milliseconds; this is on average a 15x improvement! The entire stun time is now 828436 microsecond, that is less than a second. As a result of this, availability of this VM over the network has also dramatically improved:

Reply from 192.168.60.159: bytes=32 time=2ms TTL=62
Reply from 192.168.60.159: bytes=32 time=2ms TTL=62
Reply from 192.168.60.159: bytes=32 time=1ms TTL=62
Request timed out.
Reply from 192.168.60.159: bytes=32 time=1ms TTL=62
Reply from 192.168.60.159: bytes=32 time=3ms TTL=62
Reply from 192.168.60.159: bytes=32 time=1ms TTL=62

Just one ping lost, in line with the duration of the stun. Impressive!

Final notes

I work too for a vendor, and I know that some times the biggest technological features of a new release are not always the same that are promoted. I usually call them “little gems”, and I love to talk about them. You don’t build a marketing campaign over them, but still they are a huge part of what makes a solution a great solution.

This enhancement in vSphere snapshot consolidation is to be honest one of the biggest improvement I can remember in vSphere releases. Promoted features like VVOLs and Multi-Processor FT have made the news during the launch of vSphere 6.0, and later the issues with CBT (change block tracking) have been the most talked topic, and all contributed to overlook this improvement.

But with CBT issues now finally solved, if you were still looking for reasons to upgrade to vSphere 6.0, this one alone in my opinion would be more then enough to justify an upgrade of all my ESXi servers to 6.0.

Huge kudos to VMware for this enhancement!

Tech 5.5 6 checkpoint commit consolidation snapshot stun vsphere

A new consolidation method

Testing the new consolidation process

ESXi 5.5 U3

ESXi 6.0 U1b

Final notes

Share this: