Tuesday, April 1, 2008

.vmdk Snapshots and the Importance of CID Chains


Introduction

I spent half of yesterday troubleshooting a problem that arose from an unsynchronized VMware VI3 snapshot. It was an unpleasant experience, but it taught me the importance of CID chains!

After several hours of poking and prodding a dead VM, my colleague Dan and I finally stumbled across CID chains.

We came to the conclusion that snapshots, and specifically the delta .vmdk files VI3 creates each time it generates a new snapshot, are connected to each other in a chain via randomly assigned dynamic CID values. VI3 assigns each new snapshot file a CID, and that value changes each time the VM reboots.

If the CID chain is broken for any reason, the VM cannot mount the virtual disks assigned to it.

Problem


Your VM’s virtual disks will not mount and you receive the following error message:

Cannot open the disk ‘/vmfs/volumes/INSERT SPECIFIC VALUE HERE.vmdk’ or one of the snapshot disks it depends on.
Reason: The parent virtual disk has been modified since the child was created.

Yikes!


What do you do?!

  1. Don’t panic (yet).
  2. Shut down the VM completely, ASAP.
  3. Don’t change anything!
  4. Read this article!

You should be able to correct the problem, assuming you haven’t made any changes to the .vmdk files.

Analysis

This problem most likely arose because you took one or more snapshots of one or more of the VM's virtual disks, and those snapshots have since become unsynchronized. You are probably also running in an ESX/VI3 environment.

The root cause of the problem relates to how VI3 manages snapshots and snapshot delta changes to the original virtual disk (.vmdk). Most likely, the snapshot delta hierarchy has become unsynchronized and the various child snapshot .vmdk files associated with the original .vmdk are no longer referencing the correct parent .vmdk.

The key to piecing the snapshot delta files back together is the CID chain. The CID chain is the glue that holds the snapshot delta hierarchy together.

Each .vmdk file in the chain is assigned a CID value. Each child .vmdk also records a parent CID value and a parent .vmdk file. The parent CID value must equal the CID of the .vmdk created immediately before the child, and the parent file must be that same .vmdk.

Each .vmdk file's CID changes to a new random hex value whenever the disk is opened and then written to, which in practice means every time the VM boots. If a VM boots and the CID chain is not in the same state as it was after the previous boot, the child CID-->parent CID relationships become unsynchronized. This check is how VI3 determines the authenticity of snapshot delta files.

The only way to re-synchronize a CID chain is to edit the .vmdk files in each chain manually.

Example CID chain:

  1. Parent .vmdk (original .vmdk) = vdisk.vmdk
  2. Child .vmdk created after 1st snapshot = vdisk-000001.vmdk
  3. Child .vmdk created after 2nd snapshot = vdisk-000002.vmdk

If this chain is broken:

  1. manually point vdisk-000002.vmdk to vdisk-000001.vmdk, and
  2. manually point vdisk-000001.vmdk to vdisk.vmdk.

Three fields deserve attention in each .vmdk file, for the purposes of this discussion:

  1. The CID field,
  2. The parentCID field, and
  3. The parentFileNameHint field.

Note: The parent .vmdk does not contain a parentFileNameHint field and its parentCID field always equals "ffffffff".
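
As a rough illustration of those three fields, here is a minimal Python sketch that pulls them out of descriptor text. It is not a VMware tool: the helper names read_chain_fields and is_base_disk are made up for this example, and the sample text reuses the first child descriptor shown in the Samples section.

```python
import re

# Descriptor text for a first-level snapshot, using the values from this article.
CHILD_DESCRIPTOR = """# Disk DescriptorFile
version=1
CID=8eb633b8
parentCID=7f81b951
createType="vmfsSparse"
parentFileNameHint="/vmfs/volumes/478b9802-ce7ed955-96a4-0015c5fd9308/servername/vdisk.vmdk"
"""

# Matches only the three fields that matter for the CID chain.
FIELD = re.compile(r'^(CID|parentCID|parentFileNameHint)\s*=\s*"?([^"]*)"?\s*$')


def read_chain_fields(text):
    """Return the CID-chain fields found in descriptor text (hypothetical helper)."""
    fields = {}
    for line in text.splitlines():
        match = FIELD.match(line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


def is_base_disk(fields):
    """The original (parent) disk has parentCID=ffffffff."""
    return fields.get("parentCID") == "ffffffff"


fields = read_chain_fields(CHILD_DESCRIPTOR)
print(fields["CID"], fields["parentCID"], is_base_disk(fields))
```

Reading the fields this way only makes sense for the small text descriptor files, not for the -flat.vmdk or -delta.vmdk data files they reference.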

Samples

Sample output of parent vdisk.vmdk file:

[root@myvi3server]# cat vdisk.vmdk
# Disk DescriptorFile
version=1
CID=7f81b951
parentCID=ffffffff
createType="vmfs"
# Extent description
RW 50331648 VMFS "vdisk-flat.vmdk"
# The Disk Data Base
#DDB
ddb.virtualHWVersion = "4"
ddb.geometry.cylinders = "3133"
ddb.geometry.heads = "255"
ddb.geometry.sectors = "63"
ddb.adapterType = "lsilogic"
ddb.toolsVersion = "7202"

Sample output of child vdisk-000001.vmdk file:

[root@myvi3server]# cat vdisk-000001.vmdk
# Disk DescriptorFile
version=1
CID=8eb633b8
parentCID=7f81b951
createType="vmfsSparse"
parentFileNameHint="/vmfs/volumes/478b9802-ce7ed955-96a4-0015c5fd9308/servername/vdisk.vmdk"
# Extent description
RW 50331648 VMFSSPARSE "vdisk-000001-delta.vmdk"
# The Disk Data Base
#DDB
ddb.toolsVersion = "7202"

Note: The file path in the parentFileNameHint field in this sample points to an iSCSI SAN volume. The long hex value in the path is the volume's unique identifier, which appears as a directory under /vmfs/volumes.

Sample output of child vdisk-000002.vmdk file:

[root@myvi3server]# cat vdisk-000002.vmdk
# Disk DescriptorFile
version=1
CID=249e6aff
parentCID=8eb633b8
createType="vmfsSparse"
parentFileNameHint="vdisk-000001.vmdk"
# Extent description
RW 50331648 VMFSSPARSE "vdisk-000002-delta.vmdk"
# The Disk Data Base
#DDB
ddb.toolsVersion = "7202"

Note: The file path in the parentFileNameHint field points to a file that resides in the same directory as vdisk-000002.vmdk. If the parent file is located in a different directory, the path must say so.
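
To make the relationship between the three samples concrete, the following Python sketch walks a chain from the newest child back to the original disk and reports any link whose parentCID does not match the parent's CID. It is only an illustration under stated assumptions: the descriptors are held in memory as trimmed-down strings, parents are looked up by file name only, and the names parse and check_chain as well as the "deadbeef" value are invented for the example.

```python
import re

FIELD = re.compile(r'^(CID|parentCID|parentFileNameHint)\s*=\s*"?([^"]*)"?\s*$')


def parse(text):
    """Collect CID, parentCID and parentFileNameHint from descriptor text."""
    found = (FIELD.match(line) for line in text.splitlines())
    return {m.group(1): m.group(2).strip() for m in found if m}


def check_chain(descriptors, newest):
    """Walk from the newest child to the base disk; return a list of broken links.

    descriptors maps a file name to its descriptor text (held in memory here).
    """
    problems, name, seen = [], newest, set()
    while name not in seen:  # the seen set guards against a circular chain
        seen.add(name)
        fields = parse(descriptors[name])
        if fields.get("parentCID") == "ffffffff":
            break  # reached the original disk
        parent = fields.get("parentFileNameHint", "").rsplit("/", 1)[-1]
        if parent not in descriptors:
            problems.append(f"{name}: parent '{parent}' not found")
            break
        parent_cid = parse(descriptors[parent]).get("CID")
        if fields.get("parentCID") != parent_cid:
            problems.append(
                f"{name}: parentCID={fields.get('parentCID')} "
                f"but {parent} has CID={parent_cid}"
            )
        name = parent
    return problems


# CID values taken from the three samples in this article.
chain = {
    "vdisk.vmdk": "CID=7f81b951\nparentCID=ffffffff\n",
    "vdisk-000001.vmdk": 'CID=8eb633b8\nparentCID=7f81b951\nparentFileNameHint="vdisk.vmdk"\n',
    "vdisk-000002.vmdk": 'CID=249e6aff\nparentCID=8eb633b8\nparentFileNameHint="vdisk-000001.vmdk"\n',
}
print(check_chain(chain, "vdisk-000002.vmdk"))

# Hypothetical breakage: the first child records a stale parent CID.
broken = dict(chain)
broken["vdisk-000001.vmdk"] = broken["vdisk-000001.vmdk"].replace("7f81b951", "deadbeef")
print(check_chain(broken, "vdisk-000002.vmdk"))
```

An empty list means every child agrees with its parent. Each reported line names the child descriptor that would need editing.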

Solution

The following steps synchronize the snapshot delta files so that VI3 recognizes their relationship with each other, as well as their authenticity:

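Note: Perform the following steps in a text editor, and edit only the small text .vmdk descriptor files shown in the samples above, not the -flat.vmdk or -delta.vmdk data files they reference. Back up the original files prior to making any changes.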

Edit vdisk-000002.vmdk:

  1. Take note of the CID value in vdisk-000001.vmdk (=8eb633b8)
  2. Take note of the file path of vdisk-000001.vmdk (=local directory)
  3. Point the parentCID field to the CID value of vdisk-000001.vmdk
    --> parentCID=8eb633b8
  4. Point the parentFileNameHint field to the vdisk-000001.vmdk file
    --> parentFileNameHint="vdisk-000001.vmdk"

Edit vdisk-000001.vmdk:

  1. Take note of the CID value in vdisk.vmdk (=7f81b951)
  2. Take note of the file path of vdisk.vmdk (="/vmfs/volumes/478b9802-ce7ed955-96a4-0015c5fd9308/servername/")
  3. Point the parentCID value to the CID of vdisk.vmdk
    --> parentCID=7f81b951
  4. Point the parentFileNameHint field to the vdisk.vmdk file
    --> parentFileNameHint="/vmfs/volumes/478b9802-ce7ed955-96a4-0015c5fd9308/servername/vdisk.vmdk"
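
For readers who would rather script these edits than type them, here is a minimal Python sketch of the same two changes. It is an assumption-laden illustration rather than a supported procedure: repoint and repoint_file are hypothetical helper names, the "00000000" and "wrong.vmdk" values are invented stale data, and repoint_file expects the path as a plain string and makes a .bak copy before touching the file.

```python
import re
import shutil


def repoint(child_text, parent_cid, parent_path):
    """Return descriptor text with parentCID and parentFileNameHint rewritten."""
    text = re.sub(r"(?m)^parentCID[ \t]*=[^\r\n]*",
                  lambda _: f"parentCID={parent_cid}", child_text)
    text = re.sub(r"(?m)^parentFileNameHint[ \t]*=[^\r\n]*",
                  lambda _: f'parentFileNameHint="{parent_path}"', text)
    return text


def repoint_file(path, parent_cid, parent_path):
    """Copy the descriptor to <path>.bak, then rewrite the two fields in place."""
    shutil.copy2(path, path + ".bak")
    with open(path, newline="") as handle:  # newline="" keeps line endings as-is
        original = handle.read()
    with open(path, "w", newline="") as handle:
        handle.write(repoint(original, parent_cid, parent_path))


# In-memory demonstration with invented stale values in the second child.
stale = 'CID=249e6aff\nparentCID=00000000\nparentFileNameHint="wrong.vmdk"\n'
print(repoint(stale, "8eb633b8", "vdisk-000001.vmdk"))
```

Only the parentCID and parentFileNameHint lines are replaced. The child's own CID and every other line are left untouched, which mirrors the manual steps above.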

If the virtual disks assigned to the VM were not changed or altered in any way, the solution ends here. If the virtual disks were altered, or removed and then re-added, manually edit the appropriate virtual SCSI controller’s settings in the VM’s .vmx configuration file to point to the last child .vmdk in the snapshot chain.

Note: Perform the following steps in a text editor. Back up the original file prior to making any changes.

Edit servername.vmx:

  1. Locate the SCSI controller associated with the virtual disk that must be repaired. The first virtual disk will normally be assigned to virtual SCSI controller scsi0:0, the second to scsi0:1, the third to scsi0:2, and so on.
  2. Assuming that the second virtual disk must be repaired, locate the section of the file that references scsi0:1.
  3. Point the following field to the correct location of the last child .vmdk in the snapshot chain:
    scsi0:1.fileName="/vmfs/volumes/46fd62d479749c9697e30015c5fd9308/servername/vdisk-000002.vmdk"
    Note: The file path may vary, though snapshots are stored in the same directory as the parent .vmdk (vdisk.vmdk) by default. Even so, enter the complete path to the child .vmdk (vdisk-000002.vmdk), including any iSCSI volume GUIDs.
  4. Save the file and restart the VM.
  5. If the VM does not boot, or if it complains that "The parent virtual disk has been modified since the child was created", verify the CID chain hierarchy and file paths, then retry. If the VM still does not boot, and the CID chains and file paths are all correct, the parent or child .vmdk files were probably altered or changed in an unrecoverable way, or they may not be the correct files!
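
The .vmx change in step 3 can be sketched in Python as well. This is a hedged example only: set_disk_path is a hypothetical helper, the .vmx fragment is an invented minimal sample (a real file contains many more settings), and the code works on text in memory rather than on the live file.

```python
import re


def set_disk_path(vmx_text, device, vmdk_path):
    """Return .vmx text with <device>.fileName pointing at vmdk_path."""
    pattern = r"(?m)^" + re.escape(device) + r"\.fileName[ \t]*=[^\r\n]*"
    new_line = f'{device}.fileName = "{vmdk_path}"'
    updated, count = re.subn(pattern, lambda _: new_line, vmx_text)
    if count != 1:
        # Refuse to guess if the entry is missing or duplicated.
        raise ValueError(f"expected one {device}.fileName entry, found {count}")
    return updated


# Invented fragment of a .vmx file with two virtual disks.
VMX = (
    'scsi0:0.present = "true"\n'
    'scsi0:0.fileName = "system.vmdk"\n'
    'scsi0:1.present = "true"\n'
    'scsi0:1.fileName = "vdisk.vmdk"\n'
)

LAST_CHILD = "/vmfs/volumes/46fd62d479749c9697e30015c5fd9308/servername/vdisk-000002.vmdk"
print(set_disk_path(VMX, "scsi0:1", LAST_CHILD))
```

Raising an error when the entry is not found exactly once is a deliberate choice here: silently writing a .vmx that still points at the wrong disk would be harder to notice than a failed script.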

40 comments:

brendan.daniel said...

Thanks so much!! You are a legend! You saved me the time of rebuilding 2 servers from scratch after a failed migration between 2 separate ESX clusters and physical datastores.

Oliver O'Boyle said...

Hi Brendan,

Thanks for your comment. I'm glad the article was of use to you!

Oliver

malaysiavm said...

This article is really helpful. Although my case is a little different, I was able to use the information here to manually rebuild a vmdk file and force-commit the delta to the flat file.

Good Job guy, and thanks.

Malaysiavm

Marc said...

Thank you very much! Your explanation allowed me to fix in a few minutes what obliged me a couple of years ago to re-install a complete machine. Thanks again.
Marc

Oliver O'Boyle said...

Hi Marc,

That's great! I'm happy the solution worked for you. Thanks for leaving a comment about your success!

Oliver

Anonymous said...

Hi Oliver

A lot of valuable information here...appreciate you sharing your experience.

I want to do the opposite. I actually want to back up my /vmfs/volumes/..../vm's directory in order to maintain my delta-snapshots. I have users who are developers who take delta snapshots whenever they load a patch to their VM and they label that snapshot. They can then go into vCenter and use Snap Manager to revert to a specific "state."

The problem is I need to back up this directory while the VM is online. Therein lies my problem: the active "You are here" snapshot is locked, so it gets skipped in my backup. Is there any way around this?

Because of the parent-child relationship, I need that active delta for a DR recovery in the event they blow away the whole VM directory.

I can't do virtual disk .vmdk backups because then all of the deltas are written out to the disk. So then when I recover the vm directory, I do not have my delta-snapshots.

Any help would be appreciated.

Thanks

Oliver O'Boyle said...

Hi Anonymous :)

That's a tricky one. If you really can't shut down the VM even during off-hours, your choices are limited.

Where is the storage located? Is it on a SAN? If so, and if your SAN vendor supports volume snapshots, you could do a SAN snapshot instead. Vendors like Equallogic provide this capability in their products, but it will cost you.

Alternatively, you could try a commercial product like Backup Exec and use the Open File option. I've never tried this though, and you'll likely run into problems, if not outright corruption, at some point.

I also found this article which might be helpful. Note the product he recommends at the bottom of the page.

You could also try taking a second back-to-back snapshot and backing up the first of the two. Because the snapshots would be almost identical, you'd essentially be getting a full backup of the "You are here" state.

My preference would be the SAN snapshots though, if you have this capability.

Sorry I couldn't be of more use. Keep us posted though. I'm curious to know what your final decision will be, and how well it will work.

Oliver

Oliver O'Boyle said...

Hi Malaysiavm,

Sorry for not publishing your comment earlier, I only just noticed it now.

You're welcome, and I'm glad this helped with another problem as well!

Oliver

8c2gon said...

Great article - lost two days of mail for a client when an SVmotion went bad - had no idea there was a snapshot there, managed to get it back with this. Cheers!

Oliver O'Boyle said...

8c2gon,

That's great news! Thanks for the update.

Oliver

Anonymous said...

God bless you matey, you saved me so much time and hassle! It's a shame people aren't as willing as you to share such information.

Oliver O'Boyle said...

It's my pleasure. I've been there (I was there, which is why I wrote the article!), so I know what it's like.

I'm glad it was useful to you.

Oliver

Aaron said...

Hi Oliver,

I have a similar issue which I believe can be fixed using this method. I have migrated my VMware host from Linux (Ubuntu) to Windows, and the path of the parent disk file is wrong (i.e. /dev/hdd/disk.vmdk instead of E:\vmware\disk.vmdk) and I just need to update it.

What text editor (or other method) did you use to edit the files? The hard drive in question is about 70GB (around 50 in the parent file and the rest in the snapshot 'child' file).

thanks,
Aaron

Oliver O'Boyle said...

Hi Aaron,

I used vi on my installations. In theory, you should be able to use any text editor. However, if you've tried this and it's still not working, keep in mind that *nix systems end lines with an LF (line feed) character, while Windows systems end lines with CR+LF (carriage return plus line feed). WordPad (instead of Notepad) should allow you to manage these special ASCII characters.

Let us know if that helped any.

Oliver

Aaron said...

Hi Oliver,

I tried vim for Windows to no avail; Notepad, Notepad++ and WordPad all gave "file too big" errors. I did, though, find a text editor which worked for me! EmEditor uses a temp directory when opening files, and it opened the file fairly quickly. This did the job; I am just waiting for it to save the changes back to the disk, which is taking a little longer!

Oliver O'Boyle said...

Aaron,

Excellent! I'm curious to see if that will do the job.

Oliver

Anonymous said...

Just saved me some serious DR... thanks buddy!

Oliver O'Boyle said...

Anytime :)

Shan said...

Hi,

This is really informative.

I have to merge a delta disk into its base disk. What command or tool should I use? Snapshots are not available. I want to build the disk from the available base + delta disk and attach it to a VM.

Your answer is awaited and appreciated.

Thanks.

Shan

Anonymous said...

Thanks man for your incredible document. It saved me a lot of hours.
My VM is running smoothly again.

Oliver O'Boyle said...

Shan,

Sorry for the delay in replying, I've been busy and haven't had a chance to test this yet.

It sounds like you don't have full access to your VM's. Have you tried cloning the VM that has delta vmdk files?

My test shows that this causes the delta files to merge into a single file.

Let me know if that's what you were looking for.

Oliver

Oliver O'Boyle said...

Thanks Anonymous :)

Anonymous said...

Many thanks, sharing your experience saved me a lot of time in solving my problem.

Luke

Oliver O'Boyle said...

Thanks, Luke. I'm glad I was able to help so many people!

Oliver

Stephan said...

Hi Oliver,

Thanks a lot, it saved me a lot of time and work. Your solution worked for VMware Fusion as well; only the last part was different (SCSI --> IDE).

Really helpful.

Cheers,
Stephan

Oliver O'Boyle said...

Stephan,

Thanks for the follow up! That's good to know.

Oliver

Anonymous said...

Can I manually update the CID references in this situation ... I deleted snapshot 4 out of 8 accidentally. My vm won't currently boot. I cannot recover the deleted file with any tools I have found so far. Please reply to my email dwsalmon @ rochester . rr . com

Oliver O'Boyle said...

dwsalmon,

I don't believe that will work at all. Though the snapshots are linked with CID chains, the data contained within them still needs to match up and make sense. If you really wanted to get all forensic, you might be able to extract the data from each remaining file using a binary editor, but I've never tried this and can't even say if it would actually work. Even then, you would be missing data.

When you delete a file, the file isn't immediately removed from the disk. Its first character is replaced with a special character and the rest of the file and name remain intact until another file needs the space on disk. When the space is needed, the OS will then overwrite the deleted file with the new data.

You may still be able to recover the deleted files using a technique similar to this, or you could look for a commercial product that does the same thing. However, keep in mind that the longer you take to recover these files, the more likely you are to be unable to recover them!

Good luck and I hope that helps.

Oliver

Matt Hinson said...

Hi Oliver,

What text editor did you use? I tried to open up my vmdk files with notepad and just got a bunch of garble.

Thanks,

MDH

Oliver O'Boyle said...

Matt,

I used vi from the hypervisor console.

Notepad is probably adding extra CR or LF characters. If your ESX server is installed on bare metal (not sitting on top of a host OS like Windows or Linux) then you should just edit the files from the console itself.

Oliver

Anonymous said...

Hi Oliver,

Thanks for your article, but extra thanks for your later comment, where you suggested cloning the VM to merge the delta disks :) I was already losing hope and falling back to a new-vdisks-and-Ghost solution.

Keep up the good work :)
Elvis

Oliver O'Boyle said...

Hi Elvis,

Thanks for the feedback. I'm glad to hear that things worked out for you too!

Oliver

Vidar said...

Thank you, thank you, thank you. Saved us hours of having to rebuild a server with some complex apps running on it.
We had a power outage, and the VM server came back online without several of the virtual machines in inventory. We went to the datastore and added these VMs to inventory, but one of them had a messed-up CID chain. This fixed it perfectly.

Oliver O'Boyle said...

Hi Vidar,

You're welcome! I'm glad it worked for you :)

Oliver

Anonymous said...

Thanks Oliver. This process helped me recover from a near disaster.

Oliver O'Boyle said...

Glad it worked for you, Anonymous!

O.

Jim said...

You got me out of a massive hole.
Thanks for your fab article.

Oliver O'Boyle said...

Hi Jim,

I'm glad it was helpful and that people are still using it!

Oliver

JimiSweden said...

Hi Oliver, thanks for this great post. As a complement I would like to add this for those who do not know how to edit, download and upload the files:

If you are not able to access the ESXi server console and don't want to mess with vCLI vifs, you can browse the data store.
From the web based data store browser "Browse datastores in this host's inventory" at your servers address httpS://
From here you can download the configuration .vmdk files, in vSphere Client the flat.vmdk/delta.vmdk and their respective configuration file are shown as one and hence you cannot download only the configuration file. But you can use the vSphere Client datastore browser to upload the edited configuration file.
I used this approach and it worked out fine.
Notepad++ is my editing tool.

Thanks again,
JimiSweden

Oliver O'Boyle said...

Hi JimiSweden,

Thanks for the comment and added information. I'm glad to see the post is still helpful to people!

Oliver