Scenario / Questions

After Amazon’s August 8 outage, many users found that their EBS-backed AMIs stopped working. This was due to corruption of some sectors in the snapshots that the AMIs are based on.

However, Amazon created recovery snapshots where the disk problems should be fixed. Those are named along the lines of “Recovery snapshot for vol-xxxxxxxx”.

Creating a new AMI from the recovery snapshot worked fine, but instances launched from this new AMI do not work: their state is “Running”, but I cannot SSH into the machine or access any of the web services that should be running there. It boils down to this (from the System Log, accessible through the AWS Management Console):

EXT3-fs: sda1: couldn't mount because of unsupported optional features (240).

EXT2-fs: sda1: couldn't mount because of unsupported optional features (244).

Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(8,1)

I’ve mounted a volume created from that recovery snapshot on another server on AWS, though, and there everything looks quite normal. For example, fsck says:

$ sudo fsck -a /dev/xvdg
fsck from util-linux-ng 2.17.2
uec-rootfs: clean, 53781/524288 files, 546065/2097152 blocks

In one of the AWS forum discussions, I found this advice from someone with similar problems:

A workaround will be to make a volume from the snapshot and attach
it to a running instance, use fsck --force to force the checking of
the filesystem and once cleared, you can make a snapshot and use it
for the AMI.
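For reference, the suggested workaround can be sketched end-to-end with the modern aws CLI (the original incident predates it; at the time the equivalent ec2-api-tools commands would have been used). All IDs and the availability zone below are placeholders, not values from the original post:

```shell
# Sketch of the suggested workaround -- all IDs are placeholders.

# 1. Create a volume from the recovery snapshot
#    (must be in the same AZ as the helper instance)
aws ec2 create-volume --snapshot-id snap-xxxxxxxx --availability-zone eu-west-1a

# 2. Attach it to a running helper instance
aws ec2 attach-volume --volume-id vol-yyyyyyyy \
    --instance-id i-zzzzzzzz --device /dev/sdg

# 3. On the helper instance, force a full filesystem check
#    (recent kernels expose /dev/sdg as /dev/xvdg)
sudo fsck -f -y /dev/xvdg

# 4. Snapshot the now-clean volume and base a new AMI on that snapshot
aws ec2 create-snapshot --volume-id vol-yyyyyyyy \
    --description "Recovery snapshot after forced fsck"
```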

But I don’t know how to force fsck on Ubuntu (11.04):

$ sudo fsck --force /dev/xvdg
fsck from util-linux-ng 2.17.2
fsck.ext3: invalid option -- 'o'

Does anyone know how to force a file system check on the volume on Ubuntu? Any other ideas on how to launch working instances based on the recovery snapshot?

Right now it looks like it might be quicker to just start over from a clean Ubuntu AMI and set up all our services again. 🙁 But of course I would prefer not to do that if there’s any way to get the recovery snapshot to actually work.

Below are possible solutions and suggestions for the questions above.

Suggestion: 1

I ran into the same problem when trying to duplicate a machine.

The problem turned out to be the kernel: when creating both the AMI and the instance, I had selected the default kernel image.

To resolve the problem, I recreated the AMI using the same kernel image as the original instance.
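The kernel image (AKI) that the original instance boots with can be looked up and then pinned explicitly when registering the new AMI. A sketch using the modern aws CLI (the AWS Management Console offered the same choice at AMI-creation time); all IDs here are placeholders:

```shell
# Find the kernel image (AKI) the original instance uses
aws ec2 describe-instances --instance-ids i-xxxxxxxx \
    --query 'Reservations[].Instances[].KernelId'

# Register a new AMI from the recovery snapshot, pinning that same kernel
aws ec2 register-image --name "recovered-ami" \
    --root-device-name /dev/sda1 \
    --kernel-id aki-xxxxxxxx \
    --block-device-mappings \
    'DeviceName=/dev/sda1,Ebs={SnapshotId=snap-xxxxxxxx}'
```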

Suggestion: 2

Could you try the following command (note the -f option instead of --force)?
sudo fsck -f /dev/xvdg

Hope this helps.

Suggestion: 3

I didn’t want to waste more time fighting with weird AWS-specific problems, so I created a new clean instance from one of the official Ubuntu AMIs (in my case ami-359ea941 which is a 32-bit EBS-backed image of Ubuntu 11.04 in the eu-west-1 region), and re-created my server setup there.

The fact that I could mount a volume created from the recovery snapshot in the new instance made the re-setup much faster though. For example, I did something like cp -a /mnt/recovery/usr/local /usr to restore a whole lot of stuff under /usr/local.
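cp -a (archive mode) is what makes this kind of restore safe: it copies recursively while preserving ownership, permissions, symlinks, and timestamps. A small self-contained illustration of the same pattern, using temp directories in place of the mounted recovery volume and the new instance’s root:

```shell
# Simulate a recovery volume and a target root with temp directories.
src=$(mktemp -d)   # stands in for /mnt/recovery
dst=$(mktemp -d)   # stands in for the new instance's /

# A file tree with an executable, as /usr/local on the recovery volume might hold
mkdir -p "$src/usr/local/bin"
printf '#!/bin/sh\necho hello\n' > "$src/usr/local/bin/tool"
chmod 755 "$src/usr/local/bin/tool"

# The restore step: copy the whole /usr/local tree, preserving attributes
mkdir -p "$dst/usr"
cp -a "$src/usr/local" "$dst/usr"

ls -l "$dst/usr/local/bin/tool"   # still executable after the copy
```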

So in my case the recovery snapshots were far from useless, since I could access the data on them. But of course it would still have been nicer to just create an AMI from the snapshot and continue using (instances from) it as if the whole incident never happened. (Feel free to add an answer if you know how to achieve that!)