Today I became suspicious of everything (part 3)

This is part of the hard disk recovery documentation.

Part 3.


Today I became suspicious of (the ext2ifs driver, the mkfs command, the USB enclosure, and basically) everything

On Christmas morning Santa Claus had not granted my wish: ddrescue was still running, but the image file’s timestamp had not advanced since I left it, and the damaged drive had spun down by itself. dmesg revealed a syslog message, “too many IO errors” or something like that, which had caused Linux to give up on reading from the damaged drive. I was very frustrated because, well let’s see, I had expected the disk imaging to make good progress, but instead… I must suffer a reboot and the indefinite re-churning of the drive that comes with it, with even more data loss! What.

Attempting to reboot, I got impatient and tried the Dell Diagnostic partition again. To my dismay, the Dell Diagnostic partition had become unreadable, a sure sign that the disk’s failures were worsening, adding more urgency to the recovery mission. With no viable second option, I booted Knoppix again and resigned myself to the disk churning. Finally I was in again and started ddrescue back up at about 4000MB, to skip the errors at 3200MB.

/media/sda2# ddrescue -B -n -i4000M /dev/hda rescue.image rescue.log

At some point I had to get a replacement hard drive. 80GB seemed to be the right price point, and somehow I narrowed it down to just two “choices”: a Seagate Momentus 5400.3 (with Perpendicular Recording Technology) believe it or not, and another one by Fujitsu, so it wasn’t much of a choice. The Seagate had already received reviews noting bad sectors (!) and I really did not want another one of those. I also rang Dell to see if the parts would be interchangeable (mostly on the question of the CD drive). Unfortunately, Dell would not give me a straight answer, but after insisting on getting my personal information, they told me “your warranty has expired,” “but you have unlimited lifetime phone support,” which turned out to be a person on the other end putting me on hold for minutes at a time, every time I asked a question, to go read a manual, and then coming back with some inane suggestion. For example, I was told to repeat things I said I had already tried, in particular to go into the Dell Diagnostic partition, which of course was already dead. His eventual diagnosis was, “it has been determined that, sir, you have a hardware problem, not a software problem, and due to your warranty has already expired, you may get a new drive from anywhere, just make sure to get a 40GB IDE disk” (presumably to match the original). So I wrote off the wasted hour on that call as sunk cost, and went ahead with the Fujitsu drive.

Meanwhile, ddrescue continued to stick briefly at roughly 1GB intervals of data transferred, sometimes less. Sometimes I would manually interrupt the process and restart it at a later point. ddrescue allows this with no problem because it uses its log intelligently to patch together the various chunks of recovered data, so that validly transferred data was never reread and interruptions cost the recovered image nothing. Wishing to avoid a repeat of the inextricable-3200MB-error from Sunday night, I interrupted ddrescue whenever it seemed to be stuck for long. I noted the following sticking-points:

3277MiB
9798MiB
10415MiB
10917MiB
11906MiB
13885MiB
24499MiB
25890MiB
27394MiB
27851MiB
31280MiB
32799MiB
33748MiB
35065MiB
36954MiB

The data corruption pattern covers the entire disk (so far as logical offsets have any physical correspondence) but affects only a small percentage of the disk’s data. In fact, the recovery-to-error ratio was at least 100:1 over the regions read. I eventually transferred about 38GB of raw bits out of the 40GB disk.

By the way, the -n option demands “no error splitting”, meaning that a 64K-chunk of data with any read problems is marked as wholly erroneous (or “/” in the log file… good data is “+” in the log file). After the runs with -n, I did a run with error splitting on a small portion of the disk, forcing ddrescue to analyze only the 64K-chunks that had errors on the original run. With this method, I recovered yet more data (“/” became mostly “+” and some “-”).
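To see how much of the image sits in each state, the log file can be tallied by status character. What follows is only a minimal sketch: it assumes the log layout as I understand it (comment lines starting with “#”, one current-position line, then “pos size status” lines in hex), and the sample log text is invented for illustration, not taken from my actual rescue.log.

```python
from collections import defaultdict

# Invented sample in the assumed ddrescue log layout (not real recovery data).
SAMPLE_LOG = """\
# Rescue Logfile. Illustrative sample only
# current_pos  current_status
0x00030000     +
#      pos        size  status
0x00000000  0x00010000  +
0x00010000  0x00010000  /
0x00020000  0x00000200  -
0x00020200  0x0000FE00  +
"""


def parse_blocks(text):
    """Return (pos, size, status) tuples from ddrescue-style log text."""
    rows = [line.split() for line in text.splitlines()
            if line.strip() and not line.lstrip().startswith("#")]
    # Assumption: the first non-comment line is the current-position line,
    # so only the lines after it describe blocks.
    return [(int(pos, 0), int(size, 0), status)
            for pos, size, status in rows[1:]]


def bytes_by_status(blocks):
    """Sum the block sizes for each status character ('+', '/', '-', ...)."""
    totals = defaultdict(int)
    for _pos, size, status in blocks:
        totals[status] += size
    return dict(totals)


def ranges_with_status(blocks, wanted):
    """List (pos, size) for blocks in one state, e.g. '/' = not yet split."""
    return [(pos, size) for pos, size, status in blocks if status == wanted]


blocks = parse_blocks(SAMPLE_LOG)
totals = bytes_by_status(blocks)
unsplit = ranges_with_status(blocks, "/")
print(totals)
print(unsplit)
```

The “/” ranges are the ones a later error-splitting pass would revisit, so listing them gives a rough idea of how much churning is still ahead.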

TRYING THE RAW IMAGE UNDER WINDOWS

Before attacking more chunks with error-splitting, I decided to examine the current state of the disk image. There would be no point in trying very hard on the bad parts if the useful files were already contained in the “+” regions. Just from the log file, I was pleased to find the first 5K intact (this includes the first sector holding the MBR and partition table), then many errors in the front of the drive which were in the Dell Diagnostic partition, followed by several hundred MB intact, which would include the boot sector of the NTFS partition and, hopefully, the front of its master file table. A chunk of the end of the drive was also intact, which should contain the backup NTFS boot sector. Things were looking much better.
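Since the first sector survived, the partition table can be read straight out of the image. Here is a rough sketch of decoding the four classic MBR entries, assuming the standard layout (table at byte 446, 16 bytes per entry, 0x55AA signature at byte 510). The demo sector is synthetic, and its partition sizes are made up rather than copied from my disk.

```python
import struct

SECTOR = 512
TABLE_OFFSET = 446   # the partition table starts here in sector 0
ENTRY_SIZE = 16
# boot flag, 3 CHS bytes (skipped), type, 3 CHS bytes (skipped),
# first LBA, sector count; all little-endian.
ENTRY_FORMAT = "<B3xB3xII"


def parse_mbr(sector0):
    """Return (type_byte, first_lba, sector_count) for each used entry."""
    if len(sector0) < SECTOR or sector0[510:512] != b"\x55\xaa":
        raise ValueError("no MBR boot signature")
    entries = []
    for i in range(4):
        start = TABLE_OFFSET + ENTRY_SIZE * i
        raw = sector0[start:start + ENTRY_SIZE]
        _flag, ptype, first_lba, count = struct.unpack(ENTRY_FORMAT, raw)
        if ptype != 0:          # type 0 marks an unused slot
            entries.append((ptype, first_lba, count))
    return entries


def make_entry(ptype, first_lba, count):
    """Build one synthetic table entry for the demo below."""
    return struct.pack(ENTRY_FORMAT, 0, ptype, first_lba, count)


# Synthetic sector: a small utility partition (type 0xDE) followed by an
# NTFS-style partition (type 0x07). The numbers are illustrative only.
demo = bytearray(SECTOR)
demo[446:462] = make_entry(0xDE, 63, 1000)
demo[462:478] = make_entry(0x07, 1063, 5000)
demo[510:512] = b"\x55\xaa"

partitions = parse_mbr(bytes(demo))
for ptype, first_lba, count in partitions:
    print(hex(ptype), "starts at byte", first_lba * SECTOR, "sectors:", count)
```

Against the real thing, the input would be the first 512 bytes of rescue.image, and each partition’s boot sector should then sit at first_lba × 512 bytes into the image.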

So I next planned to mount the image under Windows, since many NTFS analysis tools run only on Windows. To do this, I needed a filesystem driver to read ext2 under Windows. There are a few, but the “ext2 Installable File System for Windows” or ext2ifs boasts a kernel-mode extension of the Windows file system that “is indeed comparable to Windows NT’s native file system drivers”. Not really, as the implementation is missing some behavior one would expect from Windows, but it does behave in such a way that most of the usual higher-level filesystem manipulations can be done directly on the ext2 volume, including getting real drive letters and the ability to perform file manipulation directly (without requiring a specialized file manager).

First I wanted to make a copy of the image file, for two reasons: to have a copy to work with without endangering the recovered data, and to put the original back into ddrescue to churn away at erroneous chunks according to which chunks turn out to be important. In retrospect I could have made this copy under Linux, but since I had ext2 already mounted in Windows, I just made the copy under Windows. While it was doing this copying, guess what, the source drive (the external USB drive) became unreadable. Uh……

I took the drive with the 38GB of good data back into Knoppix Linux, and Linux could not mount it either. Uh……

# dmesg | tail

“Corrupt group descriptor: bad block for inode bitmap” etc. etc. Uh……

What did the ext2ifs driver do? There is not a single reason it should have touched the group descriptor table.

# e2fsck /dev/sda2

“e2fsck: Bad magic number in super-block while trying to open /dev/sda2
The superblock could not be read or does not describe a correct ext2 filesystem.” etc. etc. Uh……

What did mkfs.ext2 do? There are backups of these file system parameters, why are they all bad now?

What did the USB enclosure do? It didn’t barf all over the disk, did it?
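For reference, the “bad magic number” complaint boils down to a two-byte check, which can be repeated by hand against the primary superblock and any backup. The sketch below assumes the usual ext2 layout (primary superblock 1024 bytes into the partition, magic 0xEF53 stored 56 bytes into the superblock) and, for the backup, a hypothetical filesystem with 1K blocks and 8192 blocks per group, which puts the first backup at block 8193. Other geometries put it elsewhere, and the demo runs on an in-memory fake rather than on /dev/sda2.

```python
import io
import struct

EXT2_MAGIC = 0xEF53
PRIMARY_OFFSET = 1024      # primary superblock starts here in the partition
MAGIC_FIELD = 56           # s_magic sits 56 bytes into the superblock
# Assumed geometry for the demo only: 1 KiB blocks, 8192 blocks per group.
BACKUP_1K = 8193 * 1024


def has_ext2_magic(dev, sb_offset=PRIMARY_OFFSET):
    """Return True if a superblock at sb_offset carries the ext2 magic."""
    dev.seek(sb_offset + MAGIC_FIELD)
    raw = dev.read(2)
    return len(raw) == 2 and struct.unpack("<H", raw)[0] == EXT2_MAGIC


# Fake partition: primary superblock zeroed out, first backup still intact.
image = bytearray(BACKUP_1K + 1024)
struct.pack_into("<H", image, BACKUP_1K + MAGIC_FIELD, EXT2_MAGIC)
dev = io.BytesIO(bytes(image))

primary_ok = has_ext2_magic(dev)
backup_ok = has_ext2_magic(dev, BACKUP_1K)
print("primary:", primary_ok, "backup:", backup_ok)
```

In the fake image the backup survives, which is the case the backups exist for. My own volume did not seem to be so lucky.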

This completely goes against the original intention of having a good copy of the data! And at this point I need to fork a sub-project to work on recovering the ext2 volume, because I don’t think I can read all 38GB of the already-recovered data back out of the dead Seagate again. Argh!! I called it a night.

Lessons today:

  • I shouldn’t fork the project, I should just bork it altogether. The gods clearly aren’t working with me.
  • Of course I should keep working at this. I just need to use proven technology when making a crucial file copy.
  • ddrescue is good about skipping over good areas of the disk already scanned, but isn’t intelligent about scanning the bad areas of the disk.
  • Dell support is truly useless.

On to Part 4.
