directory with zillions of files

Posted by: DWallach

directory with zillions of files - 18/01/2011 19:26

Here's an oddball one.

I just got a giant Blu-ray disc in the mail with a large number of scanned ballots (TIFF files). So far as I can tell, they're all in one single directory, hundreds of thousands of them, and the disc holds 21GB of data in total. Tools like 'ls' and whatnot just don't work because they take too long to run. Just to make things more fun, there appear to be some hard read errors toward the end of the disc.

After much pain, 'dd conv=noerror' appeared to let me cleanly extract an ISO image from the Blu-ray, so now I'm dealing with a local ISO image on my hard drive. Clearly, I need to extract the files into a bunch of subdirectories.
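Roughly, the incantation looked like the following. The device path is a guess, and conv=sync is an extra flag worth adding so failed sectors get zero-padded instead of silently shifting file offsets; the runnable part below just demonstrates the flags on a scratch file:

```shell
# Against the real disc, something like (device path is hypothetical):
#   dd if=/dev/rdisk1 of=ballots.iso bs=2048 conv=noerror,sync
# conv=noerror keeps reading past hard errors; sync pads short or
# failed reads with zeros so the image stays sector-aligned.
#
# Same flags demonstrated on a scratch file:
printf 'sector data' > /tmp/dd_demo_src
dd if=/tmp/dd_demo_src of=/tmp/dd_demo.iso bs=512 conv=noerror,sync 2>/dev/null
wc -c < /tmp/dd_demo.iso
```

With conv=sync, the 11-byte input gets padded out to one full 512-byte block, which is exactly the behavior you want when a read fails mid-image.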

The problem is getting to the point where I even know the damn file names. I've mounted the ISO, I go into the directory, and 'ls' again just sits and spins, completely consuming one core of my CPU, in kernel space. Surprisingly, even 'find . -print' is unhelpful. All it prints is "." before the kernel again gets slammed. After ten minutes, it hasn't printed a single file name.

So... any advice on how I can extract these files from the ISO image?
Posted by: peter

Re: directory with zillions of files - 18/01/2011 19:32

Recent libcdio comes with iso-info and iso-read tools that might help. Even if you end up needing to debug your image, at least using them means you'd be debugging in userland.

Peter
Posted by: DWallach

Re: directory with zillions of files - 18/01/2011 19:41

iso-read seems to insist that you know the filename you're trying to get out. It doesn't have a 'dump everything' option. Grumble. I really didn't want to be hacking libcdio just to get at my damn files.
Posted by: tonyc

Re: directory with zillions of files - 18/01/2011 19:41

ls on many platforms sorts all entries by default. Try ls -U, maybe?
Posted by: tman

Re: directory with zillions of files - 18/01/2011 19:43

What OS are you doing this on?

I assume the incredibly long wait for nothing is caused by the errors.

The UDF tools package is unmaintained now, and there's no fsck.udf tool either. There was a UDF verifier tool from Philips Research, but that seems to have vanished, and it didn't fix anything anyway.
Posted by: mlord

Re: directory with zillions of files - 18/01/2011 19:44

Instead of ls, try this: echo *

-ml
Posted by: DWallach

Re: directory with zillions of files - 18/01/2011 19:51

This is on my iMac (Core 2 Duo, Mac OS X 10.6.5). I mounted the file with the default options (i.e., I just ran "open" on the ISO file). I'm trying "echo *" right now and it's just sitting there, again with one core slammed to 100% in the kernel, the other core idle.

Any thoughts on tooling to extract the files within?
Posted by: Taym

Re: directory with zillions of files - 18/01/2011 19:52

... or you could use the RAR command line, on any OS, without needing to mount the ISO at all: just have RAR operate on the .ISO file itself, in the file system.

Edit:
I was assuming a *NIX OS. I don't know if RAR is available on a Mac. If not, you could just copy the .ISO to some Windows box you have around.
Posted by: mlord

Re: directory with zillions of files - 18/01/2011 19:57

Mmmm.. yes, I suppose "echo *" is really no improvement, since the shell has to read (and sort) the entire directory to expand the glob before it can print anything. Duh. smile

Best bet is a simple C program to break them up, using readdir() to walk the directory.

I can give you a basic readdir program that you can hack away at, if you like.
Posted by: mlord

Re: directory with zillions of files - 18/01/2011 20:17

Originally Posted By: mlord
I can give you a basic readdir program that you can hack away at, if you like.

This could be done more robustly, using execve() rather than system(), but it's probably good enough:

Code:
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <dirent.h>
#include <string.h>
#include <errno.h>

static void do_something (const char *dpath, const char *this_name)
{
        char tmp[16384];  /* BIG dumb buffer, rather than trying to be clever */
        /*
         * Insert code here to do whatever with "this_name".
         * Here is an example of how to do something:
         */
        const char destination[] = "/tmp";
        /* Note: assumes filenames contain no quotes or other shell metacharacters. */
        snprintf(tmp, sizeof(tmp), "/bin/cp \"%s/%s\" \"%s\"", dpath, this_name, destination);
        printf("===> %s\n", tmp);
        fflush(stdout);
        if (-1 == system(tmp))
                perror(tmp);
}

int main (int argc, char *argv[])
{
        const char *dpath;
        struct dirent *de;
        DIR *dp;

        if (argc != 2) {
                fprintf(stderr, "Usage: %s <dir>\n", argv[0]);
                return 1;
        }
        dpath = argv[1];
        dp = opendir(dpath);
        if (dp == NULL) {
                perror(dpath);
                return 1;
        }
        errno = 0;
        while ((de = readdir(dp)) != NULL) {
                const char *this_name = de->d_name;
                if (strcmp(this_name, ".") && strcmp(this_name, "..")) {
                        printf("%s\n", this_name);
                        fflush(stdout);
                        do_something(dpath, this_name);
                }
                errno = 0;  /* for next readdir() call */
        }
        if (errno) {
                perror(dpath);
                return 1;
        }
        return 0;  /* all done */
}
Posted by: mlord

Re: directory with zillions of files - 18/01/2011 20:18

Actually, that's more or less how the find command does things..
I wonder if a plain old find . works?
Posted by: DWallach

Re: directory with zillions of files - 18/01/2011 20:20

Based on your idea, I wrote my own C program to do the same basic thing. Now I've got all 223K filenames. Next task is to see what I want to do about them.

As I said up top, "find . -print" failed. I have no idea why.
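(Sketch only, not the program actually used here: once the files are on a normal local filesystem, splitting them into subdirectories is just arithmetic on a counter. The bucket size of 1000 and the paths below are made up for illustration; the demo populates a tiny scratch directory first.)

```shell
# Move files from a flat directory into numbered buckets of 1000 each.
src=/tmp/flat_demo
mkdir -p "$src"
for i in 1 2 3; do : > "$src/ballot_$i.tif"; done   # tiny demo population

n=0
for f in "$src"/*.tif; do
    bucket=$(( n / 1000 ))               # integer division picks the bucket
    dir=$(printf '%s/%04d' "$src" "$bucket")
    mkdir -p "$dir"
    mv "$f" "$dir/"
    n=$(( n + 1 ))
done
ls "$src/0000"
```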
Posted by: peter

Re: directory with zillions of files - 18/01/2011 20:23

Originally Posted By: DWallach
iso-read seems to insist that you know the filename you're trying to get out. It doesn't have a 'dump everything' option. Grumble. I really didn't want to be hacking libcdio just to get at my damn files.

iso-info with -f or -l to get the filenames, then iso-read to get them out.
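Concretely, something like this (flag spellings from the libcdio man pages, worth double-checking against your installed version; the filename is a placeholder):

```shell
# Dump every filename in the image, find-style:
iso-info -i ballots.iso -f

# Then pull files out one at a time:
iso-read -i ballots.iso -e /SOMEFILE.TIF -o SOMEFILE.TIF
```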

Peter
Posted by: mlord

Re: directory with zillions of files - 18/01/2011 20:27

Cool.
Posted by: DWallach

Re: directory with zillions of files - 18/01/2011 20:33

The extraction is running now. I'm modestly curious why "find" didn't work, but that's a question for another day.
Posted by: wfaulk

Re: directory with zillions of files - 18/01/2011 21:39

"dmesg" might have provided some reasonably useful output. Also the other syslog-type logs. An "strace" or equivalent ("dtruss" in MacOS X 10.6) might have also been helpful.
Posted by: tman

Re: directory with zillions of files - 18/01/2011 22:25

Who on earth made this disc anyway? Did they use some sort of packet-writing software and just keep dumping files onto it?
Posted by: gbeer

Re: directory with zillions of files - 19/01/2011 01:02

Originally Posted By: tman
Who on earth made this disc anyway? Did they use some sort of packet-writing software and just keep dumping files onto it?


The first post said "ballots" and "TIFF", so it's likely the output from a scanner-type electronic voting machine.
Posted by: DWallach

Re: directory with zillions of files - 19/01/2011 14:39

Syslogs were not very helpful. They indicated I/O errors where there were unreadable bits of the disk. That's about it.

Who made this? A municipality that was providing me with copies of its ballots. I'm grateful for the data (now nicely segregated into hundreds of subdirectories). But what a pain to get at it.

For what it's worth, there's absolutely no standards document of any kind that indicates how somebody might give you digital scans of a million ballots.
Posted by: tfabris

Re: directory with zillions of files - 19/01/2011 18:15

Originally Posted By: gbeer
The first post said "ballots" and "TIFF", so it's likely the output from a scanner-type electronic voting machine.


Ah, I see that the security at Diebold is up to its usual high standards of quality.
Posted by: DWallach

Re: directory with zillions of files - 19/01/2011 18:29

I'm honestly not sure what vendor this particular data set came from, but I'm pretty sure it's not Diebold. What's amazing is that the scans are one-bit (black and white) and fairly low resolution (you can't read most of the text printed on the ballot). This makes it difficult if you want to write a post-facto ballot scanner.
Posted by: gbeer

Re: directory with zillions of files - 20/01/2011 02:18

Sounds like what you have been given is designed to meet the letter of the law.

That could make it tough if they were handing out randomized ballots, like when Schwarzenegger was first elected, and there were something like 50 names on the ballot for Governor.

Are you auditing these for your own entertainment, or for someone else?
Posted by: DWallach

Re: directory with zillions of files - 20/01/2011 16:39

I'm working with some attorneys who are in the midst of a public interest lawsuit. I can't really get into the details right now. Suffice to say that I may need to crunch my way through a whole lot of these ballots.
Posted by: canuckInOR

Re: directory with zillions of files - 20/01/2011 21:57

Originally Posted By: DWallach
scans are one-bit (black and white) and fairly low resolution (you can't read most of the text printed on the ballot).

Can you post one?
Posted by: DWallach

Re: directory with zillions of files - 21/01/2011 03:50

Not right now. Maybe later.
Posted by: canuckInOR

Re: directory with zillions of files - 21/01/2011 16:05

No prob. I figured you might be under an NDA, or something, but was curious...