Svndumpfilter with a lot of files

Once upon a time, I decided it was a good idea to put my photos in a Subversion repository, together with the rest of my website. After about 8 years of digital photography, I have collected over 15,000 images, and the repository now takes up 28 GBytes of disk space.

While trying to convert the Subversion repo to Git, I ran into problems with corrupted objects that I haven’t been able to fix. After trying different things for a while, I started looking for ways to delete all the images from the Subversion repo before converting it to Git. Enter svndumpfilter.

Svndumpfilter is a simple tool that reads a Subversion dumpfile (created with ‘svnadmin dump’) on standard input and writes a new dumpfile to standard output, with either certain paths filtered out (exclude) or only certain paths kept and the rest filtered out (include). Paths are matched by prefix, and the prefixes to include or exclude are given on the command line.

At first, the fact that svndumpfilter cannot take any form of pattern (glob, regexp) to match files seems to make it hard to filter out 15000 JPEGs (and only the JPEGs) scattered over more than 200 directories. However, this turns out to be easier than you’d think.

On several websites, I found svndumpfilter examples that read the path prefixes from a text file:

svndumpfilter exclude `cat filter.txt` \
< repos.dump > repos-filtered.dump

I simply decided to see what would happen if my filter.txt file contained every file that I wanted excluded (all 15910 of them) as a separate path prefix. A Subversion dumpfile lists the path of every node it contains on a line starting with ‘Node-path: ’, so I just used some creative grepping to create my filter file:

grep Node-path repos.dump | grep -i 'photo/.*\.jpg' | \
sed 's/Node-path: //' > filter.txt
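A slightly more defensive variant of this pipeline might look like the sketch below. It is not what I actually ran, and the helper name build_filter is made up for illustration. It assumes a grep that supports -a (GNU and BSD grep both do), which forces the dumpfile to be read as text even though it embeds binary image data. It also anchors the pattern to the end of the path, so that something like photo.jpg.txt is not caught, and removes duplicates, since a path shows up once for every revision that touched it.

```shell
# build_filter DUMPFILE  (hypothetical helper, not part of Subversion)
# Prints every path ending in .jpg below a photo/ directory, one per line.
build_filter() {
    # -a: read the dump as text, despite the binary file contents inside it
    LC_ALL=C grep -a '^Node-path: ' "$1" |
        # strip the header name, leaving only the path
        sed 's/^Node-path: //' |
        # keep JPEGs only; -i also matches .JPG, $ anchors the extension
        LC_ALL=C grep -i 'photo/.*\.jpg$' |
        # a path appears once per revision that changed it; list it once
        LC_ALL=C sort -u
}
```

It would be used as: build_filter repos.dump > filter.txt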

Next, I ran the svndumpfilter command listed above, and to my great surprise, it ran without complaint. I now have a file called repos-filtered.dump of a mere 1.6 GBytes. After loading that dump with ‘svnadmin load’ (which also went flawlessly), I ended up with a Subversion repository of 1.5 GBytes. Wonderful!
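One plausible reason this worked is that the operating system allows a lot more command-line data than you might expect. The limit (shared by the arguments and the environment) can be queried with getconf ARG_MAX. The sketch below is a rough check of whether a prefix list would fit. The helper name fits_on_command_line is made up for illustration, and the comparison is approximate: it ignores per-argument overhead and any per-argument size limit a system may impose.

```shell
# fits_on_command_line FILE  (hypothetical helper)
# Succeeds if the contents of FILE, passed as arguments, should roughly
# fit within the system's limit for arguments plus environment.
fits_on_command_line() {
    # bytes the prefixes take up; $(( )) strips padding some wc versions add
    list_bytes=$(( $(wc -c < "$1") ))
    # the environment counts against the same limit
    env_bytes=$(( $(env | wc -c) ))
    # system-wide limit in bytes
    arg_max=$(getconf ARG_MAX)
    echo "prefixes: $list_bytes bytes, environment: $env_bytes bytes, limit: $arg_max bytes"
    [ $((list_bytes + env_bytes)) -lt "$arg_max" ]
}
```

It would be used as: fits_on_command_line filter.txt && echo "should fit"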

Next step: see if this new repository will let me convert it to Git.