Tuesday, October 10, 2006

Optimizing Multimedia and Backup Storage with Hard or Soft Links

Here are the notes on this subject from my Server Log, with some annotations. Basically, hard links are a quick fix for freeing up the space taken by duplicate files.

I had md5 files generated for everything that was copied onto the server. Using the shell commands sort, grep, and uniq on those checksums, I was able to reclaim a lot of space from files that had been copied over twice; the duplicates came from using WinMerge to prepare for deleting the backups that were on one of the hard drives that went into the RAID.
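
As a rough sketch of what I mean (reconstructed for this post, not copied from the log), assuming md5sum-style checksum files named *.md5 whose lines look like "<32-character digest>  <path>":

# Sort every checksum line by digest, then print each group of lines whose
# first 32 characters (the MD5 hash) appear more than once.
cat *.md5 | sort | uniq -w32 --all-repeated=separate

Each group in the output is a set of files with identical content; grep can then narrow the list down to a particular directory before deciding what to delete or link.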



I was about to start playing with FSlint, but by chance came across a Perl program called dupseek. I had been looking for a script that would replace one of two duplicate files with a link.



As far as I can tell, this is an excellent program with a well thought-out algorithm for finding duplicate files on a Unix system (based on personal experience and on what it says on its page), but more importantly to me, it has a function for creating Unix soft links in place of the duplicate files. Although it's text-mode, this is the best program I've used for dealing with duplicate files. And text-mode is just fine! This is a real life-saver because I don't want duplicates sitting on the file system, yet I'd like to keep some files cross-referenced across directories. Also, outright removing duplicates in directories that are already backed up would make the copy on hard disk appear to have fewer files than the one on CD; replacing them with links avoids that mismatch.
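
For reference, making that replacement by hand on a single pair of files would look something like this (the paths here are invented for illustration, not taken from dupseek's output):

# downloads/song.mp3 and singles/song.mp3 are byte-for-byte identical;
# drop the copy in downloads/ and leave a symbolic link in its place.
rm downloads/song.mp3
ln -s ../singles/song.mp3 downloads/song.mp3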


Hard links would really be better (for me), because soft (symbolic) links in Unix probably raise some compatibility issues when, for instance, trying to put the directory on a CD: unless they are interpreted correctly, the links are just files. With hard links, however, the same file system inode is simply referenced from two different directory entries. Since the directories in question won't be changing, this wouldn't raise the usual hard-link issues (deleting one of the names, separating the directories onto different file systems, etc.).
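
To hard-link the same pair instead (again with invented paths), the second name simply becomes another reference to the first file's inode:

rm downloads/song.mp3
ln singles/song.mp3 downloads/song.mp3
ls -i singles/song.mp3 downloads/song.mp3   # both names show the same inode number
stat -c '%h %n' singles/song.mp3            # GNU stat: the link count is now 2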


In fact, gnomebaker currently has a bug where soft links are dereferenced when computing the size of the CD image, but the link file itself is what gets put onto the CD.


I was able to save less than 10 GB by tracking down the duplicate media files that came from combining directories before the big move to the server RAID, and more than 2 GB by using dupseek.


http://www.pixelbeat.org/fslint/
http://www.beautylabs.net/software/dupseek.html


Notes:

You can use a text file, generated yourself or by filtering the report produced by the “-b report” function of dupseek, which contains the filenames to be removed. You can pipe these names to xargs, which calls rm. This is useful when files have been copied to more specific directories and many duplicates lie in a general directory (e.g. “downloads” vs. “singles”).


cat [name of report file] | grep "/Downloads/" | tr '\n' '\0' | xargs -0 rm


This filters the report down to files whose path includes “/Downloads/”, replaces the newline characters with null characters, and has xargs pass each resulting name to rm. That removes the duplicate files sitting in the common directory (the one you want to clean out, not preserve). Note that this must be executed from the same directory the report's relative filenames were generated against (so that they resolve correctly). For safety, try replacing rm with ls before you do anything, to make sure you're about to remove the right files:


cat [name of report file] | tr '\n' '\0' | xargs -0 ls | less


Only after checking over this output should you hit q and then run:


cat [name of report file] | tr '\n' '\0' | xargs -0 rm


Note: A much safer plan is to use the interactive mode of dupseek, or the FSlint GUI (which takes hard links into account). The interactive mode of dupseek got too repetitive, so in one case I just used it to identify duplicates; in general, though, either the interactive or the batch mode is fine (be careful with batch mode unless you're running my version of dupseek).


In the end, dupseek is better for batch jobs (though I feel that's only true with my hard-link modification), and FSlint is better for compatibility and for running on directories like home folders, where you want to leave some unique files alone. Of course, a compressed file system would be a step better, but who has the (CPU) time for that?

1 comment:

Ryan said...

Excellent. Such is the UNIX philosophy.