Ringlord Technologies Products

DeadSheep

DeadSheep scours your hard disk for identical files. You can outright delete the duplicates or (often a better idea) convert them to links, which merges the storage wasted by multiple files into a single unified block, thereby recovering the space that was previously wasted!

Q: Why DeadSheep? Do you have something against fluffy animals?

A: Not at all! The first-ever cloned animal was Dolly the Sheep. Duplicate files on your disk are clones, too, but unlike Dolly, the files are not fluffy (at least mine aren’t!) and so you may want to get rid of the waste that these clones represent. Hence: DeadSheep!

Description: Helps eliminate duplicate files on your hard disk.
Version: 1.0
License: GPLv3
Requirements: Java 1.7+
Launch (web): Coming soon: Launch DeadSheep using Java WebStart
Launch (shell): java -jar deadsheep.jar
Download: deadsheep.jar (759.5 KiB)

How Does DeadSheep Work?

There are five phases to the process; steps (3) and (4) take the most time:

  1. Collecting files: DeadSheep scans your hard disk (or a portion thereof) for the kinds of files it should process. You can have DeadSheep consider only a specific set of files (like movies or zip archives), and have it skip certain sub-directories (there is no point searching installed system software, for example).
  2. Eliminating obvious non-candidates: DeadSheep sorts these files into groups of equal size. For example, all the files of size 12345 bytes get clumped into one group, and all the ones of size 23456 go into another. Files of unequal size cannot be identical and need not be compared, so this cuts down on the amount of data that DeadSheep has to read. Also, a group with only one file in it can be ignored.
  3. Comparing file contents: DeadSheep begins the laborious comparison process. For each file in a group of equal size, DeadSheep reads the first 1024 bytes to build a “fingerprint” of the data (a digest hash). Files of equal size whose fingerprints differ cannot be the same. This cuts down on the number of files that must be read in full.
  4. Those files of equal size whose first 1024 bytes produced the same fingerprint are now compared in full. Each of these files is read only once, and two(!) fingerprints are created using different algorithms, which makes a false match all but impossible.
  5. And finally, DeadSheep presents you with a listing that includes the size of matching files, how many there are, and how much wasted disk space you would recover if you were to retain only one of them in some form.
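
Phases (2) and (3) above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not DeadSheep's actual code: the class name CandidateFilter and its methods are hypothetical, and MD5 is assumed as the quick fingerprint simply because it is one of the choices DeadSheep offers.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of phases 2 and 3; not DeadSheep's actual code. */
public class CandidateFilter {

    /** Phase 2: bucket files by size and drop buckets that hold a single file. */
    static List<List<Path>> groupBySize(List<Path> files) throws IOException {
        Map<Long, List<Path>> bySize = new HashMap<Long, List<Path>>();
        for (Path file : files) {
            addTo(bySize, Long.valueOf(Files.size(file)), file);
        }
        return multiFileGroups(bySize);
    }

    /** Phase 3: within one size group, bucket files by a digest of their first 1024 bytes. */
    static List<List<Path>> groupByPrefix(List<Path> sameSize)
            throws IOException, NoSuchAlgorithmException {
        Map<String, List<Path>> byPrefix = new HashMap<String, List<Path>>();
        for (Path file : sameSize) {
            addTo(byPrefix, prefixFingerprint(file), file);
        }
        return multiFileGroups(byPrefix);
    }

    /** MD5 (an assumed choice) over at most the first 1024 bytes, as a hex string. */
    static String prefixFingerprint(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buffer = new byte[1024];
        int total = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            // read() may return fewer bytes than requested, so loop until full or end of file
            while (total < buffer.length
                    && (n = in.read(buffer, total, buffer.length - total)) != -1) {
                total += n;
            }
        }
        md.update(buffer, 0, total);
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    private static <K> void addTo(Map<K, List<Path>> map, K key, Path file) {
        List<Path> group = map.get(key);
        if (group == null) {
            group = new ArrayList<Path>();
            map.put(key, group);
        }
        group.add(file);
    }

    /** A group with only one file in it cannot contain duplicates, so it is skipped. */
    private static <K> List<List<Path>> multiFileGroups(Map<K, List<Path>> map) {
        List<List<Path>> result = new ArrayList<List<Path>>();
        for (List<Path> group : map.values()) {
            if (group.size() > 1) {
                result.add(group);
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        List<Path> files = new ArrayList<Path>();
        for (String arg : args) {
            files.add(Paths.get(arg));
        }
        for (List<Path> sizeGroup : groupBySize(files)) {
            for (List<Path> candidates : groupByPrefix(sizeGroup)) {
                System.out.println("Still candidates: " + candidates);
            }
        }
    }
}
```

Files that survive both filters are only candidates; they still have to pass the full comparison of phase (4) before they count as duplicates.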

You can select any one set of files in step (5) to examine the files, delete one or more of them, move them somewhere else on your disk(s), or convert some of them into links, which preserves the files but joins their actual storage and recovers disk space.
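
Converting a duplicate into a hard link could look roughly like the following Java sketch. How DeadSheep performs this step internally is not documented here, so treat everything in it as an assumption: the class LinkConverter, the method replaceWithHardLink, and the ".deadsheep-tmp" suffix are all hypothetical names. The idea shown is to create the link under a temporary name first and then move it over the duplicate, so that the duplicate's name is never missing if something fails half-way.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

/** Hypothetical sketch of a "convert to hard link" action; not DeadSheep's actual code. */
public class LinkConverter {

    /** Replaces 'duplicate' with a hard link to 'keeper' (both must be on the same file system). */
    static void replaceWithHardLink(Path keeper, Path duplicate) throws IOException {
        if (Files.isSameFile(keeper, duplicate)) {
            return; // already share their storage, nothing to do
        }
        // Hypothetical temporary name next to the duplicate
        Path temp = duplicate.resolveSibling(duplicate.getFileName() + ".deadsheep-tmp");
        Files.createLink(temp, keeper); // new directory entry, same storage as keeper
        try {
            // Whether ATOMIC_MOVE replaces an existing target is implementation specific;
            // if it is refused, an IOException is thrown and the duplicate stays untouched.
            Files.move(temp, duplicate, StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            Files.deleteIfExists(temp);
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java LinkConverter <file-to-keep> <duplicate>");
            return;
        }
        replaceWithHardLink(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

After the call both names refer to the same storage, so the space formerly used by the duplicate is freed once nothing else references it.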

Obviously you will get the greatest benefit by working with the largest files first. You should also think about how certain files are accessed, what software expects them to be in a particular state, and whether you really want to join their storage: You probably do not want to join the storage of a working file with a freshly created backup, because updates made in place to the primary file would then affect the backup, too, which defeats the purpose of the backup.

In other words, do not plow ahead blindly. DeadSheep is a powerful tool. Use it judiciously, use it wisely! :-)

Hard links? Symbolic links?

If you are a Linux, BSD, Mac OS/X, etc. user, you may already know the difference between hard and symbolic links, but Microsoft® Windows® users are quite likely unfamiliar with it, and with the concept of hard links in particular. This section is for you!

A file on disk consists of two parts:

  1. The directory entry — think of it as the entry in a book’s table of contents, which has a page number where the story is to be found
  2. The file itself

In most cases you have one entry in the table of contents (one file descriptor in a directory) that leads you directly to one story (file storage). Just as nothing would prevent a publisher from adding more than one reference in the table of contents to the same story, but under several different names, there can be many references on your disk to the exact same file.

Links, whether hard or symbolic, should act like a normal file in every respect. Open a link, and you have actually opened the file that the link points to. This doesn’t quite work for Microsoft® Windows® shortcuts, alas: A shortcut is stored as a .LNK file, and software under Windows® has to explicitly recognize and follow it (most software does), otherwise it gets the content of the .LNK file instead. It’s a bit kludgy, so shortcuts under Windows® are certainly second-class citizens.

Hard Links

A hard link looks exactly like any other file, and there is no way to distinguish any one of them as the “master” file because they are all the same, no matter what directory they are in or what they are named.

Delete one hard link and all you’ve really done is remove an alias, a reference to the actual storage. Delete the last such reference, and only then is the file storage eliminated, too. Under Microsoft® Windows® no more than 1024 hard link references to a file can exist, but under Unix-like operating systems the number is practically unlimited.
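
A small Java experiment makes this behavior visible. It is a minimal sketch, assuming a file system that supports hard links; the class HardLinkDemo and the file names are made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Illustrative only: shows that deleting one hard link leaves the storage intact. */
public class HardLinkDemo {

    /** Creates a file, adds a second hard link, deletes the first name, reads via the second. */
    static String readViaSecondName(Path dir, String text) throws IOException {
        Path first = dir.resolve("story.txt");
        Path second = dir.resolve("same-story.txt");
        Files.write(first, text.getBytes(StandardCharsets.UTF_8));
        Files.createLink(second, first); // second directory entry, same storage
        Files.delete(first);             // removes one reference only
        return new String(Files.readAllBytes(second), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("hardlink-demo");
        System.out.println(readViaSecondName(dir, "Once upon a time"));
    }
}
```

The text is still readable through the second name even though the name it was originally written under is gone; the storage would only be released after deleting same-story.txt as well.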

The one limitation of hard links is that they cannot “cross” to another file system (or disk). You cannot create a hard link from your hard disk to a file on an external USB drive, for example, or from your C: drive to a file on the D: drive. Only symbolic links can do that.

File systems under Unix®-like operating systems have supported hard links for a long, long time, but the Microsoft® Windows® NT File System (NTFS) supports them, too! It’s just that Microsoft has not provided much of an interface for them, other than the “mklink” command of the Windows® command prompt, and hasn’t bothered advertising them, either. That is probably because most files, when they are saved, are re-written from scratch, which leaves the original storage and the other hard links behind and creates a brand-new, unshared file. This behavior could be difficult to explain if you don’t understand the underlying principles, and so Microsoft chose to go with symbolic links only.

Symbolic Links

Symbolic links, or “symlinks”, are links in name only. Move, rename, or delete what they point at, and the link is left dangling helplessly (broken) with no easy way to fix it.

Aside from their relative fragility, symbolic links have two benefits over hard links: They are easy to identify (there is one definite “master”, and all the symbolic links to that master are quite obviously merely references), and they can cross file systems (they’re just names!).
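
The fragility is easy to demonstrate in Java. The sketch below is illustrative only; the class SymlinkDemo and the helper isDangling are hypothetical names, and creating symbolic links may require special privileges under Windows®.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Illustrative only: detects a symbolic link whose target no longer exists. */
public class SymlinkDemo {

    /** True if 'link' is a symbolic link that points at nothing. */
    static boolean isDangling(Path link) {
        // isSymbolicLink() looks at the link itself; exists() follows it to the target
        return Files.isSymbolicLink(link) && !Files.exists(link);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("symlink-demo");
        Path master = Files.createFile(dir.resolve("master.txt"));
        Path link = Files.createSymbolicLink(dir.resolve("link.txt"), master);

        System.out.println("dangling before delete: " + isDangling(link));
        Files.delete(master); // the link now points at a name that is gone
        System.out.println("dangling after delete:  " + isDangling(link));
    }
}
```

Unlike the hard link case, deleting the master really does destroy the data; the symbolic link merely remembers a name that no longer leads anywhere.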

What To Do With Diskspace-Wasting Files?

The first thing you should know is that there are many reasons for the existence of duplicate files on your hard disk. At least some of these are quite legitimate, albeit annoying and wasteful.

Do not blindly delete files simply because there are multiple copies of them: It is often the case that duplicate files are libraries (.DLL or .so files) that you can’t just throw away without breaking the software that requires them.

On a Windows® system, create a hard link if at all possible: Whenever files are fairly static (are not frequently modified) a hard link is the cleanest and least complicated way for the files to share storage without advertising this fact to the software that could otherwise have difficulties with symbolic links.

Screenshots!

Yay, everyone loves screenshots, right?

The following sections showcase the user interface. It should give you an idea of what to expect. DeadSheep remembers virtually all settings and restores them the next time you launch it: Window sizes and positions, split pane positions, chosen values, column width, etc.

Click on the images below to see full-size versions in another window/tab. The screenshots were produced under Linux with a KDE desktop, but DeadSheep runs just as well under Microsoft® Windows® and is expected to work on Mac OS/X and any other operating system with Java 1.7 installed.

The Main Window

[Screenshot: DeadSheep Main Window]

This is what you see when you launch DeadSheep: It remembers the last 10 directories you’ve picked, the optional file patterns to be included (example: process only PDF, ODT, and AVI files), directories to be excluded (skip the C:\WINDOWS directory), whether to follow symbolic links, process hidden files/directories, and limits on the file size and/or modification date.
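
To give a feel for what include patterns and excluded directories mean in practice, here is a rough Java sketch of such a scan. It does not reflect DeadSheep's real implementation or its pattern syntax: the class FileCollector, the method collect, and the use of Java's "glob" syntax are assumptions made for the example.

```java
import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of a filtered directory scan; not DeadSheep's actual code. */
public class FileCollector {

    /**
     * Walks 'root' and returns regular files whose name matches 'includeGlob'
     * (for example "*.{pdf,odt,avi}"), skipping every directory in 'excludedDirs'.
     * Symbolic links are not followed, which is the default of walkFileTree.
     */
    static List<Path> collect(Path root, String includeGlob, final Set<Path> excludedDirs)
            throws IOException {
        final PathMatcher matcher =
                FileSystems.getDefault().getPathMatcher("glob:" + includeGlob);
        final List<Path> found = new ArrayList<Path>();
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) {
                return excludedDirs.contains(dir)
                        ? FileVisitResult.SKIP_SUBTREE
                        : FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                // Match against the bare file name, not the whole path
                if (attrs.isRegularFile() && matcher.matches(file.getFileName())) {
                    found.add(file);
                }
                return FileVisitResult.CONTINUE;
            }
        });
        return found;
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        Set<Path> none = Collections.<Path>emptySet();
        for (Path file : collect(root, "*.{pdf,odt,avi}", none)) {
            System.out.println(file);
        }
    }
}
```

Note that glob matching in this sketch is case sensitive on most Unix-like systems, so "*.pdf" would not pick up a file named REPORT.PDF there.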

The more technical aspect concerns the algorithms used to perform the first and second phases of comparison. In the screenshot, the first (small) section of the files is compared using the MD5 hash; if a match is detected, the full file is then compared with SHA-1. Other choices are available.

The Scanning Window

[Screenshot: DeadSheep Scanning Window]

Once you press the “Scan” button, DeadSheep looks for files matching your criteria, and then begins the process of comparing the files against each other.

Only files of equal size that matched in the quick check and then stood up to a full comparison are considered equal. All file comparisons are performed using ‘digest hashes’, which eliminates the need to read files over and over again, even if a file has to be compared against a million others. Comparing the hashes is many, many times faster than cross-comparing all of your files byte by byte.
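
Reading a file once while feeding two digests at the same time might be sketched like this in Java. The pairing of MD5 and SHA-1 is an assumption for the example (both are among the algorithms mentioned for DeadSheep, but which two it combines for the full comparison is not stated), and the class DualDigest is a hypothetical name.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical sketch: two fingerprints from a single pass over a file. */
public class DualDigest {

    /** Returns { MD5 hex, SHA-1 hex } for the given file, reading it only once. */
    static String[] fingerprints(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        byte[] buffer = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                // Every chunk read from disk is fed to both digests
                md5.update(buffer, 0, n);
                sha1.update(buffer, 0, n);
            }
        }
        return new String[] { toHex(md5.digest()), toHex(sha1.digest()) };
    }

    private static String toHex(byte[] bytes) {
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        for (String arg : args) {
            String[] fp = fingerprints(Paths.get(arg));
            System.out.println(fp[0] + " " + fp[1] + " " + arg);
        }
    }
}
```

Two files are then treated as identical only if both fingerprints agree, and comparing a file against any number of others costs nothing more than comparing these short strings.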

The Resolve Window

[Screenshot: DeadSheep Resolving Window]

When the scanning process is completed, DeadSheep gives you a window with all the groups ordered by size (that’s the left column), with values that you can display in various abbreviated forms. The larger section shows you the current selection of matching file groups. Each group will always have at least 2 files, of course.

From the set of matching files, a popup menu lets you open the file, or (by double-clicking) open the directory where it is located. You can also convert the file (or multiple files) into links to one of the other files, which recovers storage space without eliminating the file references. The other operations involve renaming a file (this is there for convenience), or outright wiping out one or more sheep, which is what a wolf would do. ;-)

If you close this window you are telling DeadSheep that you are finished. You are then returned to the main window, from where you can either start another scan, or exit DeadSheep.

Miscellaneous

DeadSheep requires Java 1.7 or later. It has been tested under Linux and Microsoft® Windows® 7, but for lack of a download from Apple, no testing could be performed on Mac OS/X (sorry, complain to Apple).

Send BitCoins to 12GoFABCjWYbmxKSdQbfps9PmaRcQmgZod specifically in support of DeadSheep’s development!

All content is copyright © Ringlord Technologies unless otherwise stated. We do encourage deep linking to our site's pages but forbid direct reference to images, software or other non-page resources stored here; likewise, do not embed our content in frames or other constructs that may mislead the reader about the content ownership. Play nice, yes?
