Duplicate Files Finder

Today I begin a series of posts based on NARA’s 2015 report “Open Source Tools for Records Management.”  I will attempt to provide some simple user directions along with an overview of the utility of the software for records management purposes.  I will first investigate Duplicate Files Finder.  The software is available on SourceForge and can be used on Windows, Linux, and Unix platforms.  If you have used this software in your institution, I’d love to hear your feedback.

According to its homepage, Duplicate Files Finder searches for files with the same content, even if they do not share the same file name or folder location.  Users can choose to delete duplicate files or create links among files.  Rather than using a hash algorithm that must read all files completely, Duplicate Files Finder first sorts the files by size and then compares them; as soon as a difference is detected, the program moves on to another file.  This results in faster feedback than can be produced by programs running a hash algorithm.
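The size-first strategy described above can be sketched in a few lines of Python.  This is only an illustration of the general technique, not the program's actual code; the function names `same_content` and `find_duplicates` and the 64 KB chunk size are my own assumptions.

```python
import os
from collections import defaultdict

CHUNK = 64 * 1024  # bytes read per comparison step (arbitrary, assumed value)


def same_content(path_a, path_b):
    """Compare two files chunk by chunk, stopping at the first difference."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(CHUNK)
            b = fb.read(CHUNK)
            if a != b:
                return False  # difference found; no need to read further
            if not a:
                return True   # both files ended with no difference


def find_duplicates(root):
    """Return lists of paths under root whose contents are identical."""
    # Pass 1: group files by size; files of different sizes cannot match.
    by_size = defaultdict(list)
    for folder, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            if os.path.isfile(path):
                by_size[os.path.getsize(path)].append(path)

    # Pass 2: compare contents only among files of the same size.
    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        clusters = []
        for path in sorted(paths):
            for cluster in clusters:
                if same_content(cluster[0], path):
                    cluster.append(path)
                    break
            else:
                clusters.append([path])
        groups.extend(c for c in clusters if len(c) > 1)
    return groups
```

Calling `find_duplicates` on a folder returns one list per set of identical files.  Because most files differ in size or within their first few kilobytes, very little data is read for files that are not duplicates.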

Duplicate Files Finder has an easy-to-use graphical user interface.

  1. In Step 1, choose the directory you want to search for duplicates.  Click on the button highlighted in the screen shot on the right so you can drill down in your directory structure.  If you wish to include or exclude certain file extensions, you may do so in the Include and Exclude boxes, or you may leave these blank.  Similarly, you may specify minimal and maximal file sizes to be reviewed or leave these blank.  You may choose whether to include subdirectories, hidden files, and empty files in your search.  Once you have made these decisions, click on the Add button in the Edit list on the right-hand side of the dialog box.  You will see this information appear in the list of Directories in the top half of the box; you may either choose another directory to search simultaneously, or you may click on the Go! button at the bottom of the box.  (The program is capable of searching local and network drives simultaneously.)
  2. In Step 2, Duplicate Files Finder searches the directory/directories for files and compares their sizes; it then proceeds to compare the files of the same size to identify any duplicates.
  3. In Step 3, the program returns a list of results.  You have three options at this point:
    • You can export the list of duplicate files to a TXT file by clicking on Store and choosing a location and name for this file.
    • You can click on Show options and narrow down the list to only those files and their duplicates in a specific directory or which match a certain mask.
    • You can right-click on any file in the list to bring up a menu of further options.

Unfortunately, the program does not provide a mechanism for global deletes.  However, by exporting the list of duplicates, it is pretty easy to identify folders that are full of redundant copies.  Also, one user posted a comment on the SourceForge site suggesting the possibility of writing a VBA script to create a less labor-intensive method for deleting duplicate files.
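As a rough illustration of spotting folders that are full of redundant copies, here is a small Python sketch.  The layout of the exported TXT file is not described here, so the sketch assumes the list has already been read into groups of paths, one group per set of identical files; the function name `duplicates_per_folder` and the sample paths are hypothetical.

```python
import os
from collections import Counter


def duplicates_per_folder(groups):
    """Count, for each folder, how many of its files appear in a duplicate group."""
    counts = Counter()
    for group in groups:
        for path in group:
            counts[os.path.dirname(path)] += 1
    # Folders with the most duplicated files come first.
    return counts.most_common()


# Hypothetical example data: each inner list is one set of identical files.
example_groups = [
    ["photos/2015/IMG_0001.jpg", "photos/backup/IMG_0001.jpg"],
    ["photos/2015/IMG_0002.jpg", "photos/backup/IMG_0002.jpg", "photos/backup/IMG_0107.jpg"],
]
for folder, count in duplicates_per_folder(example_groups):
    print(folder, count)
```

A folder that sits at the top of this ranking is a good candidate for manual review before anything is deleted.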

For me, the most exciting discovery about this software is that it reliably identifies duplicate photo files.  I have long been frustrated by the iPhone’s process for numbering (and re-numbering) pictures each time they’re downloaded, but Duplicate Files Finder found the extra copies.  It identified over 2,000 duplicates in about 10 minutes.

Grouping files by size is admittedly a simpler first pass than computing a checksum for every file.  But in the tests that I’ve run on this program, the only false duplicates that were returned were some LDB files.  (LDB files are Microsoft Access lock information files.  An .LDB file is created when a user opens an Access database; it has the same name as the database but an .LDB extension, and it keeps track of all users who are currently accessing the database.)  So for the price (free) and the ease of use (very easy), Duplicate Files Finder is a useful tool for identifying some low-hanging fruit in the way of redundant copies that can be quickly deleted.