Beth Cron and I had a good conversation with Seth Shaw and Jeremy Gibson about open source records management tools. If you didn’t have a chance to join us last week, you can still view it at https://www.youtube.com/watch?v=1GRQBUjtOT8.
Here are some of the highlights:
Open source tools can be very valuable when they have a strong community around them. This support leads to active development, which makes for a sustainable tool. Being able to see other people’s work also provides more entry points to solving your own problem.
The potential downsides of open source tools are that they tend to have less documentation and that projects can be abandoned.
Jeremy pointed out that one of the benefits of open source tools is that you can more easily find tools that do one thing very well — and then stitch those solutions together to accomplish all of your RM needs.
In answer to a query about extracting metadata from media files, Jeremy pointed to MediaInfo and JHOVE.
One particular gap identified among existing open source tools is a tool to handle redactions.
Before adopting a records management tool, it’s important to document your functional requirements and your organizational requirements (e.g., budget, IT support). Only then can you make sure you’re choosing the right tool for your purposes.
Today I offer my third post in a series based on NARA’s 2015 report “Open Source Tools for Records Management.” I investigated MUSE (Memories USing Email), which was developed at Stanford University. It is available for use in a Windows, Mac, or Linux environment. I conducted my tests using Windows.
This program is a visualization tool for analyzing emails. It is still in active development, and it currently incorporates six tools:
tracks the ingress and egress of email correspondents and groups people who “go together” based on their receipt of messages
allows you to browse attachments on a Piclens 3D photo wall
offers the possibility of personalizing web pages by highlighting terms also found in your email (requires the use of a browser extension)
creates crossword puzzles based on your email archive
Once you download the executable file from the above site, the program runs locally on your computer. Muse can be deployed on a static archive of email messages (e.g., an mbox file) or it can fetch email from online accounts for which you have the email address, password, and server settings. It defaults to analyzing Sent mail, based on the principle that those messages more accurately reflect the topics and people with which the account owner is most engaged, but you can also include additional folders. You can then browse all messages in the embedded viewer — without having to open each message individually — or you can use any of the tools listed above.
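To give a feel for why Sent mail is a useful starting point, here is a minimal sketch that tallies the recipients in an mbox file using Python's standard library. This is not how MUSE itself is implemented; the function name `count_recipients` and the choice of the To and Cc headers are my own assumptions for illustration.

```python
import mailbox
from collections import Counter
from email.utils import getaddresses


def count_recipients(mbox_path):
    """Count how often each address appears in the To and Cc headers.

    Hypothetical helper for illustration only; MUSE's own grouping of
    correspondents is more sophisticated than a simple tally.
    """
    counts = Counter()
    box = mailbox.mbox(mbox_path)
    try:
        for message in box:
            headers = message.get_all("To", []) + message.get_all("Cc", [])
            # Convert to plain strings in case a header was parsed as an object.
            for _name, address in getaddresses([str(h) for h in headers]):
                if address:
                    counts[address.lower()] += 1
    finally:
        box.close()
    return counts
```

Running `count_recipients` on an export of a Sent folder and calling `.most_common(10)` on the result lists the ten addresses written to most often.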
The sentiment analyzer uses natural language processing and a built-in lexicon, but it can be customized by the user to identify desired terms (see Edit Lexicon highlighted above to access the screen below).
According to their tip sheet for journalists, MUSE “was originally meant for people to browse their own long-term email archives. We have now started adapting it for journalists, archivists and researchers.” Because this lightweight tool is easy to use, it could be a simple way for repositories to provide email analysis to researchers. This same tip sheet defines the “sweet spot” for the software as archives with about 50,000 messages.
If you’re interested in learning more about MUSE, a Ph.D. dissertation and a number of papers are available here. There’s also a video that argues for the value of analyzing personal digital archives. This project dovetails with the work being done at Stanford on ePADD — check out our Hangout for more information on that project.
“Fixity is a utility for the documentation and regular review of stored files. Fixity scans a folder or directory, creating a manifest of the files including their file paths and their checksums, against which a regular comparative analysis can be run. Fixity monitors file integrity through generation and validation of checksums, and file attendance through monitoring and reporting on new, missing, moved and renamed files. Fixity emails a report to the user documenting flagged items along with the reason for a flag, such as that a file has been moved to a new location in the directory, has been edited, or has failed a checksum comparison for other reasons. Supplementing tools like BagIt that review files at points of exchange, when run regularly Fixity becomes a powerful tool for monitoring digital files in repositories, servers, and other long-term storage locations.”
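The manifest-and-compare idea in that description can be sketched in a few lines of Python. This is only an illustration of the general technique, not Fixity's actual code; the function names (`file_checksum`, `build_manifest`, `compare_manifests`) and the report categories are assumptions of mine.

```python
import hashlib
import os


def file_checksum(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of one file, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root, algorithm="sha256"):
    """Map each file path (relative to root) to its checksum."""
    manifest = {}
    for folder, _subfolders, names in os.walk(root):
        for name in names:
            full_path = os.path.join(folder, name)
            relative = os.path.relpath(full_path, root)
            manifest[relative] = file_checksum(full_path, algorithm)
    return manifest


def compare_manifests(old, new):
    """Classify the differences between an earlier and a later manifest."""
    report = {"new": [], "missing": [], "changed": [], "moved": []}

    # Same path, different checksum: the content was edited or corrupted.
    for path, checksum in old.items():
        if path in new and new[path] != checksum:
            report["changed"].append(path)

    # Index checksums of paths that disappeared, to spot moves and renames.
    vanished = {}
    for path, checksum in old.items():
        if path not in new:
            vanished.setdefault(checksum, []).append(path)

    for path, checksum in sorted(new.items()):
        if path in old:
            continue
        candidates = vanished.get(checksum)
        if candidates:
            # Same content at a different path: moved or renamed.
            report["moved"].append((candidates.pop(0), path))
        else:
            report["new"].append(path)

    # Anything still unmatched is genuinely missing.
    for paths in vanished.values():
        report["missing"].extend(paths)

    for entries in report.values():
        entries.sort()
    return report
```

Saving the output of `build_manifest` after each run and passing the previous and current versions to `compare_manifests` gives the same kind of new, missing, moved, and changed categories that the description mentions.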
I performed my evaluation in a Windows environment. After you download the ZIP file, unzip those files onto a local drive. Inside the Fixity folder, drill down to fixity-win and open the executable file (Fixity.exe). (You’ll also see a PDF copy of the user guide saved in the fixity-win-0.5 folder.)
You can use the simple GUI to run a manual check on stored files or to schedule projects. The only problem I encountered is that without administrative rights on your computer, the scheduled scans will always return results indicating that the files in the designated folder have been removed. You can avoid this problem either by running the software as an administrator or by doing manual scans. Either way, you first need to set up the parameters of your project. In order to choose the directory that you want to scan, click on the button and find the appropriate folder in your file structure. You can designate up to 7 directories.
You can save your project by clicking on File and Save Settings (Ctrl + S). At this point, you can run a manual scan by clicking on File and Run Now (Ctrl + R). The results will be saved in your reports folder as a .TSV file (Tab Separated Values). AVPreserve provides a ZIP file with the materials from a spreadsheet workshop that demonstrate how to open a TSV file in Microsoft Excel, but here’s a brief summary of how to do it using Excel 2016:
Click on the Data tab.
Click the arrow beside Get External Data and choose From Text.
Navigate to your reports folder and select the TSV file you wish to open. You’ll have to make sure you choose All Files (rather than the default Text Files) in order to see your TSV files. Click on Import.
The defaults in the first step of the Text Import Wizard are correct, so click on Next.
The defaults in the second step of the Text Import Wizard are correct, so click on Next.
In step 3, click on the radio button beside Text under Column data format and click on Finish.
Choose whether you want to put the data in an existing worksheet or a new worksheet and click OK.
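If you would rather skip Excel, a TSV file can also be read with a short script. The sketch below uses Python's built-in `csv` module and keeps every field as text, much like choosing Text in step 3 of the wizard. I have not assumed anything about the columns in the Fixity report; the function name `read_tsv` and the UTF-8 encoding are assumptions you may need to adjust.

```python
import csv


def read_tsv(path, encoding="utf-8"):
    """Return the rows of a tab-separated file as lists of strings.

    The encoding is an assumption; change it if the report looks garbled.
    """
    with open(path, newline="", encoding=encoding) as handle:
        # Every field stays a string, so values such as checksums with
        # leading zeros are not reinterpreted as numbers.
        return [row for row in csv.reader(handle, delimiter="\t")]
```

Each row comes back as a list of strings, so you can filter or sort the rows in Python instead of in a spreadsheet.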
The report provides a summary of the results at the top, and you can easily sort this spreadsheet to group the files into categories:
The default settings tell Fixity to review all files, but you can click on Preferences and Filter Files and put a check in the box beside Ignore Hidden Files. The report at the left shows some temporary files that were evaluated. You can also tell Fixity to ignore these files by going to Filter Files, and in the Add Filter box, type
Click on Save & Close, and the next time you run a report, those temporary files will not be shown. By clicking on Preferences and Select Checksum Algorithm, you can choose SHA256 or MD5.
You can find more information about the capabilities of Fixity in the user guide. (For instance, you can configure it to generate report emails.) Although there are some limitations that I identified above, I still find this to be an easy-to-use tool for identifying changes to files in a digital repository. If you use this tool, I’d love to hear your feedback, and if you have an interest in seeing other open source tools evaluated on this site, please let me know.
Today I begin a series of posts based on NARA’s 2015 report “Open Source Tools for Records Management.” I will attempt to provide some simple user directions along with an overview of the utility of the software for records management purposes. I will first investigate Duplicate Files Finder. The software is available on SourceForge and can be used on Windows, Linux, and Unix platforms. If you have used this software in your institution, I’d love to hear your feedback.
According to its homepage, Duplicate Files Finder searches for files with the same content, even if they do not share the same file name or folder location. Users can choose to delete duplicate files or create links among files. Rather than using a hash algorithm that must read all files completely, Duplicate Files Finder first sorts the files by size and then compares them; as soon as a difference is detected, the program moves on to another file. This results in faster feedback than can be produced by programs running a hash algorithm.
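Here is a rough Python sketch of that size-first approach, as I understand it from the homepage description. It is not the program's actual source code; the function names `same_content` and `find_duplicates` are my own, and the sketch ignores options such as file masks and size limits.

```python
import os
from collections import defaultdict


def same_content(path_a, path_b, chunk_size=65536):
    """Compare two files chunk by chunk, stopping at the first difference."""
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            chunk_a = a.read(chunk_size)
            chunk_b = b.read(chunk_size)
            if chunk_a != chunk_b:
                return False  # early exit: no need to read the rest
            if not chunk_a:
                return True  # both files ended without a difference


def find_duplicates(root):
    """Return groups of files under root that have identical content."""
    # Step 1: group files by size; files of different sizes cannot match.
    by_size = defaultdict(list)
    for folder, _subfolders, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            by_size[os.path.getsize(path)].append(path)

    # Step 2: within each size group, compare actual content.
    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        clusters = []  # each cluster holds files with identical content
        for path in sorted(paths):
            for cluster in clusters:
                if same_content(cluster[0], path):
                    cluster.append(path)
                    break
            else:
                clusters.append([path])
        groups.extend(cluster for cluster in clusters if len(cluster) > 1)
    return sorted(groups)
```

Because most files in a folder have a unique size, the first step rules them out without opening them at all, which is where the speed advantage over hashing every file comes from.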
Duplicate Files Finder has an easy-to-use graphical user interface.
Choose the directory you want to search for duplicates. Click on the button highlighted in the screen shot on the right so you can drill down in your directory structure. If you wish to include or exclude certain file extensions, you may do so in the Include and Exclude boxes, or you may leave these blank. Similarly, you may specify minimal and maximal file sizes to be reviewed or leave these blank. You may choose whether to include subdirectories, hidden files, and empty files in your search. Once you have made these decisions, click on the Add button in the Edit list on the right-hand side of the dialog box. You will see this information appear in the list of Directories in the top half of the box; you may either choose another directory to search simultaneously, or you may click on the Go! button at the bottom of the box. (The program is capable of searching local and network drives simultaneously.)
In Step 2, Duplicate Files Finder searches the directory/directories for files and compares their sizes; it then proceeds to compare the files of the same size to identify any duplicates.
In Step 3, the program returns a list of results. You have three options at this point:
You can export the list of duplicate files to a TXT file by clicking on Store and choosing a location and name for this file.
You can click on Show options and narrow down the list to only those files and their duplicates in a specific directory or which match a certain mask.
You can right click on any file in the list, and you will be given a list of options:
Unfortunately, the program does not provide a mechanism for global deletes. However, by exporting the list of duplicates, it is pretty easy to identify folders that are full of redundant copies. Also, one user posted a comment on the SourceForge site suggesting the possibility of writing a VBA script to create a less labor-intensive method for deleting duplicate files.
For me, the most exciting discovery about this software is that it reliably identifies duplicate photo files. I have long been frustrated by the iPhone’s process for numbering (and re-numbering) pictures each time they’re downloaded, but Duplicate Files Finder found the extra copies. It identified over 2,000 duplicates in about 10 minutes.
Obviously, comparing file size is not as sophisticated a comparison mechanism as a checksum. But in the tests that I’ve run on this program, the only false duplicates that were returned were some LDB files. (LDB files are Microsoft Access lock information files. An .LDB file is created when an Access database is opened/accessed by a user — the file is created with the same name as the Access database, but with an .LDB extension. The file is used to keep track of all users that are currently accessing the database.) So for the price — free — and the ease of use — very easy — Duplicate Files Finder is a useful tool for identifying some low-hanging fruit in the way of redundant copies that can be quickly deleted.