A lot of ink, bytes, or whatever other means of inscribing text onto a medium has been spent in the last couple of weeks on the Jeb Bush emails. There are a number of angles to take on the disclosure, takedown, redaction, and mea culpas without even getting into the content of the email messages. From a purely archival/state records perspective, L’Archivista’s two posts on the affair are (as usual) better than most anything else. As such, I won’t try to top her here.
As it happens, I’ve been thinking (wrangling? despairing?) about email recently at work. My institution acquired a large cache of email and other electronic records (not to mention the paper kind) from an administrator in the latter part of 2014. The recent events with Jeb Bush highlight problems that I had thought were particular to my case but are potentially universal.
As L’Archivista (via Fortune) points out, most of the nearly 13,000 social security numbers accidentally disclosed were from one attachment–namely a PowerPoint presentation that embedded a spreadsheet containing the SSNs. I encountered an almost identical scenario here. The presentation itself uses the data from the spreadsheet to display a graph–it is likely that the presenter had little idea that the underlying data contained sensitive information. Prior to transfer, if the office had come across a spreadsheet named “EMPLOYEE_DATA”, that would have thrown up an obvious flag. A presentation named something innocuous (DIVERSITY_PRESENTATION, e.g.), on the other hand, wouldn’t necessarily.
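To make the embedded-spreadsheet problem concrete, here is a minimal sketch of how one might flag presentations that carry other files inside them. It assumes the presentation is in the newer .pptx format, which is a ZIP package that stores embedded objects under ppt/embeddings/; older binary .ppt files are not ZIP packages and would need a different approach. The function name list_embedded_parts is my own invention for illustration, not part of any tool mentioned here.

```python
import zipfile


def list_embedded_parts(pptx_path):
    """Return the names of parts stored under ppt/embeddings/ in a .pptx.

    Assumption: pptx_path points to (or is a file-like object containing)
    an Office Open XML presentation, i.e. a ZIP package. Legacy .ppt files
    will raise zipfile.BadZipFile.
    """
    with zipfile.ZipFile(pptx_path) as package:
        return [
            name
            for name in package.namelist()
            # Skip directory entries; keep only actual embedded files.
            if name.startswith("ppt/embeddings/") and not name.endswith("/")
        ]
```

Any file this returns (an embedded workbook, for instance) would still need to be extracted and screened on its own; the sketch only tells you that an innocuous-looking presentation has something inside it.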
I have a couple of tools at my disposal that apparently the Florida State Archives did not (or maybe they did but did not act on the results)–namely, I run either Identity Finder or bulk_extractor/annotate_features on all acquisitions. Email is trickier–neither bulk_extractor nor Identity Finder understands PST. Identity Finder will search through the contents as active email folders, though, so mounting copies of the acquired PSTs in an Outlook client will at least allow me to find the messages and attachments with personally identifiable information. It’s not an ideal solution, to say the least. Other types of sensitive information, such as individual names, topics, and the like, are harder to find without text mining and other text processing tools, for which there are not yet standardized or agreed-upon best practices. Until we can find ways to better automate screening and processing, we won’t be able to satisfy requests for access with any certainty that we’re not disclosing sensitive information.
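For readers who haven’t used this kind of tool, the core idea behind pattern-based screening can be sketched in a few lines. This is only an illustration, not how Identity Finder or bulk_extractor actually work internally: it assumes the text has already been extracted from a message or attachment, it only looks for SSNs written with hyphens, and the name find_ssn_candidates is hypothetical. The pattern rules out a few number ranges that are never issued (area 000, 666, or 900–999; group 00; serial 0000), but it will still produce false positives and miss unformatted numbers.

```python
import re

# Hyphenated SSN-like strings, excluding ranges that are never issued.
# Assumption: input is plain text already extracted from a file or message.
SSN_PATTERN = re.compile(
    r"\b(?!000|666|9\d\d)\d{3}-(?!00)\d{2}-(?!0000)\d{4}\b"
)


def find_ssn_candidates(text):
    """Return all substrings of text that look like hyphenated SSNs."""
    return SSN_PATTERN.findall(text)
```

Anything this turns up is a candidate for human review rather than a confirmed hit, which is part of why screening at scale remains so labor-intensive.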
At my home institution, I’m somewhat fortunate in that we have a lengthy restriction on access to all administrative records. That lets me sleep a little better, although we have been known to waive the restriction for staff of the office or specific creator that transferred the records and for other researchers who have the office’s permission. Were the administrator who transferred the records to us to ask for access, we would likely grant it without asking too many questions. Really, then, the restriction is only kicking the can down the road, and even then only for external researchers.