After the digital materials you want to work with have been transferred, the next step is review and appraisal. This is where you have the chance to take stock of what you have, make working copies of the materials, and gather important information to be used during the accessioning process. This section will discuss a number of steps that you may want to take when doing your own appraisal, and explore the unique issues associated with appraising born-digital materials, but it is expected that readers will be familiar with the general theory and principles of appraisal.
Preparing a processing station
As briefly discussed in the transfer section, it is important to prepare a clean workstation to use while working with born-digital material. ‘Clean’ in this case means that it is isolated from potential sources of harm by being regularly scanned for viruses.
Before you start, your workstation should have all of the hardware peripherals and software you will need for pre-ingest procedures installed, and it should be scanned for viruses again. Different institutions have different needs when it comes to how powerful and versatile their workstations need to be. For example, disk images may take up a lot of space, because a raw disk image is the size of the entire capacity of the original disk regardless of how much of that disk was in use. If you know that you are going to be creating a large number of disk images, the workstation you use should have access to enough storage space to house the ones you are working on. See this page’s Resources section for additional guidance on building a forensic workstation.
You should continue to regularly monitor the fixity of your files by re-validating their checksums, as any work you do on the files has a chance of unintentionally altering them. Additionally, if they are altered, regularly tracking checksums can give you an idea of which process is causing the changes.
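As an illustration of what re-validating checksums can look like, here is a minimal Python sketch. It assumes SHA-256 as the checksum algorithm and a single folder of files; the function names are hypothetical and not part of any tool mentioned on this page. Your institution may use a different algorithm or a dedicated fixity tool.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_checksums(root):
    """Map each file path (relative to root) to its checksum."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_checksums(baseline, current):
    """Report files that changed, disappeared, or appeared since the baseline."""
    return {
        "changed": sorted(
            k for k in baseline if k in current and baseline[k] != current[k]
        ),
        "missing": sorted(k for k in baseline if k not in current),
        "added": sorted(k for k in current if k not in baseline),
    }
```

Running record_checksums once after transfer and again after each processing step, then passing both results to compare_checksums, shows which step (if any) altered a file.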
Creating working copies of files
Accidents can always happen when working with born-digital materials. Files could be accidentally damaged while you work with them, and the last thing you want to do is lose the only copy of an important file. Therefore, you should create copies of the materials you are appraising, which can then be processed and arranged instead of the originals during future steps. Depending on how valuable these materials are, you may even wish to store another copy in a separate location.
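One possible way to make and verify a working copy is sketched below in Python, using only the standard library. The function names are hypothetical, and the sketch assumes the originals sit in an ordinary folder rather than inside a disk image.

```python
import filecmp
import shutil
from pathlib import Path


def make_working_copy(original_dir, working_dir):
    """Copy a directory tree, keeping timestamps where the platform allows."""
    # copytree copies files with shutil.copy2 by default, which tries to
    # preserve modification times. It raises FileExistsError if working_dir
    # already exists, which guards against overwriting an earlier copy.
    shutil.copytree(original_dir, working_dir)
    return Path(working_dir)


def mismatched_files(original_dir, working_dir):
    """Return relative paths that differ in, or are absent from, the copy."""
    original_dir, working_dir = Path(original_dir), Path(working_dir)
    relative = [
        p.relative_to(original_dir).as_posix()
        for p in original_dir.rglob("*")
        if p.is_file()
    ]
    # shallow=False compares file contents instead of trusting size and date.
    _match, mismatch, errors = filecmp.cmpfiles(
        original_dir, working_dir, relative, shallow=False
    )
    return sorted(mismatch + errors)
```

An empty list from mismatched_files suggests the working copy is complete; anything it returns should be investigated before you begin processing.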
Retrieving directory listing
A directory listing consists of a list of all materials you are accessioning, as well as the folder structure they have been placed in. Creating a directory listing is a useful step during appraisal, as it both preserves a record of the original file structure and prepares a list of the materials to guide the review process and to use during accessioning and arrangement. You might also eventually wish to include the directory listing as part of the final SIP. Archivists may also wish to review the directory listing with the donor(s), if possible, to ensure that the materials acquired were all intended to be donated.
It is possible to print a directory listing using the command line interface (for example, with the ‘tree’ command in Linux); however, it is also easy to use a program created for the purpose. There are several options listed under the tools section, including Karen’s Directory Printer, Dirlister, and Directory List and Print. These programs allow you to create directory listings and export them to a text file, or in some cases an Excel or Microsoft Word file.
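If you would rather script the listing yourself, a short Python sketch such as the following can produce a plain text file. The output layout (relative path, a tab, then the size in bytes) and the function names are assumptions made for this example, not the format produced by any of the programs named above.

```python
import os
from pathlib import Path


def directory_listing(root):
    """Return lines describing every folder and file below root."""
    root = Path(root)
    lines = []
    for folder, subfolders, files in os.walk(root):
        subfolders.sort()  # keeps the walk order predictable
        rel_folder = Path(folder).relative_to(root)
        if rel_folder != Path("."):
            lines.append(rel_folder.as_posix() + "/")
        for name in sorted(files):
            path = Path(folder) / name
            rel = path.relative_to(root).as_posix()
            lines.append(f"{rel}\t{path.stat().st_size} bytes")
    return lines


def write_listing(root, output_file):
    """Save the listing as a UTF-8 text file, one entry per line."""
    text = "\n".join(directory_listing(root)) + "\n"
    Path(output_file).write_text(text, encoding="utf-8")
```

Save the output file outside the folder being listed so that the listing does not include itself.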
If the transfer was accompanied by a manifest, match the files in the directory listing against it to make sure everything is there.
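The comparison can be scripted. The Python sketch below assumes a hypothetical manifest format: a plain text file with one relative file path per line. Real manifests vary (spreadsheets, checksum files, and so on), so the parsing step would need adapting.

```python
from pathlib import Path


def compare_to_manifest(root, manifest_file):
    """Compare the files found under root with the paths listed in a manifest."""
    root = Path(root)
    found = {
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    }
    # Assumed manifest format: one relative path per line, blank lines ignored.
    listed = {
        line.strip()
        for line in Path(manifest_file).read_text(encoding="utf-8").splitlines()
        if line.strip()
    }
    return {
        "missing_from_transfer": sorted(listed - found),
        "not_in_manifest": sorted(found - listed),
    }
```

Files reported under missing_from_transfer are worth raising with the donor, while files under not_in_manifest may be material that was not intended to be donated.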
Checking files and formats
One of the key parts of the review and appraisal of born-digital materials is determining which formats the files are in. Knowing the format is essential for planning long-term digital preservation, as it gives you clues about what data these files contain and what resources you will need to work with them. For example, a JPG contains different information than an MP3 or an MP4 file, and these differences will inform how you process them in future steps. It is also important to know whether each format is proprietary or open, whether it has built-in rights management, and whether it is widely used or a legacy format. Some formats may need special software in order to be accessed, and difficulties can arise if that software is unavailable or obsolete. All of these details will affect your appraisal decisions. Some institutions may make appraisal decisions based on the difficulty of accessing and migrating files versus their potential value; a file in a difficult-to-access proprietary format may not merit retention. Finally, start thinking about your rules for format normalization and other processes. Knowing ahead of time how Archivematica will treat certain formats will allow you to adjust and test file format rules and commands.
Archivematica uses the PRONOM-based programs FIDO and Siegfried to identify file formats during the ingest process. However, it is recommended that you also use a file identification program as part of appraisal, before you are ready to upload the files as a SIP. The tools section contains several suggestions for programs to use, including Brunnhilde, DROID, JHOVE, and FITS.
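To show why identification tools look inside a file rather than trusting its extension, here is a deliberately simplified Python sketch. It checks only a handful of well-known leading byte sequences and is in no way a substitute for PRONOM-based tools such as DROID or Siegfried; the labels and function names are made up for this example.

```python
from collections import Counter
from pathlib import Path

# A few well-known leading byte sequences. Real identification tools use
# far richer signatures and can distinguish format versions.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"PK\x03\x04", "ZIP-based"),
]


def guess_format(path):
    """Guess a format label from the first bytes of a file."""
    with open(path, "rb") as handle:
        header = handle.read(16)
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return "unidentified"


def format_summary(root):
    """Count how many files under root fall under each guessed label."""
    return Counter(
        guess_format(p) for p in Path(root).rglob("*") if p.is_file()
    )
```

A file named report.doc that begins with the bytes %PDF- would be reported as a PDF, which is the kind of mismatch that is useful to catch during appraisal.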
Reviewing for personal information
Just as you must keep privacy issues in mind when working with physical documents, you should expect to encounter Personally Identifiable Information (PII) in born-digital files. Examples of PII commonly found in born-digital files include credit card numbers, Social Insurance Numbers (SINs), names, and addresses.
PII and other sensitive information are particular risks when creating disk images, as a physical disk image contains deleted files, which the donor may not have realized would be accessible by the archivist. This can include anything from images and draft documents to browser search histories. It is a good idea to make donors aware of the possibility of recovering deleted files from donated drives when they consult with you.
While they cannot locate all forms of PII, there are programs that can search for certain kinds of personally identifiable data within a disk image. One widely-used example is Bulk Extractor, which identifies the more common forms of PII like credit card and social security numbers or passwords. At this time, identifying PII in an automated way is an inexact science, and archivists will have to consider the balance of methods and efforts required to identify PII. For a deeper discussion of this subject, see Ben Goldman and Timothy Pyatt’s article, “Security Without Obscurity: Managing Personally Identifiable Information in Born-Digital Archives.”
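To give a sense of how pattern-based PII searching works and why it is inexact, here is a small Python sketch using regular expressions. The patterns are simplified assumptions written for this example (a ###-##-#### shape for social security numbers and 13 to 19 digits for card numbers, filtered with the Luhn checksum); they are not the rules used by Bulk Extractor and will produce both false positives and misses.

```python
import re

# Simplified, illustrative patterns only.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def passes_luhn(digits):
    """Return True if a string of digits satisfies the Luhn checksum."""
    total = 0
    for position, char in enumerate(reversed(digits)):
        value = int(char)
        if position % 2 == 1:  # double every second digit from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def find_candidate_pii(text):
    """Return (kind, matched text) pairs that merit a human look."""
    hits = [("ssn-like", m.group()) for m in SSN_PATTERN.finditer(text)]
    for m in CARD_PATTERN.finditer(text):
        digits = re.sub(r"[ -]", "", m.group())
        if passes_luhn(digits):
            hits.append(("card-like", m.group()))
    return hits
```

Every hit is only a candidate: a part number or phone extension can match these shapes, so results still need review by an archivist.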
The identification of PII is important in guiding the appraisal process. Records that contain personal information must be treated differently than records that do not, and for both ethical and legal reasons donors should be made aware of what kinds of PII will eventually be accessioned. Some PII might be too sensitive to accession when compared to its worth to the archives. All of these choices depend on knowing what kinds of PII the materials you are appraising contain.
Reviewing for rights
It is important to be aware of whether the material you are working with contains any copyrighted or other kind of legally restricted content, as that will inform your choices when it comes to formally accessioning the material.
Rights-protected material often found on personal hard drives could consist of commercial images, music, or video content, pirated media, or commercial software that a donor may not have the right to transfer to you. There might also be digital content that belongs to the donor but is saved in a proprietary format containing Digital Rights Management (DRM) protections, which cannot (currently) be legally circumvented and accessed by the archives.
There are risks associated with accessioning rights-protected material that must be taken into account during appraisal. For instance, if you go on to accession materials that you are not sure your archives holds the rights to, you may later be asked to take them down, or be legally unable to provide access to them. You should decide whether this material is worth accessioning, given these risks.
Reviewing for duplicates
Like reviewing for PII and rights, identifying duplicate files is useful during the appraisal process so that you can later decide whether to weed these files when you arrange them. As with PII, there is software capable of identifying duplicates, such as Brunnhilde, DROID, and Duplicate File Finder.
You should also keep in mind that duplicates might not always be digital. If you are working with a hybrid fonds, which contains both digital and physical materials, there might be duplicates in multiple formats. For example, an acquisition might contain a document that is in PDF format on a donated hard drive but that was also printed out and is collected with the paper files.
Institutions may wish to define policies or processes for determining when to keep or discard duplicates and how to document these decisions.
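Exact digital duplicates can be found by grouping files that share a checksum. The Python sketch below assumes SHA-256 and uses hypothetical function names; it finds only byte-for-byte copies, so a re-saved or reformatted version of the same document would not be detected.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return groups of relative paths whose contents are identical."""
    root = Path(root)
    by_digest = defaultdict(list)
    for p in sorted(root.rglob("*")):
        if p.is_file():
            by_digest[sha256_of(p)].append(p.relative_to(root).as_posix())
    # Keep only checksums shared by more than one file.
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

The sketch only reports groups and deletes nothing, which leaves the keep-or-discard decision, and its documentation, to institutional policy.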
Reviewing for encrypted files
When working with digital materials you may find some that have been encrypted and cannot be accessed. Some people choose to encrypt entire disk drives, or only certain particularly sensitive files. Without the key or password to unlock encrypted files, it is very difficult to open them. When meeting with donors concerning digital content, it is important to ask whether they will be donating any encrypted files, and if they have the means to unlock those files.
Much like rights-protected files, the presence of encrypted files should impact appraisal decisions. Take into account the value of these files, whether you believe you will eventually be able to access them, and what the relative risks and benefits are of a decision to accession them in the event that you can’t.
Appraisal of large collections
One of the unique challenges when working with digital materials is the sheer size of some of the collections that must be appraised. Archivists may be faced with thousands of disks of material, or years’ worth of emails to review. With such large numbers of files, it can be difficult to identify what is worth keeping and what can be safely weeded, or to locate PII and duplicate files.
It is always important to talk to the donor as a first step for appraisal, but this is especially true in the case of large collections, where even looking through the materials may take a long time. This UCI case study, also linked below, describes the difficulty of appraising a 2.5 terabyte donation of video files, and how useful it was to speak with someone with knowledge of the collection in order to understand the context of why file sizes were so large, and what kind of hardware and software was used to create them. This then informed how the archives approached their weeding, and the question of whether to compress the files.
When dealing with large numbers of files, many archives have created automated processes that are able to run programs such as DROID or ExifTool to extract metadata and identify duplicates on batches of files rather than one at a time. Here is one such example from the Carleton College Archives, and another from the Fisher Rare Book Library. While this requires some coding experience, having an automated script can significantly cut down the time it takes to appraise large collections.
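As a very small example of batch processing, the Python sketch below walks a whole collection and summarizes file counts and total size by file extension, writing the result to a CSV file. It is not the Carleton or Fisher script, and the function names and CSV columns are assumptions for this example; a real workflow would more likely call tools such as DROID or ExifTool for each batch.

```python
import csv
from collections import defaultdict
from pathlib import Path


def summarize_by_extension(root):
    """Return file counts and total bytes for each extension under root."""
    totals = defaultdict(lambda: {"files": 0, "bytes": 0})
    for p in Path(root).rglob("*"):
        if p.is_file():
            ext = p.suffix.lower() or "(none)"
            totals[ext]["files"] += 1
            totals[ext]["bytes"] += p.stat().st_size
    return dict(totals)


def write_summary_csv(root, output_file):
    """Write the summary as CSV, largest total size first."""
    summary = summarize_by_extension(root)
    with open(output_file, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["extension", "files", "bytes"])
        ordered = sorted(
            summary.items(), key=lambda item: item[1]["bytes"], reverse=True
        )
        for ext, row in ordered:
            writer.writerow([ext, row["files"], row["bytes"]])
```

Even a rough summary like this can show where the bulk of a large donation lies, which helps when deciding what to ask the donor about first.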
The large volume of messages is often what makes email difficult to appraise. This case study explores the use of Stanford’s ePADD email ingest and processing software to batch-delete personal emails and screen for social security numbers when weeding institutional records.
Efficient methods for reviewing large numbers of files are still in the process of being tested and developed. For further suggestions, check this document’s linked case studies, many of which have examples of what others have done when appraising large collections.
This is a short case study that describes Texas A&M University-Corpus Christi’s first attempts at gathering email records, and the archivists’ use of ePADD to collect and appraise these records.
A blog post that touches on the appraisal problems posed by hybrid fonds, and how locating duplicates can be difficult when there is a vast amount of both physical and digital material to review.
This paper from the University of California Irvine examines the difficulties associated with the appraisal, accessioning, and arrangement of a 2.5 terabyte born-digital donation containing a number of large video files and thousands of image files.
This blog post recounts a project to accession 15.8 GB worth of emails in .pst and .mbox format, and the problems encountered when trying to scan those emails for PII. The archivist used Bulk Extractor and a tool called FileLocator Pro, along with regular expressions, to search for PII within the files.
A very useful tool for review and appraisal that bundles together a number of functions to give you a good overall picture of the materials in a particular potential transfer: it identifies file formats (using Siegfried) and duplicates, runs checksums and virus scans, and much more, all presented in a useful set of reports for the archivist.
Identifies and extracts certain kinds of information from a disk, such as Personally Identifiable Information. Bulk Extractor is included in the BitCurator environment, which is Linux-based and can be run as a virtual machine. Free software.
Free, lightweight program that outputs a directory and file list to either a plain text file or an Excel document. Windows only.
Directory printing program that comes in both free and paid versions. Exports directory listing to a text file, PDF, XML, or HTML file. Windows only.
Digital Record Object Identification tool, developed and maintained by the UK National Archives. Identifies file formats, versions, file age and size, and can identify duplicate files. Checks file formats against the PRONOM file format registry. Free software. Available on both Windows and Mac.
Software that is able to identify and delete duplicate files located on a disk drive. There is both a paid and a free version; the paid version has additional features. Windows only.
ePADD is a browser-based email appraisal and processing program developed by Stanford Libraries. Free software that can be used on Windows, Mac, and Linux.
Scans for PII, and is specifically designed to work with .pst email files (Outlook). Paid software with a 30-day free trial. Windows only.
Developed by Harvard and JSTOR, JHOVE is a free tool that identifies, validates, and characterizes files. It runs on Java and can be used on Windows, Mac, and Linux.
Free program that creates a text file with a directory listing and associated metadata such as file size, date, time of last modification, and attributes. Windows only.
Free and open-source program that displays the technical metadata for audio and video files (format, profile, commercial name of the format, duration, overall bit rate, etc). Available on Windows, Mac, and Linux.
Both of these are useful references for what a cheap but comprehensive digital processing workstation looks like. The first is specifically oriented towards creating a machine that can run BitCurator.
A short blog post by one of the original creators of JHOVE that lays out what it does and its strengths and weaknesses.
A detailed report by the InterPARES project that explores the practical and conceptual differences associated with the appraisal of digital materials.
From the BitCurator wiki, this guide gives detailed instructions on how to use Bulk Extractor, a component of BitCurator, to find Personally Identifiable Information such as credit card numbers, email addresses, and birth dates within a disk image.
Guides prepared by the University of Minnesota about how to use duplicate detection software such as DROID and Duplicate File Finder.
Validating Social Security Numbers through Regular Expressions
Blog post with instructions on how to create regular expressions that can be used to search for social security numbers.