Identification, Validation, and Characterization are three essential concepts in the long-term preservation of digital files.
- Identification involves determining the specific type and version of a file's format. Files are bitstreams encoded according to a standardized format; examples include JPEG, MP3, and TIFF.
- Validation involves checking whether a file conforms to the specification of the format it was identified as. A file must be both well-formed (i.e., meeting all the required elements of form, such as its structure) and valid, meaning that it conforms to all of the required rules of the format, such as quality characteristics.
- Characterization involves retrieving metadata related to the intrinsic properties of a file. For example, technical information such as aspect ratio, compression level, and sharpness can be extracted as metadata from individual files.
It is important to identify, validate, and characterize files because doing so answers key questions about how best to approach long-term preservation. What kinds of files are you dealing with and what kinds of information are they likely to contain? Are they of suitable quality for long-term preservation? Do they need to be converted into a different format that can be more reliably preserved or more easily accessed?
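As a rough illustration of the identification step only, the sketch below guesses a format from a file's leading bytes rather than from its extension. The `SIGNATURES` table and the `identify` function are hypothetical names introduced here, the table covers only a handful of well-known signatures, and it does not detect format versions; dedicated identification tools rely on much larger signature registries.

```python
# Illustrative signature table (an assumption for this sketch, far from complete).
SIGNATURES = [
    (b"\xff\xd8\xff", "JPEG"),
    (b"II*\x00", "TIFF (little-endian)"),
    (b"MM\x00*", "TIFF (big-endian)"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"%PDF-", "PDF"),
    (b"ID3", "MP3 (with ID3v2 tag)"),
]


def identify(path):
    """Return a format label based on the file's first bytes, or None if unknown."""
    with open(path, "rb") as f:
        header = f.read(16)
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return None
```

Because the check reads the bitstream itself, a JPEG that has been renamed with a `.txt` extension would still be reported as JPEG.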
Fixity is the assurance that a digital file has not been altered or changed in any way. Establishing fixity is a key method of ensuring the integrity of files. Changes could occur because of aged or failing storage media, an interruption during transfer, or human error. Unintentional changes to a file are a common risk during migration or transfer, so fixity checks should be conducted as part of these procedures. Common ways of checking and ensuring fixity are calculating checksums and using write blockers.
Checksums are a method of verifying the integrity of digital files and are commonly used to monitor a file’s fixity over time. A checksum is often called a “digital fingerprint” because it provides an effectively unique alphanumeric value for a particular manifestation of a file. If even a small part of that file changes, re-calculating the file’s checksum will produce a different value. This allows for the continuous monitoring of fixity, as well as the identification of duplicate files. Please note that a checksum is calculated from a file’s contents, so it changes only if the contents change, not if the filename is altered.
BagIt-based transfer tools, such as Exactly and Bagger, automatically use checksums to establish fixity. There are also tools, such as Fixity from AVP, which are dedicated specifically to monitoring fixity by calculating checksums.
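The checksum behaviour described above can be sketched with Python's standard `hashlib` module. This is a minimal example, not how any of the tools named here are implemented: the choice of SHA-256 is an assumption, and `sha256_of` and `find_duplicates` are hypothetical helper names.

```python
import hashlib
from collections import defaultdict


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 checksum of a file's contents, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths):
    """Group paths whose contents share a checksum; return groups of two or more."""
    groups = defaultdict(list)
    for path in paths:
        groups[sha256_of(path)].append(path)
    return [group for group in groups.values() if len(group) > 1]
```

Recording the value returned by `sha256_of` at ingest and comparing it with a freshly calculated value later is the basic form of a fixity check. Since only the contents are hashed, renaming a file leaves its checksum unchanged.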
A write blocker is a piece of hardware or software that allows access to digital storage media, such as disk drives or USB drives, without permitting any changes to the contents. This is essential for preventing accidental changes during the appraisal and transfer of digital files, as even accessing a file from its original source media could unintentionally change aspects of that file.
Not all kinds of digital storage media require the use of a write blocker. Read-only optical discs, such as CD-ROMs and finalized CD-Rs or DVD-Rs, are already protected against writing.
Imaging a disk usually involves creating a bit-for-bit copy of the total contents of that disk in order to duplicate it completely. The resulting disk image will be the same size as the disk’s total capacity, even if this capacity was mostly unused. Capturing this kind of exact copy can also duplicate recently deleted or hidden files even if they do not show up in the current filesystem. Disk images are commonly created, in combination with write blockers, in digital forensics-based workflows for capturing data from external media.
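The idea of a bit-for-bit copy can be illustrated with the short sketch below, which copies every byte from a source to an image file while calculating a checksum, then re-reads the image to confirm it matches. It is a simplified illustration and not a replacement for dedicated imaging tools: `create_image` and `verify_image` are hypothetical names, the source is assumed to be a path that can be opened for reading (a raw device path would normally require elevated permissions and a write blocker), and SHA-256 is an assumed choice of algorithm.

```python
import hashlib


def create_image(source, image_path, chunk_size=1024 * 1024):
    """Copy every byte of source into image_path; return the SHA-256 of the bytes read."""
    digest = hashlib.sha256()
    with open(source, "rb") as src, open(image_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
            digest.update(chunk)
    return digest.hexdigest()


def verify_image(image_path, expected_checksum, chunk_size=1024 * 1024):
    """Re-read the image and check that its SHA-256 equals the expected value."""
    digest = hashlib.sha256()
    with open(image_path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_checksum
```

Because the copy works on raw bytes rather than on files and folders, unused space is copied along with everything else, which is why an image of this kind is as large as its source.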
Normalization is the process of converting copies of original files to one of a small number of non-proprietary, widely used, and preservation-friendly formats during ingest. Normalization standardizes ingested material into a subset of formats stored by an archives, and allows the archives to avoid managing a large number of formats into the future. However, normalization can also alter file sizes and properties. Archives should assess normalization priorities and approaches by researching and defining file format policies.
SIP (Submission Information Package), AIP (Archival Information Package), and DIP (Dissemination Information Package) are the three classifications of information packages laid out by the Open Archival Information System (OAIS) model. The SIP, AIP, and DIP may be separate copies of the transferred material, or may constitute the same physical data as it moves through the preservation workflow.
- A SIP is the version of the package that is ready to be ingested into the archive. In archival terms, it is the version of the data before it is processed for preservation (as an AIP) and access (as a DIP).
- An AIP describes a package that is intended for storage as the canonical copy of data for long-term preservation. It contains archival copies of files (including originals, and if chosen, normalized copies in preservation-ready formats) alongside metadata and relevant documentation.
- A DIP describes a package that is designed to be accessed by the archives’ user community. It may offer some or all of the contents of an AIP, and in formats more appropriate for access.
Bags are structured data packages for transfer and storage that are created according to the BagIt convention. In a Bag, the original data to be transferred is stored in a folder called ‘/data’. Additional text files inventory the files in the package with their checksums and hold any contextual metadata that a user enters to identify the Bag. The advantage of Bags is that they provide a simple method of inventorying files and creating checksums that can then be validated at a later time, such as when a user creates a Bag, transfers it over a network, and the receiver validates that Bag as complete and unaltered. A variety of tools can create and validate Bags, including Exactly, Bagger, and the Library of Congress Python-based BagIt tool.
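To make the structure of a Bag concrete, the following sketch builds a minimal one using only the Python standard library and then validates it against its manifest. It is a simplified illustration under stated assumptions, not the implementation used by any of the tools named above: `make_bag`, `validate_bag`, and `sha256_of` are hypothetical names, SHA-256 is an assumed choice of algorithm, no `bag-info.txt` metadata file is written, and special characters in filenames are not handled.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Compute the SHA-256 checksum of a file's contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_bag(source_dir, bag_dir):
    """Copy source_dir into bag_dir/data, then write bagit.txt and a manifest."""
    bag = Path(bag_dir)
    data = bag / "data"
    shutil.copytree(source_dir, data)  # raises an error if bag_dir/data already exists
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    lines = []
    for f in sorted(p for p in data.rglob("*") if p.is_file()):
        # Manifest paths are relative to the Bag's top-level folder.
        lines.append(f"{sha256_of(f)}  {f.relative_to(bag).as_posix()}\n")
    (bag / "manifest-sha256.txt").write_text("".join(lines), encoding="utf-8")


def validate_bag(bag_dir):
    """Return a list of problems found; an empty list means the Bag matches its manifest."""
    bag = Path(bag_dir)
    problems = []
    listed = set()
    manifest = (bag / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        if not line.strip():
            continue
        expected, rel = line.split(None, 1)
        listed.add(rel)
        target = bag / rel
        if not target.is_file():
            problems.append(f"missing: {rel}")
        elif sha256_of(target) != expected:
            problems.append(f"checksum mismatch: {rel}")
    for f in sorted((bag / "data").rglob("*")):
        rel = f.relative_to(bag).as_posix()
        if f.is_file() and rel not in listed:
            problems.append(f"not in manifest: {rel}")
    return problems
```

In the transfer scenario described above, the sender would call `make_bag` before sending, and the receiver would call `validate_bag` on arrival and treat any reported problem as a sign that the Bag is incomplete or altered.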
Archivematica’s wiki page showing its default normalization policies, preservation formats, access formats, and the tools that are used for conversion. Please note that the most up-to-date version of this list can only be found in the Format Policy Registry of a standard Archivematica installation, under “Preservation Planning”. For example, rules for normalizing video changed as of Archivematica version 1.11.
An overview of fixity, checksums, and the importance of fixity checking as part of ongoing digital preservation.
The Digital Preservation Handbook is a guide to the concepts and activities of digital preservation that is maintained and updated by the Digital Preservation Coalition. It provides a good overview of the concepts involved in digital preservation as well as institutional strategies, business cases, technical suggestions, and more.
This DPC report provides an in-depth but readable explanation of the OAIS Reference Model on which Archivematica is based.
A resource run by the Library of Congress that describes various file formats in detail, including their identifiers and sustainability factors, with links to further resources.
This 2013 document contains the Bentley Historical Library’s policies on preservation file formats, as well as the formats they will normalize.