Identification, Validation, and Characterization are three essential concepts in the long-term preservation of digital files.
- Identification involves determining the specific type and version of file format you are dealing with. Files are bitstreams that are encoded into a standardized format. Examples include: JPG, MP3, TIFF, etc.
- Validation involves checking whether the file conforms to the specification of the format it has been identified as. The file must be both well-formed, meaning that it meets the structural requirements of the format, and valid, meaning that it also conforms to all of the format's other required rules.
- Characterization involves determining the intrinsic technical properties of the file. Information such as aspect ratio, compression level, and sharpness can be extracted from individual files as metadata, for example.
It is important to identify, validate, and characterize files because doing so answers important questions about how best to approach long-term preservation. What kinds of files are you dealing with and what kinds of information are they likely to contain? Do they need to be converted into a different format that can be preserved more reliably or that improves access? What are the risks associated with certain formats?
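As a rough illustration of identification, the sketch below matches the first bytes of a file against a handful of well-known format signatures ("magic numbers"). The signature table and the function names `identify` and `identify_file` are assumptions made for this example. Real identification tools consult much larger signature registries and can also report the format version, which this sketch does not attempt.

```python
# Illustrative signature table only; real tools rely on far larger registries.
SIGNATURES = [
    (b"\xff\xd8\xff", "JPEG"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"%PDF-", "PDF"),
    (b"II*\x00", "TIFF (little-endian)"),
    (b"MM\x00*", "TIFF (big-endian)"),
    (b"ID3", "MP3 (with ID3v2 tag)"),
]


def identify(header: bytes) -> str:
    """Return a format label for the leading bytes of a file, or 'unknown'."""
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return "unknown"


def identify_file(path: str) -> str:
    """Read the first few bytes of a file and identify its format."""
    with open(path, "rb") as f:
        return identify(f.read(16))
```

Note that the file extension plays no part here: a JPEG renamed to `photo.txt` would still be reported as JPEG, which is why identification relies on the bitstream rather than on the filename.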
Fixity is the assurance that a digital file has not been altered or changed in any way. Changes could occur because of aged or failing storage media, issues during transfer, or human error. Unintentional changes to a file are a common risk during migration or transfer, so it is important that fixity checks be conducted as part of these procedures. Common ways of checking and ensuring fixity are calculating checksums and using write-blocking hardware.
Checksums are a method of verifying the integrity of digital files and are commonly used to monitor a file’s fixity over time. Checksums are often called a “digital fingerprint” because they provide a practically unique value for a particular manifestation of a file. If even a small part of that file changes, re-calculating the file’s checksum will produce a different value. This allows for the continuous monitoring of fixity, as well as the identification of duplicate files. Please note that the checksum depends only on the file's contents: altering the filename does not change the checksum.
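The behaviour described above can be sketched with Python's standard `hashlib` module. SHA-256 is chosen here as an assumption, since the text does not name a particular algorithm, and the helper names `sha256_of` and `demo` are invented for this example.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def demo():
    """Return checksums taken before a rename, after it, and after an edit."""
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "report.txt")
        with open(original, "wb") as f:
            f.write(b"archival content")
        before = sha256_of(original)

        # Renaming changes the filename but not the file's contents.
        renamed = os.path.join(tmp, "renamed.txt")
        os.rename(original, renamed)
        after_rename = sha256_of(renamed)

        # Appending a single byte changes the contents.
        with open(renamed, "ab") as f:
            f.write(b"!")
        after_edit = sha256_of(renamed)
    return before, after_rename, after_edit
```

Calling `demo()` returns three checksums: the first two are identical because only the name changed, while the third differs because one byte was added to the file.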
BagIt-based transfer tools, such as Exactly and Bagger, automatically use checksums to establish fixity. There are also tools, such as Fixity from AVP, which are dedicated specifically to monitoring fixity and eliminating duplication through calculating checksums.
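How checksums can reveal duplicates may be easier to see in a small sketch. The code below is not how Exactly, Bagger, or Fixity work internally, which the text does not describe. It is a minimal illustration, under the assumption of SHA-256 checksums, that groups files by checksum and reports any group containing more than one file. The function names are hypothetical.

```python
import hashlib
import os
from collections import defaultdict


def file_sha256(path):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Group files under root by checksum; keep only groups of two or more."""
    by_checksum = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_checksum[file_sha256(path)].append(path)
    return {
        checksum: sorted(paths)
        for checksum, paths in by_checksum.items()
        if len(paths) > 1
    }
```

Because the grouping key is the checksum rather than the filename, two copies of the same file are detected even when they have different names or sit in different folders.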
A write blocker is a piece of hardware or software that allows access to digital storage media, such as disk drives or USB drives, without permitting any changes to the contents. This is essential for preventing accidental changes during the appraisal and transfer of digital files, as even accessing a file from its original source media could unintentionally change aspects of that file.
Not all kinds of digital storage media require the use of a write blocker. Read-only optical media, such as pressed or finalized CDs and DVDs, cannot be written to by the computer reading them, so they are in effect already write-protected.
Imaging a disk usually involves creating a bit-for-bit copy of the total contents of that disk in order to duplicate it completely. The resulting disk image will be the same size as the disk’s total capacity, even if most of that capacity was unused. Capturing this kind of exact copy can also duplicate recently deleted or hidden files even if they do not show up in the current filesystem.
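The core idea of a bit-for-bit copy can be sketched as a loop that reads every byte from a source and writes it unchanged to an image file, computing a checksum along the way. This is a simplified illustration only: the function name `image_source` is hypothetical, the source is assumed to be readable as a single stream of bytes (for instance a raw device opened behind a write blocker), and dedicated imaging tools additionally handle read errors, damaged sectors, and specialized image formats.

```python
import hashlib


def image_source(source_path, image_path, chunk_size=4 * 1024 * 1024):
    """Copy every byte of source_path to image_path.

    Returns (bytes_copied, sha256_hex) so the image can be fixity-checked later.
    """
    digest = hashlib.sha256()
    copied = 0
    with open(source_path, "rb") as src, open(image_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)       # written exactly as read, no interpretation
            digest.update(chunk)   # checksum covers the whole byte stream
            copied += len(chunk)
    return copied, digest.hexdigest()
```

Because the loop never interprets the filesystem, it copies unused space and leftover data from deleted files along with the visible files, which is why the image matches the full capacity of the source.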
Normalization is the process of converting files to one of a small number of non-proprietary, widely used, and preservation-friendly formats during ingest. Normalization standardizes ingested material and allows the archive to spend less time worrying about compatibility with a large number of formats. However, normalization can also alter file sizes and properties. Archives should assess normalization priorities through file format policies and planning.
SIP (Submission Information Package), AIP (Archival Information Package), and DIP (Dissemination Information Package) are the three types of information package laid out by the Open Archival Information System (OAIS) model. The SIP, AIP, and DIP may be separate copies of the transferred material, or may be the same physical data as it moves through the preservation workflow.
- SIP describes a package that is transferred during the ingest process. This may be different from the final preservation package.
- AIP describes a package that is destined for long-term storage. It contains archival copies of files (including originals, and if chosen, normalized copies in preservation-ready formats) alongside metadata and relevant documentation.
- DIP describes a package that is designed to be accessed by the average consumer. It may offer some or all of the contents of an AIP, and in formats more appropriate for access.
Bags are structured data packages for transfer and storage that are created according to the BagIt convention. In a Bag, the original data to be transferred is stored in a folder called ‘/data’, alongside text files that inventory the files in the package with their checksums and record any additional contextual metadata that a user enters to identify the Bag. The advantage of Bags is that they provide a simple method of inventorying files and creating checksums that can be validated at a later time. For example, a user creates a Bag and transfers it over a network, and the receiver then validates that the Bag is complete and unaltered. A variety of tools can create and validate Bags, including Exactly, Bagger, and the Library of Congress Python-based BagIt tool.
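To make the layout concrete, here is a minimal sketch that builds and checks a Bag-like package using only the Python standard library. It deliberately does not use the Library of Congress tool or any other tool named above, and it is not a complete BagIt implementation: it writes only the `bagit.txt` declaration and a SHA-256 payload manifest, omitting optional files such as user-entered metadata. The function names `make_bag` and `validate_bag` are assumptions for this example.

```python
import hashlib
import os
import shutil


def _sha256(path):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def _payload_paths(bag_dir):
    """Yield (relative_path, full_path) for every file under bag_dir/data."""
    for dirpath, _dirs, files in os.walk(os.path.join(bag_dir, "data")):
        for name in files:
            full = os.path.join(dirpath, name)
            yield os.path.relpath(full, bag_dir).replace(os.sep, "/"), full


def make_bag(source_dir, bag_dir):
    """Copy source_dir into bag_dir/data, then write bagit.txt and a manifest."""
    shutil.copytree(source_dir, os.path.join(bag_dir, "data"))
    with open(os.path.join(bag_dir, "bagit.txt"), "w", encoding="utf-8") as f:
        f.write("BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n")
    entries = sorted(_payload_paths(bag_dir))
    manifest = os.path.join(bag_dir, "manifest-sha256.txt")
    with open(manifest, "w", encoding="utf-8") as f:
        for rel, full in entries:
            f.write(f"{_sha256(full)}  {rel}\n")


def validate_bag(bag_dir):
    """Return a list of problems; an empty list means the payload is intact."""
    problems = []
    listed = set()
    manifest = os.path.join(bag_dir, "manifest-sha256.txt")
    with open(manifest, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            expected, rel = line.rstrip("\n").split(None, 1)
            listed.add(rel)
            full = os.path.join(bag_dir, *rel.split("/"))
            if not os.path.isfile(full):
                problems.append(f"missing: {rel}")
            elif _sha256(full) != expected:
                problems.append(f"checksum mismatch: {rel}")
    # Files present in the payload but absent from the manifest are also errors.
    for rel, _full in sorted(_payload_paths(bag_dir)):
        if rel not in listed:
            problems.append(f"not in manifest: {rel}")
    return problems
```

In the scenario described above, the sender would run something like `make_bag` before transfer and the receiver would run something like `validate_bag` afterwards. An empty result indicates that every inventoried file arrived with its contents unchanged.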
This Archivematica wiki page shows the default normalization policies, preservation formats, access formats, and the tools that are used for conversion. Please note that the most up-to-date version of this list can only be found in the Format Policy Registry of a standard Archivematica installation, under “Preservation Planning”.
An overview of fixity, checksums, and the importance of fixity checking as part of ongoing digital preservation.
The Digital Preservation Handbook is a guide to the concepts and activities of digital preservation that is maintained and updated by the Digital Preservation Coalition. It provides a good overview of the concepts involved in digital preservation as well as institutional strategies, business cases, technical suggestions, and more.
This DPC report provides an in-depth but readable explanation of the OAIS Reference Model on which Archivematica is based.