Helpful resources, including:
- Reading material
- Digital storage media identification and information
- File format identification and information
- Digital preservation tools
- Scripting resources
- Useful command line scripts
- Preservation of computer-aided design
- General resources
- Lavoie, Brian. "The Open Archival Information System (OAIS) Reference Model: Introductory Guide (2nd Edition)," DPC Technology Watch Report, 2014.
- Trace, Ciaran. "Beyond the Magic to the Mechanism: Computers, Materiality, and What it Means for Records to Be 'Born Digital'," Archivaria 72, 2011.
- Caplan, Priscilla. "Understanding PREMIS," Library of Congress Network Development and MARC Standards Office, revised 2017.
- Callahan, Maureen. "The Value of Archival Description Considered," Chaos->Order, 2014: Not strictly a digital preservation read, but something that anyone arranging and describing archives should read and occasionally revisit.
- Ford, Paul. "What is code?", Bloomberg Businessweek, June 11, 2015: Quite possibly the best (and most entertaining) piece on the nature of code/software ever written.
- Siegfried: Have a file with a missing or suspect file extension? Drag the file onto Siegfried's hammer and this site will do its best to recognize the file for you.
- All known extensions: Don't know anything about a file other than its extension? This is a good place to start.
- PRONOM: The most inclusive and detailed file format registry. Developed and maintained by the UK National Archives.
- Sustainability of Digital Formats Planning for Library of Congress Collections
- COPTR: The Community Owned digital Preservation Tool Registry. Includes the helpful Tools Grid, which categorizes digital preservation tools by function.
- BagIt 0.97 Specification: The specifications for the widely used packaging and validation scheme developed by the Library of Congress.
- Bitcurator wiki
- Archivematica documentation
- Preserving optical media from the command line
- Script Ahoy: "Our community resource is intended to provide helpful one-liners and script code specifically drawn from real-life examples in archives and libraries."
- Command Line Crash Course: From Learn Python the Hard Way
-
Batch convert files in a directory and subdirectories using LibreOffice.
cd topDirectory for f in $(find . -type f -name "*EXT"); do libreoffice --convert-to doc $f --outdir $(dirname $f); done
Change "EXT" to the file extension of the original files; note that it is case-sensitive. Change "doc" to the extension that the files should convert to. See "Supported File Formats" for more information.
-
Recursively unzip files into a new folder with the title of the zip file into their current place in the directory, and then delete the original zip file when it's done:
cd topDirectory for F in $(find . -type f -name *.zip); do unzip "$F" -d "${F%.*}/" && rm "$F"; done
(Source)
For the same result with .rar files use:
cd topDirectory for F in $(find . -name "*.rar"); do unrar x "$F" "${F%.*}/" && rm "$F"; done
Note that the find command is case-sensitive, and will have to be changed according to the case of your zip or rar filenames. Both formats must not have spaces in their filename, or else the command will fail. If needed, use Detox prior to extracting files.
-
Identify duplicate directories across a collection:
Trying using direct-dedupe and then follow the Excel instructions below. It was built off of the following commands:
Run the following scripts in the command line individually. (Replace topDirectory with the file path for the highest level directory.)
cd topDirectory/ find . -type d >> /home/bcadmin/Desktop/directories.csv for D in $(find . -type d); do du -sh $D >> /home/bcadmin/Desktop/filesize.csv; done for D in $(find . -type d); do find $D -type f -exec md5sum {} + | awk '{print $1}' | sort | md5sum >> /home/bcadmin/Desktop/checksums.csv; done
This final command may take some time depending on the size of the collection. Together, these commands will create three CSV spreadsheets on the Bitcurator desktop, containing the list of directories, their human-readable size, and their checksums respectively. Move the columns into one spreadsheet, being mindful that the columns will require some light data clean up in order to get them to line up. It may be easier to have the TOPDIRECTORY be a subdirectory and combine all the spreadsheets at the end if you have more than a few thousand directories total.
Once you've ensured that they match up correctly, create headers for each column and insert filters (Data/Filters or Donnees/Filtrer). Highlight the checksum column and do Home/Conditional formatting/Rules for highlighting cells/Duplicates or Accueil/Mise en forme conditionnelle/Regles de mise en surbrillance des cellules/Valeurs en double. This will change the color of all duplicate checksums, allowing you to identify and delete duplicate directories.
-
Identify all files with problematic timestamps in a directory and modify the timestamps:
cd topDirectory find . -type f -newermt "YYYY-MM-DD" ! -newermt "YYYY-MM-DD" -exec touch -t "YYYYMMDDHHMM" {} +
In which, the script will look for files (-type f) in a certain range of dates (-newermt "YYYY-MM-DD" ! -newermt "YYYY-MM-DD") and then it will change those dates to the one specified by "YYYYMMDDHHMM".
Prior to December 2018, it was determined that if the modification was required by an issue in timestamp's interpretation by UNIX time system, the new date should be "197001010000" which correspond to time 0 in UNIX time system. This strategy was reassessed, and the new date should be derived from the content of the archive being processed. Confirm replacement date with Digital Archivist.
-
Detox utility cleans file and directory names by removing spaces and translating/cleaning up Latin-1 (ISO 8859-1) characters encoded in 8-bit ASCII, Unicode characters encoded in UTF-8, and CGI escaped characters.
To do a test run (i.e. see proposed file name changes without actually making the changes):
detox -rn topDirectory
To make the changes:
detox -r topDirectory
-
List and delete empty files and directories.
To list all empty files and directories:
cd [topDirectory] find . -empty
To delete empty files:
find . -type f -empty -delete
To delete empty directories:
find . -type d -empty -delete
-
To find and remove hidden files
cd filepath/to/directory find -type f -name '.*' (#to find all files that begin with .) cd filepath/to/directory find -type f -name '.*' -delete (#to delete all files that begin with .) find -type f -name "*.db" -delete find -type f -name "~*" -delete find -type f -name "*.DS_Store" -delete
-
Print checksum mismatches between checksum.md5 file and objects directory to terminal
cd /path/to/metadata/directory md5deep -rlX checksum.md5 ../objects
(the -X flag displays the hash and filename for each file in the objects directory that does not match the list of known hashes in the checksum.md5 file)
-
Batch remove commas from file names and replace with underscores
cd /path/to/directory for f in $(find . -name "*,*"); do rename -v 's/,/_/' $f; done
Note that this will only change the first comma in every file name. For example, if a file name contains five commas, you will have to run the command five times to replace every comma. *This command will only work if you run detox prior to get rid of all the spaces in the file name. Or else, the command will not find the files.
-
Sync folders
This script copies or syncs a source folder with a destination folder. This is particularly useful if you have a copy of your files on a RAID and on Processing, and would like to update one to reflect processing with out having to copy-and-paste the whole thing. For large archives, this will be much faster as it keeps all files that stay the same.
This script will copy a folder with the same name as the source folder in the destination directory. If that folder title already exists, it will sync the files.
-qam
is for quiet (i.e. error messages only), archival copy, and trims empty directories.--delete
deletes any files at the destination that do not exist at the source.Note that this script overwrites, deletes, and uses sudo meaning that it is very powerful. It's not recommended that you use it without having some rsync experience.
sudo rsync -qam --delete "/PATH/TO/SIPs/" "/PATH/TO/PARENT_OF_SIPs/"
-
Unlock folders
This script unlocks a folder's contents using sudo.
Often when disk imaging, a file may be created in a locked mode. If you need to unlock a folder change the * to the folder path.
sudo chmod 777 *
or use it recursevely, to unlock all subfolders.
sudo chmod -R 777 *
-
Virus scan
This command uses clamAV to scan files for viruses.
clamscan -r /path/to/staging --max-filesize=Xm --max-scansize=Ym > collection.log
X is the largest filesize (in megabytes) you want to scan, and Y is the largest number of megabytes you want to extract from a single compressed file.
-
For adding prefix and suffix for files (directories)
ls | xargs -I {} mv {} PRE_{} ls | xargs -I {} mv {} {}_SUF
- Ball, Alex. "Preserving Computer-Aided Design (CAD)," DPC Technology Watch Report 13-02, April 2013.
- Barrett, Anne. "Born-digital Architectural Records: Defining the Archivable Record," UNC Master's Thesis, 2012.
- Smith, MacKenzie. "Curating Architectural 3D CAD Models," The International Journal of Digital Curation 1, vol. 4, 2008.
- Digital Curation Google group
- DPC Digital Preservation Handbook: Revised handbook by the UK's Digital Preservation Coalition.
- The Signal: Library of Congress digital preservation blog.
- DSHR's blog: Blog of David S. Rosenthal, digital preservation veteran and developer of LOCKSS.
- Bentley Historical Library ArchivesSpace-Archivematica-DSpace Integration project blog: Despite being the project blog of a particular project, this features many good posts on issues universally faced by archivists and institutions dealing with born-digital archives.