Additional resources

Helpful resources, including:

Reading material

Digital Preservation 101

On archival description

  • Callahan, Maureen. "The Value of Archival Description Considered," Chaos->Order, 2014: Not strictly a digital preservation read, but something that anyone arranging and describing archives should read and occasionally revisit.

On code/software

  • Ford, Paul. "What is code?", Bloomberg Businessweek, June 11, 2015: Quite possibly the best (and most entertaining) piece on the nature of code/software ever written.

Digital storage media identification and information

File format identification and information

Digital preservation tools

Scripting resources

  • Script Ahoy: "Our community resource is intended to provide helpful one-liners and script code specifically drawn from real-life examples in archives and libraries."
  • Command Line Crash Course: From Learn Python the Hard Way

Useful command line scripts

  • Batch convert files in a directory and subdirectories using LibreOffice.

    cd topDirectory
    for f in $(find . -type f -name "*EXT"); do libreoffice --convert-to doc $f --outdir $(dirname $f); done
    

    Change "EXT" to the file extension of the original files; note that it is case-sensitive. Change "doc" to the extension that the files should convert to. See "Supported File Formats" for more information.

  • Recursively unzip each zip file into a new folder named after the zip file, in the same location as the zip file, and delete the original zip file once extraction succeeds:

    cd topDirectory
    for F in $(find . -type f -name "*.zip"); do unzip "$F" -d "${F%.*}/" && rm "$F"; done
    

    (Source)

    For the same result with .rar files use:

    cd topDirectory
    for F in $(find . -name "*.rar"); do unrar x "$F" "${F%.*}/" && rm "$F"; done
    

    Note that the find command is case-sensitive, so the pattern must match the case of your zip or rar file extensions. Neither command works on paths that contain spaces; such files will fail to extract. If needed, use Detox to clean the file names before extracting.
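
    The case and space limitations noted above can be worked around by letting find pass each path to the shell as a separate argument. The following is a minimal sketch rather than part of the original workflow: preview_extraction is a hypothetical helper name, it only prints which archive would be extracted into which folder, and it assumes a find that supports -iname (GNU and BSD find do).

    ```shell
    # Hypothetical helper: list every .zip/.rar archive under a directory
    # together with the folder it would be extracted into. Nothing is changed.
    preview_extraction() {
        # $1 = top-level directory to search
        # -iname matches the extension regardless of case; -exec ... {} + hands
        # each path over as its own argument, so spaces in names survive.
        find "$1" -type f \( -iname '*.zip' -o -iname '*.rar' \) -exec sh -c '
            for archive do
                # ${archive%.*} strips the extension, giving the target folder
                printf "%s -> %s/\n" "$archive" "${archive%.*}"
            done
        ' sh {} +
    }
    ```

    Run preview_extraction topDirectory and review the list first. Once it looks right, the printf line could be swapped for the unzip or unrar call shown above, keeping the quotes around "$archive".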

  • Identify duplicate directories across a collection:

    Try using direct-dedupe and then follow the Excel instructions below. It was built from the following commands:

    Run the following commands in the command line one at a time. (Replace topDirectory with the file path of the highest-level directory.)

    cd topDirectory/
    find . -type d >> /home/bcadmin/Desktop/directories.csv
    for D in $(find . -type d); do du -sh $D >> /home/bcadmin/Desktop/filesize.csv; done
    for D in $(find . -type d); do find $D -type f -exec md5sum {} + | awk '{print $1}' | sort | md5sum >> /home/bcadmin/Desktop/checksums.csv; done
    

    This final command may take some time depending on the size of the collection. Together, these commands create three CSV files on the BitCurator desktop, containing the list of directories, their human-readable sizes, and their checksums respectively. Move the columns into one spreadsheet, bearing in mind that they will need some light cleanup before the rows line up. If the collection has more than a few thousand directories in total, it may be easier to run the commands on one subdirectory at a time and combine the spreadsheets at the end.

    Once you've ensured that the rows match up correctly, create headers for each column and insert filters (Data/Filters, or Donnees/Filtrer in French-language Excel). Highlight the checksum column and choose Home/Conditional formatting/Rules for highlighting cells/Duplicates (in French: Accueil/Mise en forme conditionnelle/Regles de mise en surbrillance des cellules/Valeurs en double). This changes the color of all duplicate checksums, allowing you to identify and delete duplicate directories.
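
    The three lists can also be produced in a single pass, so that each directory's path, size, and checksum already share a row and no manual alignment is needed. The following is a minimal sketch under a few assumptions: GNU coreutils (du, md5sum) are available, directory names contain no newlines, and tab-separated output is acceptable in place of three CSV files. dir_report and duplicate_dirs are hypothetical helper names.

    ```shell
    # Hypothetical helper: print "path <TAB> size <TAB> checksum" for every
    # directory under $1. The checksum is the MD5 of the sorted MD5s of all
    # files in the directory, as in the commands above.
    dir_report() {
        find "$1" -type d | while IFS= read -r dir; do
            size=$(du -sh "$dir" | cut -f1)
            sum=$(find "$dir" -type f -exec md5sum {} + | cut -d' ' -f1 | sort | md5sum | cut -d' ' -f1)
            printf '%s\t%s\t%s\n' "$dir" "$size" "$sum"
        done
    }

    # Hypothetical helper: read dir_report output on stdin and print only the
    # rows whose checksum (third column) occurs more than once.
    duplicate_dirs() {
        awk -F'\t' '{ rows[NR] = $0; sums[NR] = $3; count[$3]++ }
                    END { for (i = 1; i <= NR; i++) if (count[sums[i]] > 1) print rows[i] }'
    }
    ```

    For example, dir_report topDirectory > report.tsv writes the combined table, and dir_report topDirectory | duplicate_dirs lists candidate duplicates directly. All empty directories share one checksum, and a directory whose only content is a single subdirectory shares that subdirectory's checksum, so review the matches before deleting anything.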

  • Identify all files with problematic timestamps in a directory and modify the timestamps:

    cd topDirectory
    find . -type f -newermt "YYYY-MM-DD" ! -newermt "YYYY-MM-DD" -exec touch -t "YYYYMMDDHHMM" {} +
    

    (Source 1 and Source 2)

    The command looks for files (-type f) modified within a given date range (-newermt "YYYY-MM-DD" ! -newermt "YYYY-MM-DD", i.e. newer than the first date but not newer than the second) and changes their timestamps to the one specified by "YYYYMMDDHHMM".

    Until December 2018, the practice was to use "197001010000", which corresponds to time 0 in the UNIX time system, whenever the change was needed because of a problem in how the UNIX time system interpreted the timestamp. That strategy has since been reassessed: the new date should be derived from the content of the archive being processed. Confirm the replacement date with the Digital Archivist.
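
    Because touch overwrites the original timestamps, it can help to list the affected files before changing anything. The following is a minimal sketch, assuming GNU find (for -newermt and -printf); list_in_range is a hypothetical helper name.

    ```shell
    # Hypothetical helper: show the modification date and path of every file
    # whose timestamp falls inside the given range, without modifying anything.
    list_in_range() {
        # $1 = directory, $2 = start date (YYYY-MM-DD), $3 = end date (YYYY-MM-DD)
        find "$1" -type f -newermt "$2" ! -newermt "$3" -printf '%TY-%Tm-%Td %p\n'
    }
    ```

    For example, list_in_range topDirectory 1979-12-31 1980-01-02 would show the files that a touch command with the same two dates would rewrite.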

  • Clean up file and directory names with Detox. The utility removes spaces and translates or cleans up Latin-1 (ISO 8859-1) characters encoded in 8-bit ASCII, Unicode characters encoded in UTF-8, and CGI-escaped characters.

    To do a test run (i.e. see proposed file name changes without actually making the changes):

    detox -rn topDirectory
    

    To make the changes:

    detox -r topDirectory
    
  • List and delete empty files and directories.

    To list all empty files and directories:

    cd topDirectory
    find . -empty
    

    To delete empty files:

    find . -type f -empty -delete
    

    To delete empty directories:

    find . -type d -empty -delete
    
  • Find and remove hidden files and other unwanted system files:

    cd filepath/to/directory
    find . -type f -name '.*'          # list all files whose names begin with .
    find . -type f -name '.*' -delete  # delete all files whose names begin with .
    
    find -type f -name "*.db" -delete
    
    find -type f -name "~*" -delete
    
    find -type f -name "*.DS_Store" -delete
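
    Since -delete is irreversible, the same patterns can first be run as a listing. The following is a minimal sketch that combines the patterns used above into one search and deletes nothing; list_cleanup_candidates is a hypothetical helper name.

    ```shell
    # Hypothetical helper: list every file that the delete commands above
    # would remove, without removing it.
    list_cleanup_candidates() {
        # $1 = top-level directory
        # '.*' already covers .DS_Store; '*.db' and '~*' are the other patterns.
        find "$1" -type f \( -name '.*' -o -name '*.db' -o -name '~*' \)
    }
    ```

    Review the output of list_cleanup_candidates filepath/to/directory before deleting. Note that the *.db pattern matches any file ending in .db, including genuine database files, not only thumbnail caches.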
    
  • Print checksum mismatches between checksum.md5 file and objects directory to terminal

    cd /path/to/metadata/directory 
    md5deep -rlX checksum.md5 ../objects
    

    (the -X flag displays the hash and filename for each file in the objects directory that does not match the list of known hashes in the checksum.md5 file)
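
    Where md5deep is not installed, a similar check can be made with md5sum from GNU coreutils. This is a swapped-in technique, not the md5deep method above, and only a sketch: it assumes the manifest uses the "hash, two spaces, path" layout that md5sum reads and that the paths in it are relative to the directory holding the manifest. verify_manifest is a hypothetical helper name.

    ```shell
    # Hypothetical helper: re-hash every file listed in a manifest and print
    # only the entries whose checksum no longer matches.
    verify_manifest() {
        # $1 = directory containing the manifest, $2 = manifest file name
        # --quiet suppresses the "OK" lines; the subshell keeps the cd local.
        ( cd "$1" && md5sum --quiet -c "$2" )
    }
    ```

    For example, verify_manifest /path/to/metadata/directory checksum.md5 prints one "FAILED" line per mismatching file and returns a non-zero exit status if any are found. Unlike md5deep -X, this does not report files in the objects directory that are missing from the manifest.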

  • Batch remove commas from file names and replace with underscores

    cd /path/to/directory
    for f in $(find . -name "*,*"); do rename -v 's/,/_/' $f; done
    

    Note that this changes only the first comma in each file name. For example, if a file name contains five commas, you will have to run the command five times to replace them all. The command also works only on names without spaces, so run Detox first to remove the spaces; otherwise the command will not find the files.
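
    Both limitations noted above (one comma per run, no spaces) can be avoided without the rename utility. The following is a minimal sketch using only find, tr, and mv; replace_commas is a hypothetical helper name, and mv -n (do not overwrite an existing file) is assumed to be available, as it is in GNU and BSD mv.

    ```shell
    # Hypothetical helper: replace every comma with an underscore in all file
    # and directory names under $1, in a single run.
    replace_commas() {
        # -depth visits a directory's contents before the directory itself, so
        # renaming a parent never invalidates a path that is still to come.
        find "$1" -depth -name '*,*' -exec sh -c '
            for path do
                dir=$(dirname "$path")
                base=$(basename "$path")
                # tr swaps every comma in the name, not just the first one
                mv -n -- "$path" "$dir/$(printf "%s" "$base" | tr "," "_")"
            done
        ' sh {} +
    }
    ```

    For example, replace_commas /path/to/directory turns "letters, 1990,1991.txt" into "letters_ 1990_1991.txt". Spaces are left in place; only the commas change.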

  • Sync folders

    This command copies or syncs a source folder to a destination folder. This is particularly useful if you have a copy of your files on a RAID and on Processing, and would like to update one to reflect processing without having to copy and paste the whole thing. For large archives this is much faster, because files that have not changed are not copied again.

    The command creates a folder with the same name as the source folder inside the destination directory. If that folder already exists, it syncs the files. For this to work, the source path must not end in a slash: with a trailing slash, rsync copies the folder's contents directly into the destination, and --delete then removes everything else there. -qam stands for quiet (i.e. error messages only), archive mode, and pruning of empty directories. --delete deletes any files at the destination that do not exist at the source.

    Note that this command overwrites, deletes, and uses sudo, which makes it very powerful. Using it without some rsync experience is not recommended.

    sudo rsync -qam --delete "/PATH/TO/SIPs" "/PATH/TO/PARENT_OF_SIPs/"
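
    Given how destructive --delete can be, a dry run is a cheap safeguard. The following is a minimal sketch using rsync's own -n (dry run) and --itemize-changes options; preview_sync is a hypothetical helper name. A source path without a trailing slash makes rsync create or update a folder of that name inside the destination.

    ```shell
    # Hypothetical helper: show what a sync would copy and delete, without
    # touching either folder.
    preview_sync() {
        # $1 = source folder (no trailing slash), $2 = destination parent folder
        # -a archive mode, -n dry run, -m prune empty directories
        rsync -anm --delete --itemize-changes "$1" "$2"
    }
    ```

    For example, preview_sync "/PATH/TO/SIPs" "/PATH/TO/PARENT_OF_SIPs/" lists each file that would be transferred and marks each file that would be removed with "*deleting". Prefix the call with sudo if the real sync needs it to read the files.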
    
  • Unlock folders

    This script unlocks a folder's contents using sudo.
    

    Disk imaging often creates files in a locked state. To unlock a folder, replace the * with the folder path.

    sudo chmod 777 *
    

    Or run it recursively to unlock all subfolders as well:

    sudo chmod -R 777 *
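
    Mode 777 gives every user on the system read, write, and execute access to everything it touches. Where only the owner needs access, a narrower change may be enough. The following is a minimal sketch of that alternative, not the method above; unlock_tree is a hypothetical helper name.

    ```shell
    # Hypothetical helper: give the owner read and write access to a whole
    # folder tree without making every file executable.
    unlock_tree() {
        # $1 = folder to unlock
        # Capital X adds execute permission only to directories (so they can
        # be entered) and to files that are already executable.
        chmod -R u+rwX "$1"
    }
    ```

    For example, unlock_tree /path/to/folder leaves group and other permissions untouched. If the files belong to another user, the chmod call would need sudo, as in the commands above.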
    
  • Virus scan

    This command uses ClamAV to scan files for viruses.

    clamscan -r /path/to/staging --max-filesize=Xm --max-scansize=Ym > collection.log
    

    X is the largest filesize (in megabytes) you want to scan, and Y is the largest number of megabytes you want to extract from a single compressed file.

  • Add a prefix or a suffix to the names of all files and directories in the current directory:

    ls | xargs -I {} mv {} PRE_{}
    ls | xargs -I {} mv {} {}_SUF
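
    The same renaming can be written as a plain shell loop, which copes with spaces and quotes in names. The following is a minimal sketch; add_prefix and add_suffix are hypothetical helper names. As a design choice that differs from the one-liner above, add_suffix places the suffix before a file's extension rather than after it.

    ```shell
    # Hypothetical helper: prepend $2 to every name directly inside $1.
    add_prefix() {
        for path in "$1"/*; do
            [ -e "$path" ] || continue   # skip the unexpanded pattern if $1 is empty
            mv -n -- "$path" "$1/$2$(basename "$path")"
        done
    }

    # Hypothetical helper: append $2 to every name directly inside $1,
    # keeping file extensions at the end (report.txt -> report_SUF.txt).
    add_suffix() {
        for path in "$1"/*; do
            [ -e "$path" ] || continue
            name=$(basename "$path")
            if [ -f "$path" ] && [ "${name%.*}" != "$name" ] && [ -n "${name%.*}" ]; then
                new="${name%.*}$2.${name##*.}"   # file with an extension
            else
                new="$name$2"                    # directory, or no extension
            fi
            mv -n -- "$path" "$1/$new"
        done
    }
    ```

    For example, add_prefix . PRE_ and add_suffix . _SUF act on the current directory. mv -n refuses to overwrite an existing file if two names would collide.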
    

Preservation of computer-aided design

General resources