Wiki to Text
Adrian Wilke edited this page Feb 10, 2021 · 5 revisions
#!/bin/bash
if [[ $# -ne 2 ]] ; then
  # Print usage to stderr so it does not mix with regular output
  echo 'Please provide: <input directory> <output directory>' >&2
  exit 1
fi
# Remove trailing slash
INDIR=${1%/}
OUTDIR=${2%/}
# Create the output directory if it does not exist
mkdir -p "$OUTDIR"
for FILEPATH in "$INDIR"/*
do
  # Skip anything that is not a regular file (e.g. subdirectories)
  [[ -f $FILEPATH ]] || continue
  # File name without directory
  FILE="$(basename -- "$FILEPATH")"
  # Convert from wiki markup to plain text
  pandoc -f mediawiki -t plain -o "$OUTDIR/$FILE" "$INDIR/$FILE"
  # Remove bracketed markers such as [1]
  sed -i 's/\[[^]]*\]//g' "$OUTDIR/$FILE"
  # Remove empty and whitespace-only lines
  sed -i '/^[[:space:]]*$/d' "$OUTDIR/$FILE"
done
# https://github.com/EML4U/WikimediaDumpExtractor/wiki/Wiki-to-Text
# Data Science Group (DICE) at Paderborn University
# This work has been supported by the German Federal Ministry of Education and Research (BMBF) within the project EML4U under the grant no 01IS19080B.
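The two `sed` cleanup steps of the script can be tried in isolation. The following is a small sketch that applies the same two expressions to text read from standard input. The function name `clean_text` and the sample sentences are made up for illustration and are not part of the original script.

```shell
# Hypothetical helper: applies the script's two sed expressions to stdin.
clean_text() {
  # 1st expression: remove anything in square brackets, e.g. [1]
  # 2nd expression: delete lines that are empty or contain only whitespace
  sed -e 's/\[[^]]*\]//g' -e '/^[[:space:]]*$/d'
}

# Made-up sample input with a marker, blank lines and a bracketed note
printf 'Paderborn is a city.[1]\n\n   \nIt has a university.[citation needed]\n' | clean_text
```

Note that the first expression removes every bracketed span, not only numeric markers, so text such as `[citation needed]` disappears as well. A line that consists only of a marker becomes empty and is then dropped by the second expression.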
- Slow: Extraction of 699,988 text files takes around 15 hours (roughly 13 text files per second)
- Errors: Pandoc 2.11.4 sometimes exits because of parsing errors
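
Both limitations might be softened by converting several files at once and by skipping files that Pandoc cannot parse instead of running the cleanup on missing output. The following is a minimal sketch under stated assumptions, not part of the original script: the function name `convert_all`, the default of 4 parallel jobs, and the failure log `<output directory>.failed.log` are hypothetical choices. It assumes GNU `find`, `xargs` and `sed` (for `-print0`, `-0`, `-P`, `-r` and `-i`).

```shell
# Sketch: convert all files of a directory in parallel and keep going on errors.
# convert_all, the job default and the log location are hypothetical choices.
convert_all() (
  indir=${1%/}
  outdir=${2%/}
  jobs=${3:-4}                       # assumed default number of parallel jobs
  faillog="${outdir}.failed.log"     # assumed location of the failure log

  mkdir -p "$outdir"
  : > "$faillog"                     # start with an empty failure log

  # One "sh -c" per input file; xargs appends the file path as last argument,
  # so inside the inline script: $1 = log, $2 = output directory, $3 = input file.
  find "$indir" -maxdepth 1 -type f -print0 |
    xargs -0 -r -n 1 -P "$jobs" sh -c '
      out="$2/$(basename -- "$3")"
      if pandoc -f mediawiki -t plain -o "$out" "$3"; then
        # Same cleanup as in the script: bracketed markers, then blank lines
        sed -i -e "s/\[[^]]*\]//g" -e "/^[[:space:]]*\$/d" "$out"
      else
        # Pandoc failed: drop any partial output and record the input file
        rm -f "$out"
        echo "$3" >> "$1"
      fi
    ' _ "$faillog" "$outdir"
)
```

It could be called as `convert_all wiki-pages text-pages 8` (directory names are placeholders). Files that Pandoc rejected would then be listed in `text-pages.failed.log` for later inspection. How much time this saves, if any, depends on the number of CPU cores and on disk speed.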