[src,script,egs] the gop_speechocean762 recipe (kaldi-asr#4441)

goodatlas · Feb 4, 2021 · b9890a9
1 parent 6359c90
commit b9890a9
Showing 28 changed files with 1,223 additions and 234 deletions.
diff --git a/.gitignore b/.gitignore
@@ -48,11 +48,15 @@ GSYMS
 # Python compiled bytecode files.
 *.pyc
 
+# Python virtual environment
+venv/
+
 # Make dependencies.
 .depend.mk
 
 # Some weird thing that macOS creates.
 *.dSYM
+.DS_Store
 
 # Windows executable, symbol and some weird files.
 *.exe
@@ -61,6 +65,7 @@ GSYMS
 *.manifest
 /kaldiwin_vs*
 .vscode
+.idea
 
 # /src/
 /src/.short_version

diff --git a/egs/gop/s5/local/make_testcase.sh b/egs/gop/s5/local/make_testcase.sh
diff --git a/egs/gop/s5/run.sh b/egs/gop/s5/run.sh
diff --git a/egs/gop/README.md b/egs/gop_speechocean762/README.md
@@ -94,5 +94,20 @@ We guess the HMM topo of chain model may not fit for GOP.
 
 The nnet3's TDNN (no chain) model performs well in GOP computing, so this recipe uses it.
 
+## The `speechocean762` corpus
+
+This corpus aims to provide a free public dataset for the pronunciation scoring task.
+
+This corpus consists of 5000 English sentences.
+All the speakers are non-native, and their mother tongue is Mandarin.
+Half of the speakers are children and the others are adults.
+Age and gender information is provided.
+
+The scores were made by five experts. To avoid subjective bias, each expert scores independently under the same metric.
+The experts score at three levels: phoneme-level, word-level and sentence-level.
+
+This recipe illustrates automatic phoneme-level scoring.
+
 ## Acknowledgement
-The author of this recipe would like to thank Xingyu Na for his works of model tuning and his helpful suggestions.
+The author of this recipe would like to thank Speechocean for providing the corpus,
+and Xingyu Na for his work on model tuning and his helpful suggestions.
diff --git a/egs/gop_speechocean762/s5/RESULT b/egs/gop_speechocean762/s5/RESULT
@@ -0,0 +1,26 @@
+In the `speechocean762` corpus, the phoneme-level scores are in three levels:
+2: pronunciation is correct
+1: pronunciation is right but has a heavy accent
+0: pronunciation is incorrect or missed
+
+First, we can treat the scoring as a regression task.
+So, MSE (mean squared error) and Corr (cross-correlation) are computed:
+
+MSE: 0.15
+Corr: 0.42
+
+Then we round the continuous predicted scores into [0, 1, 2] to treat the scoring
+as a classification task.
+So, the classification metrics like precision, recall, and f1-score are computed
+and printed by `sklearn.metrics.classification_report`:
+
+
+              precision    recall  f1-score   support
+
+           0       0.46      0.17      0.25      1339
+           1       0.16      0.37      0.22      1828
+           2       0.96      0.93      0.95     44079
+
+    accuracy                           0.89     47246
+   macro avg       0.53      0.49      0.47     47246
+weighted avg       0.92      0.89      0.90     47246
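
Note on RESULT: the two evaluation views described in this file (regression, then classification after rounding) can be sketched in plain Python. This is a minimal illustration, not the recipe's evaluation script. It assumes that "Corr" is the Pearson correlation coefficient and that rounding clips predictions to the range [0, 2]; the function names `mse`, `pearson`, `to_class`, and `precision_recall_f1` are hypothetical. The recipe itself prints the classification numbers with `sklearn.metrics.classification_report`.

```python
import math


def mse(refs, preds):
    """Mean squared error between reference and predicted scores."""
    return sum((r - p) ** 2 for r, p in zip(refs, preds)) / len(refs)


def pearson(refs, preds):
    """Pearson correlation coefficient (assumed meaning of 'Corr')."""
    n = len(refs)
    mean_r = sum(refs) / n
    mean_p = sum(preds) / n
    cov = sum((r - mean_r) * (p - mean_p) for r, p in zip(refs, preds))
    var_r = sum((r - mean_r) ** 2 for r in refs)
    var_p = sum((p - mean_p) ** 2 for p in preds)
    return cov / math.sqrt(var_r * var_p)


def to_class(score, lo=0, hi=2):
    """Round a continuous score to the nearest integer label in [lo, hi]."""
    return int(min(hi, max(lo, math.floor(score + 0.5))))


def precision_recall_f1(refs, hyps, label):
    """Per-class precision, recall, and F1; undefined ratios become 0.0."""
    tp = sum(1 for r, h in zip(refs, hyps) if r == label and h == label)
    n_hyp = sum(1 for h in hyps if h == label)
    n_ref = sum(1 for r in refs if r == label)
    precision = tp / n_hyp if n_hyp else 0.0
    recall = tp / n_ref if n_ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data, made up for illustration only.
refs = [2, 2, 1, 0]
preds = [1.8, 2.3, 0.9, 0.6]
hyps = [to_class(p) for p in preds]
```

Because label 2 dominates the test set (44079 of 47246 phones), accuracy and the weighted average mostly reflect that class; the per-class rows for labels 0 and 1 are the more informative ones.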
diff --git a/egs/gop/s5/cmd.sh → egs/gop_speechocean762/s5/cmd.sh b/egs/gop/s5/cmd.sh → egs/gop_speechocean762/s5/cmd.sh
diff --git a/egs/gop_speechocean762/s5/conf/mfcc_hires.conf b/egs/gop_speechocean762/s5/conf/mfcc_hires.conf
@@ -0,0 +1,10 @@
+# config for high-resolution MFCC features, intended for neural network training
+# Note: we keep all cepstra, so it has the same info as filterbank features,
+# but MFCC is more easily compressible (because less correlated) which is why 
+# we prefer this method.
+--use-energy=false   # use average of log energy, not energy.
+--num-mel-bins=40     # similar to Google's setup.
+--num-ceps=40     # there is no dimensionality reduction.
+--low-freq=20     # low cutoff frequency for mel bins... this is high-bandwidth data, so
+                  # there might be some information at the low end.
+--high-freq=-400 # high cutoff frequency, relative to Nyquist of 8000 (=7600)
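
Note on mfcc_hires.conf: the arithmetic behind `--high-freq=-400` can be written out as a small sketch. A non-positive value is interpreted as an offset from the Nyquist frequency, so with the 16 kHz sampling rate implied by "Nyquist of 8000" the effective cutoff is 7600 Hz. The helper name `effective_high_freq` is hypothetical and is not part of Kaldi.

```python
def effective_high_freq(high_freq, sample_rate):
    """Resolve a mel-bank high cutoff given in the offset-from-Nyquist style.

    A value <= 0 is treated as an offset added to the Nyquist frequency;
    a positive value is taken as an absolute frequency in Hz.
    """
    nyquist = 0.5 * sample_rate
    if high_freq <= 0:
        return nyquist + high_freq
    return high_freq


# 16 kHz audio is an assumption inferred from the comment in the config.
cutoff = effective_high_freq(-400, 16000)
```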
diff --git a/egs/gop_speechocean762/s5/local/check_dependencies.sh b/egs/gop_speechocean762/s5/local/check_dependencies.sh
@@ -0,0 +1,20 @@
+#!/usr/bin/env bash
+
+# Copyright 2015  Johns Hopkins University (Author: Jan Trmal <[email protected]>)
+#           2021  Xiaomi Corporation (Author: Junbo Zhang)
+# Apache 2.0
+
+[ -f ./path.sh ] && . ./path.sh
+
+command -v python3 >&/dev/null \
+  || { echo  >&2 "python3 not found on PATH. You will have to install Python3, preferably >= 3.6"; exit 1; }
+
+for package in kaldi_io sklearn imblearn; do
+  python3 -c "import ${package}" 2> /dev/null
+  if [ $? -ne 0 ] ; then
+    echo >&2 "This recipe needs the package ${package} installed. Exit."
+    exit 1
+  fi
+done
+
+exit  0
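
Note on check_dependencies.sh: the same check can be sketched in Python. This is an illustration under one stated difference: `importlib.util.find_spec` only looks the package up and does not import it, so unlike the `python3 -c "import ..."` loop it would not catch a package that is installed but fails at import time. The function name `missing_packages` is hypothetical.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be located."""
    return [name for name in names
            if importlib.util.find_spec(name) is None]


# The three packages the recipe checks for.
required = ["kaldi_io", "sklearn", "imblearn"]
missing = missing_packages(required)
if missing:
    print("This recipe needs these packages installed:", ", ".join(missing))
```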
diff --git a/egs/gop_speechocean762/s5/local/data_prep.sh b/egs/gop_speechocean762/s5/local/data_prep.sh
@@ -0,0 +1,30 @@
+#!/usr/bin/env bash
+
+# Copyright 2020-2021  Xiaomi Corporation (Author: Junbo Zhang, Yongqing Wang)
+# Apache 2.0
+
+if [ "$#" -ne 2 ]; then
+  echo "Usage: $0 <src-dir> <dst-dir>"
+  echo "e.g.: $0 /home/storage07/zhangjunbo/data/speechocean762/test data/test"
+  exit 1
+fi
+
+src=$1
+dst=$2
+
+[ ! -d $src ] && echo "$0: no such directory $src" && exit 1;
+[ ! -d $src/../WAVE ] && echo "$0: no wav directory" && exit 1;
+
+wavedir=`realpath $src/../WAVE`
+
+[ -d $dst ] || mkdir -p $dst || exit 1;
+
+cp -Rf $src/* $dst/ || exit 1;
+
+sed -i.ori "s#WAVE#${wavedir}#" $dst/wav.scp || exit 1
+
+utils/validate_data_dir.sh --no-feats $dst || exit 1;
+
+echo "$0: successfully prepared data in $dst"
+
+exit 0
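
Note on data_prep.sh: the `sed` call rewrites the relative `WAVE` prefix in `wav.scp` to an absolute path, replacing only the first occurrence on each line. A minimal Python equivalent is sketched below; the function name `absolutize_wav_scp` and the example entry are hypothetical, and the sketch assumes each line has the form `<utt-id> WAVE/...`.

```python
def absolutize_wav_scp(lines, wavedir):
    """Replace the first 'WAVE' on each line with the absolute directory.

    Mirrors sed "s#WAVE#${wavedir}#", which substitutes only the first
    match per line. If an utterance ID itself contained 'WAVE', that
    match would be replaced instead of the path, as with sed.
    """
    return [line.replace("WAVE", wavedir, 1) for line in lines]


# Hypothetical wav.scp entry and corpus location, for illustration only.
example = ["000010011 WAVE/SPEAKER0001/000010011.WAV"]
rewritten = absolutize_wav_scp(example, "/data/speechocean762/WAVE")
```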
diff --git a/egs/gop_speechocean762/s5/local/download_and_untar.sh b/egs/gop_speechocean762/s5/local/download_and_untar.sh
@@ -0,0 +1,86 @@
+#!/usr/bin/env bash
+
+# Copyright      2014  Johns Hopkins University (author: Daniel Povey)
+#           2020-2021  Xiaomi Corporation (Author: Junbo Zhang, Yongqing Wang)
+# Apache 2.0
+
+set -e
+
+remove_archive=false
+if [ "$1" == --remove-archive ]; then
+  remove_archive=true
+  shift
+fi
+
+if [ $# -ne 2 ]; then
+  echo "Usage: $0 [--remove-archive] <url-base> <data-base>"
+  echo "e.g.: $0 www.openslr.org/resources/101 /home/storage07/zhangjunbo/data"
+  echo "With --remove-archive it will remove the archive after successfully un-tarring it."
+  exit 1
+fi
+
+url=$1
+data=$2
+[ -d $data ] || mkdir -p $data
+
+corpus_name=speechocean762
+
+if [ -z "$url" ]; then
+  echo "$0: empty URL base."
+  exit 1;
+fi
+
+if [ -f $data/$corpus_name/.complete ]; then
+  echo "$0: data part $corpus_name was already successfully extracted, nothing to do."
+  exit 0;
+fi
+
+# Check the size of the archive file in bytes
+ref_size=520810923
+if [ -f $data/$corpus_name.tar.gz ]; then
+  size=$(/bin/ls -l $data/$corpus_name.tar.gz | awk '{print $5}')
+  if [ $ref_size != $size ]; then
+    echo "$0: removing existing file $data/$corpus_name.tar.gz because its size in bytes $size"
+    echo "does not equal the expected archive size $ref_size."
+    rm $data/$corpus_name.tar.gz
+  else
+    echo "$data/$corpus_name.tar.gz exists and appears to be complete."
+  fi
+fi
+
+# If you have permission to access Xiaomi's server, you do not need to
+# download it from OpenSLR
+path_on_mi_server=/home/storage06/wangyongqing/share/data/$corpus_name.tar.gz
+if [ -f $path_on_mi_server ]; then
+  cp $path_on_mi_server $data/$corpus_name.tar.gz
+fi
+
+if [ ! -f $data/$corpus_name.tar.gz ]; then
+  if ! which wget >/dev/null; then
+    echo "$0: wget is not installed."
+    exit 1;
+  fi
+  full_url=$url/$corpus_name.tar.gz
+
+  echo "$0: downloading data from $full_url.  This may take some time, please be patient."
+  if ! wget -c --no-check-certificate $full_url -O $data/$corpus_name.tar.gz; then
+    echo "$0: error executing wget $full_url"
+    exit 1;
+  fi
+fi
+
+cd $data
+if ! tar -xvzf $corpus_name.tar.gz; then
+  echo "$0: error un-tarring archive $data/$corpus_name.tar.gz"
+  exit 1;
+fi
+
+touch $corpus_name/.complete
+cd -
+
+echo "$0: Successfully downloaded and un-tarred $data/$corpus_name.tar.gz"
+
+if $remove_archive; then
+  echo "$0: removing $data/$corpus_name.tar.gz file since --remove-archive option was supplied."
+  rm $data/$corpus_name.tar.gz
+fi
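
Note on download_and_untar.sh: the script decides whether an existing archive is complete by comparing its size in bytes with `ref_size`. The sketch below shows that check in Python; the function name `archive_is_complete` is hypothetical. A size match is only a cheap heuristic and does not verify the archive contents.

```python
import os

# Expected archive size in bytes, taken from ref_size in the script.
REF_SIZE = 520810923


def archive_is_complete(path, ref_size=REF_SIZE):
    """Return True if the file exists and its size equals ref_size."""
    return os.path.isfile(path) and os.path.getsize(path) == ref_size
```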
diff --git a/egs/gop_speechocean762/s5/local/feat_to_score_eval.py b/egs/gop_speechocean762/s5/local/feat_to_score_eval.py
@@ -0,0 +1,42 @@
+# Copyright 2021  Xiaomi Corporation (Author: Junbo Zhang, Yongqing Wang)
+# Apache 2.0
+
+# This script does phone-level pronunciation scoring by GOP-based features.
+
+import sys
+import argparse
+import pickle
+import kaldi_io
+from utils import round_score
+
+
+def get_args():
+    parser = argparse.ArgumentParser(
+        description='Phone-level scoring.',
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter)
+    parser.add_argument('model', help='Input the model file')
+    parser.add_argument('feature_scp',
+                        help='Input gop-based feature file, in Kaldi scp')
+    parser.add_argument('output', help='Output the predicted file')
+    sys.stderr.write(' '.join(sys.argv) + "\n")
+    args = parser.parse_args()
+    return args
+
+
+def main():
+    args = get_args()
+
+    with open(args.model, 'rb') as f:
+        model_of = pickle.load(f)
+
+    with open(args.output, 'wt') as f:
+        for ph_key, feat in kaldi_io.read_vec_flt_scp(args.feature_scp):
+            ph = int(feat[0])
+            feat = feat[1:].reshape(1, -1)
+            score = model_of[ph].predict(feat).reshape(1)[0]
+            score = round_score(score, 1)
+            f.write(f'{ph_key}\t{score:.1f}\t{ph}\n')
+
+
+if __name__ == "__main__":
+    main()
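
Note on feat_to_score_eval.py: each feature vector carries the phone ID in its first element, and that ID selects a per-phone regressor from the unpickled `model_of` mapping. The sketch below illustrates the dispatch with stand-in objects so that it runs without Kaldi or scikit-learn. `ConstantModel`, `score_phone`, and this `round_score` are hypothetical: the real `round_score` lives in the recipe's `utils` module, which is not shown in this part of the diff, so the clipping to [0, 2] and the meaning of its second argument are assumptions.

```python
class ConstantModel:
    """Stand-in for a trained per-phone regressor with a predict() method."""

    def __init__(self, value):
        self.value = value

    def predict(self, rows):
        # One prediction per input row, like scikit-learn regressors.
        return [self.value for _ in rows]


def round_score(score, resolution=1, min_val=0, max_val=2):
    """Assumed behaviour: clip to [min_val, max_val], then round to a grid."""
    score = max(min(max_val, score), min_val)
    return round(score / resolution) * resolution


def score_phone(feat, model_of):
    """Split off the phone ID, then score with that phone's own model."""
    ph = int(feat[0])
    row = [feat[1:]]  # a single-row batch
    score = model_of[ph].predict(row)[0]
    return ph, round_score(score, 1)


# Toy models and features, made up for illustration only.
model_of = {3: ConstantModel(1.7), 5: ConstantModel(-0.4)}
ph, score = score_phone([3.0, 0.1, 0.2], model_of)
line = f"utt1.0\t{score:.1f}\t{ph}"
```

A phone ID with no entry in `model_of` raises `KeyError` here, as it would in the script, so the model file and the feature file need to come from the same phone set.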