-
Notifications
You must be signed in to change notification settings - Fork 310
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add CTC training #3
Conversation
+1 for black, not sure about mypy — it will be very strict about typing and we might end up spending a lot of extra effort to adhere to its strict checks. |
ok, in that case, I can remove mypy. |
I am trying to add the data preparation part for the LibriSpeech recipe and am trying to put everything in Python. |
Are you considering porting |
If we do port it to Python I don't think we need an entire Kaldi-compatible lang directory, we can just keep the things we need for k2. We might have several different possible formats, e.g. a lexicon-based version and a BPE-based version with different information. But I think writing things to files in a directory is a good idea as it makes them easy to inspect. |
egs/librispeech/ASR/prepare.sh
Outdated
mkdir -p data/LibriSpeech | ||
# TODO | ||
|
||
if [ ! -f data/LibriSpeech/train-other-500/.completed ]; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that this check is technically needed; i.e. the download script would figure the completion status by itself
OTOH I am concerned about fixing the location of the dataset in data/LibriSpeech
-- people tend to have corpora downloaded in some standard locations on their own setups, I think this should remain customizable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the download script would figure the completion status by itself
If the zipped files are removed after extraction, the script will download it again. That's why I add this check to avoid invoking the download script.
people tend to have corpora downloaded in some standard locations on their own setups, I think this should remain customizable
Fix the location of the dataset makes the code simpler. Can we let the users create a symlink to its original dataset path, i.e.,
ln -s /path/to/LibriSpeech data/
as mentioned in the script
icefall/egs/librispeech/ASR/prepare.sh
Lines 21 to 25 in 0b19aa0
# If you have pre-downloaded it to /path/to/LibriSpeech, | |
# you can create a symlink to avoid downloading it again: | |
# | |
# ln -sfv /path/to/LibriSpeech data/ | |
# |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, I see, I missed the bit about the symbolic link before. I am OK with that.
If the zipped files are removed after extraction, the script will download it again. That's why I add this check to avoid invoking the download script.
I see... maybe this is the right way then. I'm not sure if there is a straightforward way to address this issue inside of lhotse download
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... I take that back -- I think we can change Lhotse so that the "completed detector" is executed before downloading the files (move this line a few lines up)
I can make this change later for all the recipes. WDYT @csukuangfj ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can make this change later for all the recipes.
That's would be great. In that case, I won't need to add that check here.
egs/librispeech/ASR/prepare.sh
Outdated
# ln -s /path/to/musan data/ | ||
# | ||
if [ ! -e data/musan ]; then | ||
wget https://www.openslr.org/resources/17/musan.tar.gz |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is download_musan()
here: https://github.com/lhotse-speech/lhotse/blob/master/lhotse/recipes/musan.py#L27
and $ lhotse download musan
here: https://github.com/lhotse-speech/lhotse/blob/master/lhotse/bin/modes/recipes/musan.py#L25
Yes, but I am going to implement only a subset of prepare_lang.sh that is needed by snowfall, i.e., it contains If that's ok, then I will go ahead. Otherwise, I will use the current prepare_lang.sh |
output_dir = Path("data/manifests") | ||
num_jobs = min(15, os.cpu_count()) | ||
|
||
librispeech_manifests = prepare_librispeech( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this and the download script can be completely replaced with:
$ lhotse download librispeech --full $CORPUS_DIR
$ lhotse prepare librispeech -j $NUM_JOBS $CORPUS_DIR $MANIFEST_DIR
|
||
|
||
@contextmanager | ||
def get_executor(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This executor bit seems like a good candidate to move to the library-level?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will move it to a new file local/utils.py
That sounds good to me! |
I've ported a subset of Here are some test results. Input lexicon.txt
The following are outputs: lexicon_disambig.txt
phones.txt
words.txt
L.fstL_disambig.fst |
Great!! |
I just found that I cannot think of any benefits of adding @danpovey Should the python version |
You can ignore it. The only time it's really needed is for the optional silence. |
states_needs_self_loops = set() | ||
for arc in arcs: | ||
src, dst, ilable, olable, score = arc | ||
if olable != 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lable -> label
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Fixed.
Here are the decoding results with icefall (using only the encoder part) from the pre-trained model (downloaded from https://huggingface.co/GuoLiyong/snowfall_bpe_model/tree/main/exp-duration-200-feat_batchnorm-bpe-lrfactor5.0-conformer-512-8-noam, as mentioned in k2-fsa/snowfall#227) HLG - no LM rescoring(output beam size is 8) 1-best decoding
n-best decodingFor n=100,
For n=200,
HLG - with LM rescoringWhole lattice rescoring
WERs of different LM scales are:
n-best LM rescoringn = 100
WERs of different LM scales are:
n = 150
|
Great! By the n-best LM rescoring, are you talking about a neural or n-gram LM? |
Nice! Do we have the scripts for training the BPE models from scratch somewhere? |
I am only using the transformer encoder + HLG decoding + 4-gram rescoring. Will integrate the attention decoder for rescoring. |
Yes, the training code is in the pull-request: k2-fsa/snowfall#219 We're porting it to icefall and polishing the training code. |
I'd rather worry about the present than the future. This issue could
easily become a limiting factor in whether people use Lhotse.
…On Thu, Jul 29, 2021 at 12:18 AM Piotr Żelasko ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In .github/workflows/test.yml
<#3 (comment)>:
> + - uses: ***@***.***
+ with:
+ fetch-depth: 0
+
+ - name: Setup Python ${{ matrix.python-version }}
+ uses: ***@***.***
+ with:
+ python-version: ${{ matrix.python-version }}
+
+ - name: Install Python dependencies
+ run: |
+ python3 -m pip install --upgrade pip pytest kaldialign
+ pip install k2==${{ matrix.k2-version }}+cpu.torch${{ matrix.torch }} -f https://k2-fsa.org/nightly/
+
+ # Don't use: pip install lhotse
+ # since it installs a version of PyTorch that is not predictable
(it's not ideal because it downloads all the torchaudio versions from the
newest going down to the right one, but at least it seems to work)
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#3 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAZFLO7WRXGHD34LVZTRIBDT2AUW3ANCNFSM5ANFYHUA>
.
|
The following are the partial results (1best decoding without LM rescoring and without attention decoder)
It seems it is able to reproduce what Liyong has been doing. I think it is safe to merge now. I will fix all the comments in The tensorboard log is available at epoch-0
epoch-1
average of epoch 0 and 1
epoch-2
average of epoch 1 and 2
average of epoch 0, 1 and 2
epoch-3
average of epoch 2 and 3
average of epoch 1, 2 and 3
average of epoch 0, 1, 2 and 3
epoch 4
average of epoch 3 and 4
average of epoch 2, 3 and 4
average of epoch 1, 2, 3 and 4
average of epoch 0, 1, 2, 3 and 4
epoch 5
average of epoch 4 and 5
average of epoch 3, 4 and 5
average of epoch 2, 3, 4 and 5
average of epoch 1, 2, 3, 4 and 5
average of epoch 0, 1, 2, 3, 4 and 5
epoch 6
average of epoch 5 and 6
average of epoch 4, 5 and 6
average of epoch 3, 4, 5 and 6
average of epoch 2, 3, 4, 5 and 6
average of epoch 1, 2, 3, 4, 5 and 6
average of epoch 0, 1, 2, 3, 4, 5 and 6
epoch 7
average of epoch 6 and 7
average of epoch 5, 6 and 7
average of epoch 4, 5, 6 and 7
average of epoch 3, 4, 5, 6 and 7
average of epoch 2, 3, 4, 5, 6 and 7
average of epoch 1, 2, 3, 4, 5, 6 and 7
average of epoch 0, 1, 2, 3, 4, 5, 6 and 7
epoch 8
average of epoch 7 and 8
average of epoch 6, 7, and 8
average of epoch 5, 6, 7, and 8
average of epoch 4, 5, 6, 7, and 8
average of epoch 3, 4, 5, 6, 7, and 8
average of epoch 2, 3, 4, 5, 6, 7, and 8
|
We'll address the remaining comments from @pzelasko later on. |
Great! |
Where should the |
Sure, that sounds OK. |
I agree. Will do it. |
Cool! Nice work. I wondered about the choice of num buckets 1000, what was your motivation?
… Wiadomość napisana przez Fangjun Kuang ***@***.***> w dniu 7/31/21, o godz. 04:03:
.. also I think it might be a good idea, in our data-directory hierarchy, to make a very clear distinction between data that might be written to by local scripts, and data that is simply downloaded from elsewhere. I
I agree. Will do it.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or unsubscribe.
|
The training options are copied from Liyong's work in @glynpu Maybe Liyong has something to say about this. |
There are various code formatting and style issues in snowfall since
it is written by different people with different preferred styles.
This pull request tries to ensure that code styles in icefall are as consistent as possible; all happens automagically with the help of the following tools:
https://github.com/python/mypy