[WIP] Lhotse/K2 example #45
Conversation
espresso/data/asr_k2_dataset.py
self.tgt_sizes = np.array(
    [
        round(
            cut.supervisions[0].trim(cut.duration).duration / cut.frame_shift
You were previously using "cut.num_samples" if features were not available - it won't work here, as in that case frame_shift will be None.
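One possible way to guard against that, sketched with placeholder structure (the `cuts` variable and the sample-count fallback are assumptions, not code from this PR; `has_features` and `sampling_rate` are standard Lhotse cut attributes):

```python
# Sketch only: fall back to a sample count when the cut has no precomputed
# features, since cut.frame_shift is None in that case.
self.tgt_sizes = np.array(
    [
        round(
            cut.supervisions[0].trim(cut.duration).duration / cut.frame_shift
        )
        if cut.has_features
        else round(
            cut.supervisions[0].trim(cut.duration).duration * cut.sampling_rate
        )
        for cut in cuts
    ]
)
```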
Also, I thought tgt_sizes was used to denote "the number of output tokens" in a different context; here it seems like it's representing "the number of feature frames covered by a supervision".
It seems like your intention here (supervisions[0]) is to support only single-supervision cuts; maybe it makes sense to add an assertion?
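A minimal sketch of such an assertion (the `cuts` name stands for whatever iterable the dataset is built from):

```python
# Sketch: fail early on multi-supervision cuts, since only supervisions[0] is used.
for cut in cuts:
    assert len(cut.supervisions) == 1, (
        f"Cut {cut.id} has {len(cut.supervisions)} supervisions; "
        "this dataset currently supports only single-supervision cuts."
    )
```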
tgt_sizes is used for determining batch sizes, and possibly affects the loss value. I am not sure whether in the future we would add on-the-fly feature extraction in this class if only the recordings were available, and if we did, whether the field frame_shift would be populated. How about making tgt_sizes always the same as src_sizes?
I thought you'd rather want to use the number of tokens in supervision.text - unless I misunderstood the meaning of "target" in this context.
Yes, I have done exactly the same thing as what we did in Kaldi, i.e. making the length distribution of positives and negatives the same for training. This is done in local/data_prep.py in this PR.
@danpovey I need to test whether all the data prep works as expected (additive noise from MUSAN is still to be done). In the meantime, maybe we can start to think about implementing the LF-MMI loss using k2?
Regarding the data prep and normalizing the sizes: I'm concerned that others who pick up the data from Lhotse may not do this and may get bad results. But IDK whether it would be natural to do that within Lhotse. Will comment in a second about implementing the LF-MMI loss in k2.
Regarding implementing LF-MMI in k2:
You need to turn the nnet outputs into a DenseFsaVec. The nnet outputs will be of shape (B, T, F), where B is the batch size, T is the number of frames and F is the number of features. Feature zero will probably correspond to epsilon/blank. [If you do want an epsilon/blank then you should probably call AddEpsilonSelfLoops() on your graphs before calling IntersectDensePruned() / intersect_dense_pruned(), since IntersectDensePruned() does not treat epsilon specially. Caution: AddEpsilonSelfLoops() still needs to be wrapped to Python; I created an issue on k2 for this.]
Anyway, the first step is to construct the DenseFsaVec from your nnet output. DenseFsaVec
supports different-sized supervision segments, and you have the choice here to omit any
padding frames from the frames you construct the DenseFsaVec from. Do git grep dense_fsa
in k2 and you'll find where the code is.
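A rough sketch of that first step with k2's Python wrapper (nnet_output and num_frames are placeholder names, and some k2 versions expect the segments ordered by decreasing length):

```python
import torch
import k2

# nnet_output: (B, T, F) log-probs from the network (placeholder name).
# Each supervision_segments row is [sequence_index, start_frame, num_frames],
# so padding frames can simply be left out of the segment.
supervision_segments = torch.tensor(
    [[b, 0, int(num_frames[b])] for b in range(nnet_output.size(0))],
    dtype=torch.int32,
)
dense_fsa_vec = k2.DenseFsaVec(nnet_output, supervision_segments)
```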
The next step is to construct the denominator graph as an Fsa (this is a k2 python type, although
there is also a C++ typedef of the same name; I refer here to the python type). You can probably create
it without epsilons and then add epsilon self-loops to it. If you want, it can just be, effectively,
the union of 2 graphs, one for the numerator and one for the denominator. I expect you will use your
experience of what does and does not work, here.
The numerator graphs can start off as two Fsas, one for the positive and one for the negative examples.
Look at type Fsa in k2 (at the python level), because it does support being (really) a vector of Fsas.
Currently I don't know of a super efficient way to create the minibatch from the num and den fsas and
(say) a vector of bools, but this is doable; please consult @csukuangfj on this and maybe he can
create something.
The objective function will be like num_score - den_score, where each score comes from one call
to intersect_dense_pruned and then putting the output into get_tot_scores with log_semiring = true,
and summing the output tensor.
Sorry I have to go somewhere, but hopefully you can get started with this info and ask @csukuangfj for help.
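Putting those pieces together, a very rough sketch of what the loss computation could look like (the graph names, the beam values, and repeating the den graph once per supervision are placeholders/assumptions, not settled choices):

```python
import k2

# num_graphs: an FsaVec with one numerator graph per supervision segment;
# den_graph: a single shared denominator graph (both are placeholders here).
num_graphs = k2.arc_sort(k2.add_epsilon_self_loops(num_graphs))
den_graphs = k2.arc_sort(
    k2.add_epsilon_self_loops(
        k2.create_fsa_vec([den_graph] * num_graphs.shape[0])
    )
)

beams = dict(search_beam=20.0, output_beam=8.0,
             min_active_states=30, max_active_states=10000)
num_lats = k2.intersect_dense_pruned(num_graphs, dense_fsa_vec, **beams)
den_lats = k2.intersect_dense_pruned(den_graphs, dense_fsa_vec, **beams)

num_scores = num_lats.get_tot_scores(log_semiring=True, use_double_scores=True)
den_scores = den_lats.get_tot_scores(log_semiring=True, use_double_scores=True)

# The LF-MMI objective is num - den; negate it to get a loss to minimize.
loss = -(num_scores.sum() - den_scores.sum())
```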
AddEpsilonSelfLoops is added by k2-fsa/k2#313
@pzelasko I just drafted a data prep script in the file examples/mobvoihotwords/local/data_prep.py. I just would like to double-check with you whether I did everything correctly and efficiently. Basically I want to augment the original training data with 1.1x/0.9x speed perturbation and with reverberation separately, and then combine them into a single CutSet. I did that by first extracting the augmented features and dumping them to disk separately, and then merging their respective CutSets while modifying their ids (by prefixing) to differentiate utterances derived from the same underlying original one. Also, I don't know whether the way I did the speed perturbation is correct (in terms of both the use of …). Thanks!
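A sketch of the merging step described above (only cut_set_orig appears in the script excerpt below; the other CutSet names are placeholders, and CutSet.modify_ids is assumed to be available in the Lhotse version used, taking the old id and returning the new one):

```python
from itertools import chain
from lhotse import CutSet

# Prefix ids so cuts derived from the same original utterance stay distinguishable,
# then merge everything into one CutSet for training.
prefixed = [
    cs.modify_ids(lambda cut_id, p=prefix: f"{p}_{cut_id}")  # modify_ids: assumed API
    for prefix, cs in [
        ("orig", cut_set_orig),
        ("sp0.9", cut_set_sp09),
        ("sp1.1", cut_set_sp11),
        ("rvb", cut_set_rvb),
    ]
]
cut_set_train = CutSet.from_cuts(chain.from_iterable(prefixed))
```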
Looks great! I left you some comments.
if "train" in partition: | ||
with LilcomFilesWriter(f"{output_dir}/feats_{partition}_orig") as storage: | ||
cut_set_orig = cut_set.compute_and_store_features( | ||
extractor=Mfcc(config=mfcc_hires_config), |
it should be okay to instantiate Mfcc(config=...) once and re-use it for all calls (although it won't make a difference in performance, just code terseness).
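For instance, something along these lines (mirroring the excerpt above; the storage argument is assumed from the surrounding with-block):

```python
# Create the extractor once and pass the same object to every call.
extractor = Mfcc(config=mfcc_hires_config)

with LilcomFilesWriter(f"{output_dir}/feats_{partition}_orig") as storage:
    cut_set_orig = cut_set.compute_and_store_features(
        extractor=extractor,
        storage=storage,
    )
```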
BTW, this is in a very experimental stage, but some time ago I was able to run Lhotse feature extraction distributed on our CLSP grid with these steps (admittedly not tested with data augmentation yet):
If you'd like, you can try it; otherwise I will try it sometime, probably using your recipe, as it'll be a great testing ground for this.
Thanks for the helpful comments! There are still additional data preprocessing steps to be done before feature extraction (additive noise and splitting the recordings). I will try the distributed extraction once they are done.
@pzelasko