[BUG] UnicodeDecode error when using optional arguments in mfa train #822

SamPassmore · 2024-07-11T00:48:52Z

Debugging checklist

[X] Have you read the troubleshooting page (https://montreal-forced-aligner.readthedocs.io/en/latest/user_guide/troubleshooting.html) and searched the documentation to ensure that your issue is not addressed there?
[X] Have you updated to latest MFA version (check https://montreal-forced-aligner.readthedocs.io/en/latest/changelog/changelog_3.0.html)? What is the output of mfa version?
[X] Have you tried rerunning the command with the --clean flag?

Describe the issue
Running the mfa train command with the --temporary_directory option raises a UnicodeDecodeError:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 37: invalid start byte

The error does not occur when this option is omitted. The --output_directory option does not trigger it.

For Reproducing your issue
The code I use is:

# Produces UnicodeDecodeError
$(ENV)/bin/mfa train \
	--output_directory "$(TRAIN_DIR)" \
	--temporary_directory "$(BASE_DIR)/acoustic_model" \
	--clean \
	--verbose \
	--debug \
	"$(TRAIN_DIR)" \
	"$(PRONDICT_PATH)" \
	"$(BASE_DIR)/$(LC)_acoustic_model.zip"

# Works fine
$(ENV)/bin/mfa train \
	--output_directory "$(TRAIN_DIR)" \
	--clean \
	--verbose \
	--debug \
	"$(TRAIN_DIR)" \
	"$(PRONDICT_PATH)" \
	"$(BASE_DIR)/$(LC)_acoustic_model.zip"

Corpus structure
- What language is the corpus in? This problem occurs in Bislama & Tok Pisin corpora
- How many files/speakers? BIS: 27 speakers, 30 files; TPI: 52 speakers, 55 files
- Are you using lab files or TextGrid files for input? TextGrid
Dictionary
- Are you using a dictionary from MFA? If so, which one? No
- If it's a custom dictionary, what is the phoneset? It is just the roman alphabet as well as 'ng'
Acoustic model
- If you're using an acoustic model, is it one download through MFA? If so, which one? No
- If it's a model you've trained, what data was it trained on? The same data as above

Log file
I searched through all the log files, but they either do not exist or show that the step completed successfully, so I have provided the traceback instead.

Traceback:

/opt/miniconda3/envs/aligner/bin/mfa train \
                        --temporary_directory "/Volumes/PassmoreSSD/PacificCreoles/BIS/acoustic_model" \
                        --clean \
                        --verbose \
                        --debug \
                        "/Volumes/PassmoreSSD/PacificCreoles/BIS/training" \
                        "/Volumes/PassmoreSSD/PacificCreoles/BIS/pronunciation_dictionary.txt" \
                        "/Volumes/PassmoreSSD/PacificCreoles/BIS/BIS_acoustic_model.zip"
 INFO     Setting up corpus information...                                                                                             
 INFO     Loading corpus from source files...                                                                                          
  52% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 52/100  [ 0:00:01 < -:--:-- , ? it/s ]
 INFO     Found 27 speakers across 52 files, average number of utterances per speaker: 271.51851851851853                              
 INFO     Initializing multiprocessing jobs...                                                                                         
 INFO     Normalizing text...                                                                                                          
 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7,331/7,331  [ 0:00:02 < 0:00:00 , 6,049 it/s ]
 INFO     Generating MFCCs...                                                                                                          
 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7,331/7,331  [ 0:00:49 < 0:00:00 , 147 it/s ]
 INFO     Calculating CMVN...                                                                                                          
 INFO     Generating final features...                                                                                                 
 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7,331/7,331  [ 0:00:03 < 0:00:00 , 4,528 it/s ]
 INFO     Creating corpus split...                                                                                                     
 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7,331/7,331  [ 0:00:02 < 0:00:00 , 6,225 it/s ]
 INFO     Filtering utterances for training...                                                                                         
 INFO     Initializing training for monophone...                                                                                       
 INFO     Compiling training graphs...                                                                                                 
 INFO     Generating initial alignments...                                                                                             
 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7,334/7,331  [ 0:00:07 < 0:00:00 , 2,040 it/s ]
 INFO     Initialization complete!                                                                                                     
 INFO     monophone - Iteration 1 of 40                                                                                                
 INFO     Generating alignments...                                                                                                     
 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7,331/7,331  [ 0:01:20 < 0:00:00 , 58 it/s ]
 INFO     Accumulating statistics...                                                                                                   
  68% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸━━━━━━━━━━━━━━━━━━━━━━━━━━ 4,976/7,331  [ 0:00:03 < 0:00:01 , 2,628 it/s ]
 ERROR    There was an error in the run, please see the log.       
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/opt/miniconda3/envs/aligner/bin/mfa", line 10, in <module>
    sys.exit(mfa_cli())
  File "/opt/miniconda3/envs/aligner/lib/python3.9/site-packages/rich_click/rich_command.py", line 367, in __call__
    return super().__call__(*args, **kwargs)
  File "/opt/miniconda3/envs/aligner/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/miniconda3/envs/aligner/lib/python3.9/site-packages/rich_click/rich_command.py", line 152, in main
    rv = self.invoke(ctx)
  File "/opt/miniconda3/envs/aligner/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/miniconda3/envs/aligner/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/miniconda3/envs/aligner/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/miniconda3/envs/aligner/lib/python3.9/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/opt/miniconda3/envs/aligner/lib/python3.9/site-packages/montreal_forced_aligner/command_line/train_acoustic_model.py", line 151, in train_acoustic_model_cli
    trainer.train()
  File "/opt/miniconda3/envs/aligner/lib/python3.9/site-packages/montreal_forced_aligner/acoustic_modeling/trainer.py", line 607, in train
    trainer.train()
  File "/opt/miniconda3/envs/aligner/lib/python3.9/site-packages/montreal_forced_aligner/acoustic_modeling/base.py", line 395, in train
    self.train_iteration()
  File "/opt/miniconda3/envs/aligner/lib/python3.9/site-packages/montreal_forced_aligner/acoustic_modeling/base.py", line 370, in train_iteration
    parse_logs(self.working_log_directory)
  File "/opt/miniconda3/envs/aligner/lib/python3.9/site-packages/montreal_forced_aligner/utils.py", line 364, in parse_logs
    for line in f:
  File "/opt/miniconda3/envs/aligner/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 37: invalid start byte
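
The traceback ends in parse_logs, which reads files from the working log directory as UTF-8. A minimal sketch for locating the file that fails to decode is given below. It is not part of MFA: the function name find_non_utf8_files is hypothetical, and the "*.log" pattern is an assumption about how the log files are named. One unconfirmed possibility on an external volume is that macOS has written binary "._*" sidecar files next to the real logs; this scan would list them along with any other file that is not valid UTF-8.

```python
from pathlib import Path


def find_non_utf8_files(root, pattern="*.log"):
    """Return (path, offset, byte) for each matching file under root that is not valid UTF-8.

    `pattern` defaults to "*.log", which is an assumption about log file naming.
    pathlib's glob also matches names that start with a dot, so hidden files are included.
    """
    problems = []
    for path in sorted(Path(root).rglob(pattern)):
        if not path.is_file():
            continue
        data = path.read_bytes()
        try:
            data.decode("utf-8")
        except UnicodeDecodeError as err:
            # err.start is the index of the first byte that could not be decoded
            problems.append((path, err.start, data[err.start]))
    return problems
```

Calling find_non_utf8_files on the directory passed to --temporary_directory and printing each result (for example as f"{path}: byte 0x{byte:02x} at position {offset}") should show whether any file has byte 0xb0 at position 37, matching the error above.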

Desktop (please complete the following information):

OS: macOS
Version: 13.3.1 (a) (22E772610a) - M1 Processor


SamPassmore added the bug label Jul 11, 2024

SamPassmore assigned mmcauliffe Jul 11, 2024
