add surface-only segments to Buckeye2hayes.feature #793

stannam · 2021-12-08T04:47:59Z

I noticed that several sounds in the Buckeye Corpus appear only in surface_transcription. PCT's Buckeye2hayes.feature does not contain them, so they cannot be categorized, even now that we can add variant segments to the inventory (re: #792).

(NB: not categorizing 'hh' [h] and 'r' [ɹ] is expected and not a bug.)

Presumably, 'awn', 'ihn' and 'own' are nasalized vowels. They are not included in Buckeye2hayes so far because I only added symbols found in the Buckeye documentation. According to that documentation, they fall under 'Phones added/relabeled during hand labeling', which includes nasalized vowels. We need to check all the Buckeye files and add the symbols that are found only at the surface.

stannam · 2021-12-09T04:39:31Z

... and perhaps CSJ too

Although a corpus gets created without errors, this does not look right. But I don't know how CSJ text files are formatted, so I might be importing them wrong.

CSJ text files and csj2hayes.feature are in the dropbox folder.

textfiles are in Phonological_CorpusTools_Public/example_files/CSJ_sample_corrected/CSJ_text_sample
feature file is in Phonological_CorpusTools_Public/TRANS

(related: #665)

kchall · 2021-12-09T19:21:06Z

Re: CSJ -- I don't think this has been read in correctly. You want to make sure that (a) you don't use comma as the default segment delimiter (which PCT often wants to do with these files) and (b) you include the multi-character sequences from "csj_digraphs.txt" (also in the Phonological_CorpusTools_Public/TRANS folder).
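As a rough illustration of point (b), here is a minimal sketch of longest-match segmentation with a list of multi-character sequences. This is not PCT's own parser; the function name `tokenize` and the example digraphs are hypothetical, and the real sequences should be taken from "csj_digraphs.txt". The sketch uses no segment delimiter at all and only skips whitespace.

```python
def tokenize(text, multichar):
    """Split running text into segments, preferring known multi-character sequences.

    `multichar` is a collection of sequences that should count as one segment
    (hypothetical stand-in for the contents of csj_digraphs.txt).
    """
    # Try longer sequences first so that a trigraph wins over a digraph prefix.
    ordered = sorted((s for s in multichar if s), key=len, reverse=True)
    segments = []
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        for seq in ordered:
            if text.startswith(seq, i):
                segments.append(seq)
                i += len(seq)
                break
        else:
            # No multi-character sequence starts here: take a single character.
            segments.append(text[i])
            i += 1
    return segments


# Hypothetical digraphs, for illustration only.
print(tokenize("kyasha", {"ky", "sh"}))
```

Without the digraph list, the same input would be split character by character, which is one way an import can succeed without errors and still look wrong.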

I don't think that there are canonical vs. surface transcriptions in the CSJ -- just one pronunciation tier in the original textgrids, called "Seg," which I think is what is pulled out into the .txt file versions of the corpus. At any rate, the .txt file versions are just running text, not even interlinear glosses, so there shouldn't be issues of pronunciation variants there.

The Buckeye corpus, though, does have this issue.

stannam · 2021-12-14T06:00:28Z

Todo:

- In the Buckeye feature system, attach 'n' to all vowels for the nasal vowel (e.g., 'aw' + 'n' -> 'awn'). As for their feature values, Vn segments should inherit all feature values of the plain vowel except for [nasal], which should be [+nasal].
- Do the same for Buckeye2spe.feature.
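The inheritance rule in this todo can be sketched as follows. This is only an illustration under assumptions: the feature system is represented as a plain dict of dicts rather than a .feature file, the function name `add_nasalized_vowels` is hypothetical, and the feature values shown are abbreviated placeholders, not the real Hayes or SPE values.

```python
def add_nasalized_vowels(features, vowels, nasal_feature="nasal"):
    """Return a copy of `features` with a nasalized 'Vn' entry for each vowel.

    Each new entry inherits every feature value of the plain vowel,
    except `nasal_feature`, which is set to '+'.
    """
    extended = {seg: dict(vals) for seg, vals in features.items()}
    for vowel in vowels:
        nasalized = dict(features[vowel])  # copy, so the plain vowel is untouched
        nasalized[nasal_feature] = "+"
        extended[vowel + "n"] = nasalized
    return extended


# Hypothetical, abbreviated feature values for illustration only.
features = {
    "aw": {"syllabic": "+", "nasal": "-", "round": "+"},
    "ih": {"syllabic": "+", "nasal": "-", "round": "-"},
}
extended = add_nasalized_vowels(features, ["aw", "ih"])
print(sorted(extended))
```

A real implementation would also want to check that a generated 'Vn' symbol does not collide with a symbol that already exists in the feature file.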

Unexpected issues:

[ɚ̃]: ipa2spe.feature does not contain [ɚ], so we cannot add its nasalized version.
- If I recall correctly, the SPE book does not discuss [ɚ], so we simply drop it in buckeye2spe.feature. ([ɚ] and [ɚ̃] are in buckeye2hayes, by the way.)
- This does not raise any errors, but I think we can say something in the documentation about [ɚ].

stannam · 2021-12-17T04:49:25Z

It's not just Ṽs: s0703a contains other surface-only symbols that are not recognized.

It seems like I really need to run through all the data. But I'll be putting this off (possibly until after the release).
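A pass over the data could look roughly like the sketch below. It assumes that surface transcriptions have already been extracted as strings of space-separated symbols and that the symbols in the feature file are available as a set; the function name `unrecognized_symbols` and the sample symbols are hypothetical, and reading the actual Buckeye files is left out.

```python
from collections import Counter


def unrecognized_symbols(transcriptions, known_symbols):
    """Count surface symbols that are missing from the feature system.

    `transcriptions` is an iterable of strings with space-separated symbols
    (an assumption about how the surface tier has been extracted).
    """
    counts = Counter()
    for transcription in transcriptions:
        for symbol in transcription.split():
            if symbol not in known_symbols:
                counts[symbol] += 1
    return counts


# Hypothetical data for illustration only; 'zz' and 'qq' are made-up symbols.
known = {"b", "iy", "ih", "ng"}
surface = ["b iy ih ng", "b iy zz", "zz qq"]
missing = unrecognized_symbols(surface, known)
print(missing.most_common())
```

Counting occurrences rather than just collecting a set makes it easier to tell one-off transcription mistakes apart from symbols that are used systematically.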

kchall · 2021-12-17T05:21:15Z

p. 22 and 23 of the manual have all the things that are 'supposed' to be labeled:
https://www.dropbox.com/s/57bl7ail6h6nvht/Buckeye_Corpus_manual.pdf?dl=0

Anything else I think we can just leave out. And I think that things like the above (IVER) are errors in the transcription -- it should have been labeled as and marked as non-speech.

We don't actually distribute the Buckeye corpus with PCT -- just the feature system for the symbols, and so I think it's fine for it to be just the symbols they say are included. People can be in charge of cleaning up their copies of the corpus themselves (or not, and just accepting [n] feature values!).

stannam · 2022-01-11T04:27:04Z

I checked and can confirm that PCT covers everything on pages 22-23.
Both errors I flagged above seem to stem from mistakes in the transcription.

The IVER in 'which' arose because is not in s0703a.words where needed (between lines 610 and 611).
As for 'being,' 'e' should be omitted from the surface transcription 'eng'...?

stannam added a commit that referenced this issue Dec 17, 2021

nasalized vowels in Buckeye and IPA feature files

b91248a

symbols like aɪ̃ were added to buckeye2hayes.feature, buckeye2spe.feature, ipa2hayes.feature and ipa2spe.feature see #793

stannam closed this as completed Jan 11, 2022
