add surface-only segments to Buckeye2hayes.feature #793
Comments
... and perhaps CSJ too. Although a corpus gets created without errors, the result does not look right. But I don't know how CSJ text files are formatted, so I might be importing them incorrectly. The CSJ text files and csj2hayes.feature are in the Dropbox folder.
(related: #665)
Re: CSJ -- I don't think this has been read in correctly. You want to make sure that (a) you don't use a comma as the segment delimiter (which PCT often picks by default for these files) and (b) you include the multi-character sequences from "csj_digraphs.txt" (also in the Phonological_CorpusTools_Public/TRANS folder). I don't think that there are canonical vs. surface transcriptions in the CSJ -- just one pronunciation tier in the original TextGrids, called "Seg," which I think is what is pulled out into the .txt file versions of the corpus. At any rate, the .txt file versions are just running text, not even interlinear glosses, so there shouldn't be issues of pronunciation variants there. The Buckeye corpus, though, does have this issue.
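In case it helps to see what (b) amounts to, here is a rough sketch of greedy longest-match segmentation with a digraph list. This is not PCT's actual import code: the function name `segment_with_digraphs` and the example digraphs are made up for illustration, and the real sequences would come from "csj_digraphs.txt".

```python
def segment_with_digraphs(text, digraphs):
    """Split a transcription string into segments, preferring the longest
    matching multi-character sequence at each position.

    Hypothetical helper for illustration only; not part of PCT.
    """
    # Try longer sequences first so that a digraph wins over its parts.
    ordered = sorted(set(digraphs), key=len, reverse=True)
    segments = []
    i = 0
    while i < len(text):
        for seq in ordered:
            if seq and text.startswith(seq, i):
                segments.append(seq)
                i += len(seq)
                break
        else:
            # No multi-character sequence matches here: take one character.
            segments.append(text[i])
            i += 1
    return segments


# Made-up digraphs, not the real contents of csj_digraphs.txt.
example = segment_with_digraphs("kyoo", ["ky", "oo"])
print(example)
```

Without the digraph list, the same string would fall apart into four single-character segments, which is the kind of misparse that makes the imported corpus look wrong.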
Todo:
Unexpected issues:
Symbols like aɪ̃ were added to buckeye2hayes.feature, buckeye2spe.feature, ipa2hayes.feature and ipa2spe.feature; see #793.
Pp. 22 and 23 of the manual list all the things that are 'supposed' to be labeled. Anything else I think we can just leave out. And I think that things like the above (IVER) are errors in the transcription -- it should have been labeled as and marked as non-speech. We don't actually distribute the Buckeye corpus with PCT -- just the feature system for the symbols, and so I think it's fine for it to cover just the symbols they say are included. People can be in charge of cleaning up their copies of the corpus themselves (or not, and just accepting [n] feature values!).
I checked and confirm that PCT covers everything on pages 22-23.
I noticed that several sounds in the Buckeye Corpus appear only in surface_transcription. PCT's Buckeye2hayes.feature does not contain them, so they cannot be categorized, even now that we can add variant segments to the inventory (re: #792).
(NB: not categorizing 'hh' [h] and 'r' [ɹ] is expected and not a bug.)
Presumably, 'awn,' 'ihn' and 'own' are nasalized vowels. They are not included in Buckeye2hayes so far because I only added symbols found in the Buckeye documentation. According to that documentation, they are 'Phones added/relabeled during hand labeling,' and nasalized vowels are among them. We need to check all the Buckeye files and add the symbols that are found only at the surface.
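The check could be scripted roughly as follows. This is only a sketch under assumptions: it takes (canonical, surface) transcription pairs that have already been read out of the Buckeye files as space-separated symbol strings, plus the set of symbols currently in the feature file. The function name `surface_only_symbols` and the sample data are hypothetical, and reading the corpus and .feature files is left out.

```python
from collections import Counter


def surface_only_symbols(pairs, known_symbols):
    """Return (surface_only, missing) as dicts of symbol -> count.

    pairs: iterable of (canonical, surface) strings, symbols separated
           by whitespace (an assumption about how the data was parsed).
    known_symbols: set of symbols already in the feature file.
    """
    canonical, surface = Counter(), Counter()
    for canon, surf in pairs:
        canonical.update(canon.split())
        surface.update(surf.split())
    # Symbols that never occur in any canonical transcription.
    surface_only = {s: n for s, n in surface.items() if s not in canonical}
    # Of those, the ones the feature file does not cover yet.
    missing = {s: n for s, n in surface_only.items() if s not in known_symbols}
    return surface_only, missing


# Made-up sample pairs, not taken from the corpus.
sample = [("ow n", "own"), ("ih n", "ihn"), ("hh aw s", "hh aw s")]
known = {"ow", "n", "ih", "hh", "aw", "s"}
surface_only, missing = surface_only_symbols(sample, known)
print(sorted(missing))
```

Keeping the counts makes it easier to tell systematic additions, such as the nasalized vowels, apart from one-off labels that are more likely transcription errors.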