add surface-only segments to Buckeye2hayes.feature #793

Closed
stannam opened this issue Dec 8, 2021 · 6 comments

Comments

@stannam
Member

stannam commented Dec 8, 2021

I noticed that several sounds in the Buckeye Corpus appear only in surface_transcription. PCT's Buckeye2hayes.feature does not contain them, so they cannot be categorized, even now that we can add variant segments to the inventory (re: #792).

image
(NB: leaving 'hh' [h] and 'r' [ɹ] uncategorized is expected and not a bug.)

Presumably, 'awn,' 'ihn' and 'own' are nasalized vowels. They are not in Buckeye2hayes yet because I only added the symbols listed in the main symbol inventory of the Buckeye documentation. The documentation describes a separate group, 'Phones added/relabeled during hand labeling,' and the nasalized vowels belong to it. We need to check all the Buckeye files and add the symbols that occur only at the surface.
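The check described above could be sketched roughly as follows. This is only an illustration, not PCT code: `find_unlisted_symbols` is a hypothetical helper, and the symbol set and transcriptions are made-up stand-ins for the contents of the feature file and the parsed surface transcriptions.

```python
from collections import Counter


def find_unlisted_symbols(transcriptions, known_symbols):
    """Count symbols that occur in the transcriptions but are not in the feature file."""
    known = set(known_symbols)
    missing = Counter()
    for segments in transcriptions:
        for seg in segments:
            if seg not in known:
                missing[seg] += 1
    return missing


# Hypothetical data: stand-ins for the feature-file symbols and parsed surface forms
known = {"aw", "ih", "ow", "n", "k", "t"}
surface = [["k", "awn", "t"], ["ihn"], ["t", "own", "n"], ["k", "awn"]]

unlisted = find_unlisted_symbols(surface, known)
for symbol, count in unlisted.most_common():
    print(symbol, count)
```

Counting occurrences, rather than only collecting a set, makes it easier to tell systematic additions such as nasalized vowels apart from one-off transcription errors.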

@stannam
Member Author

stannam commented Dec 9, 2021

... and perhaps CSJ too

image

Although the corpus is created without errors, the result does not look right. I don't know how CSJ text files are formatted, though, so I might be importing them incorrectly.

CSJ text files and csj2hayes.feature are in the Dropbox folder.

  • Text files are in Phonological_CorpusTools_Public/example_files/CSJ_sample_corrected/CSJ_text_sample
  • The feature file is in Phonological_CorpusTools_Public/TRANS

(related: #665)

@kchall
Member

kchall commented Dec 9, 2021

Re: CSJ -- I don't think this has been read in correctly. You want to make sure that (a) you don't use comma as the default segment delimiter (which PCT often wants to do with these files) and (b) you include the multi-character sequences from "csj_digraphs.txt" (also in the Phonological_CorpusTools_Public/TRANS folder).
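To illustrate why the multi-character list matters, here is a minimal sketch of greedy longest-match segmentation with no delimiter. It is not PCT's parser; `segment` is a hypothetical function, and the sequences in `digraphs` are invented examples rather than the actual contents of csj_digraphs.txt.

```python
def segment(text, multichar):
    """Split running text into segments, preferring the longest known multi-character sequence."""
    # Longest sequences first, so that a longer sequence wins over a shorter prefix of it.
    seqs = sorted({s for s in multichar if s}, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for s in seqs:
            if text.startswith(s, i):
                out.append(s)
                i += len(s)
                break
        else:
            # No multi-character sequence starts here: take a single character.
            out.append(text[i])
            i += 1
    return out


# Hypothetical multi-character sequences, for illustration only
digraphs = ["ky", "sh", "a:"]
segments = segment("kya:sha", digraphs)
print(segments)
```

Without the list, every character would become its own segment, which is one way an import can succeed without errors and still produce the wrong inventory.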

I don't think that there are canonical vs. surface transcriptions in the CSJ -- just one pronunciation tier in the original textgrids, called "Seg," which I think is what is pulled out into the .txt file versions of the corpus. At any rate, the .txt file versions are just running text, not even interlinear glosses, so there shouldn't be issues of pronunciation variants there.

The Buckeye corpus, though, does have this issue.

@stannam
Member Author

stannam commented Dec 14, 2021

Todo:

  • In the Buckeye feature system, attach 'n' to every vowel to form its nasalized counterpart (e.g., 'aw' + 'n' -> 'awn'). Each Vn segment should inherit all feature values from its base vowel except [nasal], which should be [+nasal].
  • Do the same for Buckeye2spe.feature
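The inheritance rule in this todo could be sketched along these lines. This is a rough illustration under assumptions: `add_nasalized_vowels` is a hypothetical helper, the table is a plain dict rather than a parsed .feature file, and the feature names and values are placeholders, not copied from Buckeye2hayes.feature.

```python
def add_nasalized_vowels(features, vowels, suffix="n"):
    """Return a copy of the feature table with a [+nasal] entry added for each vowel."""
    extended = {sym: dict(vals) for sym, vals in features.items()}
    for v in vowels:
        if v not in features:
            continue  # base vowel is not in the table, so there is nothing to inherit
        nasalized = dict(features[v])  # inherit every feature value from the base vowel
        nasalized["nasal"] = "+"       # override only [nasal]
        extended.setdefault(v + suffix, nasalized)  # keep any entry that already exists
    return extended


# Placeholder feature values, for illustration only
features = {
    "aw": {"syllabic": "+", "nasal": "-", "round": "+"},
    "ih": {"syllabic": "+", "nasal": "-", "round": "-"},
    "n": {"syllabic": "-", "nasal": "+", "round": "-"},
}

table = add_nasalized_vowels(features, ["aw", "ih", "er"])
print(sorted(table))
```

A vowel that is absent from the table ('er' here) is skipped rather than raising an error, so the same routine can run over feature systems that cover different symbol sets.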

Unexpected issues:

  • [ɚ̃]. ipa2spe.feature does not contain [ɚ], so we cannot add its nasalized version.
    • If I recall correctly, the SPE book does not discuss [ɚ], so we simply drop it in buckeye2spe.feature. ([ɚ] and [ɚ̃] are in buckeye2hayes, by the way.)
    • This does not raise any errors, but I think we could say something about [ɚ] in the documentation.

stannam added a commit that referenced this issue Dec 17, 2021
symbols like aɪ̃ were added to buckeye2hayes.feature, buckeye2spe.feature, ipa2hayes.feature and ipa2spe.feature
see #793
@stannam
Member Author

stannam commented Dec 17, 2021

It's not just Ṽs.
s0703a contains other surface-only symbols that are not recognized.

It seems I really need to run through all the data, but I'll be putting this off (possibly until after the release).

image

image

@kchall
Member

kchall commented Dec 17, 2021

p. 22 and 23 of the manual have all the things that are 'supposed' to be labeled:
https://www.dropbox.com/s/57bl7ail6h6nvht/Buckeye_Corpus_manual.pdf?dl=0

Anything else I think we can just leave out. And I think that things like the above (IVER) are errors in the transcription -- it should have been labeled as <IVER> and marked as non-speech.

We don't actually distribute the Buckeye corpus with PCT -- just the feature system for the symbols, and so I think it's fine for it to be just the symbols they say are included. People can be in charge of cleaning up their copies of the corpus themselves (or not, and just accepting [n] feature values!).

@stannam
Member Author

stannam commented Jan 11, 2022

I checked and can confirm that PCT covers everything on pages 22-23.
Both errors I flagged above seem to come from mistakes in the transcription files.

  • The IVER in 'which' arose because <IVER> is not in s0703a.words where needed (between lines 610 and 611).
  • As for 'being,' 'e' should be omitted from the surface transcription 'eng'...?

stannam closed this as completed Jan 11, 2022