inventory of corpus with pronunciation variants #792

Open
Tracked by #800
kchall opened this issue Dec 2, 2021 · 7 comments

Comments


kchall commented Dec 2, 2021

The inventory of a corpus is always based on the canonical pronunciations, not the full set of sounds in ANY pronunciation variants. So e.g. if your English canonical pronunciations contain /t/ and /d/, but your variants contain [ɾ], you can't search for [ɾ] or include it in your analyses. We need to allow inventories to be built from:

  1. the canonical forms only ('phonological' inventory)
  2. the variants only ('phonetic' inventory)
  3. a merged inventory of both ('total' inventory)
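The three inventories listed above can be sketched as plain set operations. This is only a minimal illustration, not PCT's implementation: `build_inventories` and the toy transcriptions are hypothetical, and transcriptions are assumed to be simple lists of segment symbols.

```python
from typing import Dict, Iterable, List, Set


def build_inventories(canonical: Iterable[List[str]],
                      variants: Iterable[List[str]]) -> Dict[str, Set[str]]:
    """Collect segment symbols into three inventories (hypothetical helper)."""
    # Symbols that occur in at least one canonical pronunciation.
    phonological = {seg for form in canonical for seg in form}
    # Symbols that occur in at least one pronunciation variant.
    phonetic = {seg for form in variants for seg in form}
    return {
        'phonological': phonological,
        'phonetic': phonetic,
        # Set union merges the two and drops duplicates automatically.
        'total': phonological | phonetic,
    }


# Toy data (assumed, not taken from the example corpora).
canonical_forms = [['ɹ', 'aɪ', 't', 'ɪ', 'ŋ'], ['ɹ', 'aɪ', 'd', 'ɪ', 'ŋ']]
variant_forms = [['ɹ', 'aɪ', 'ɾ', 'ɪ', 'ŋ']]
inventories = build_inventories(canonical_forms, variant_forms)
```

With this toy data, [ɾ] is in the 'phonetic' and 'total' inventories but not in the 'phonological' one, which is the distinction a search or analysis would need to choose between.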

stannam commented Dec 7, 2021

Suggested UI below. (If the corpus has no variants, grey out the 'phonetic' and 'total' tabs.)

image

  1. internally there should be 9 (=3*3) tables, one of which becomes visible according to the tab setting
  2. Question: is 'total' needed? (Or is the 'phonological' / 'phonetic' separation needed at all? If not, the vertical tabs are unnecessary.)
  3. need to revisit all instances where the 'inventory check' is done, and decide whether the phonetic or the phonological inventory should be used.
    • e.g., self.inventory.segs in
      for i in list(dict.fromkeys(kwargs[a.name])):
          if i not in self.inventory.segs:
              reply = QMessageBox.critical(self,
                      'Invalid information',
                      'The transcription can only contain only symbols '
                      'from the corpus\' inventory.'.format(str(a)))
              return
  4. Other issues?
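For item 3, one option is to make the check take the inventory as an argument, so each call site states which inventory it validates against. The sketch below is an assumption about how that could look, not existing PCT code: `invalid_symbols` is a hypothetical helper and the inventories are modelled as plain sets.

```python
from typing import Iterable, List, Set


def invalid_symbols(transcription: Iterable[str], inventory: Set[str]) -> List[str]:
    """Return symbols missing from the given inventory (hypothetical helper).

    Duplicates are dropped and first-seen order is kept, mirroring the
    dict.fromkeys() idiom in the snippet quoted above.
    """
    unique_in_order = dict.fromkeys(transcription)
    return [sym for sym in unique_in_order if sym not in inventory]


# Toy inventories (assumed for illustration only).
phonological = {'ɹ', 'aɪ', 't', 'd', 'ɪ', 'ŋ'}
phonetic = {'ɹ', 'aɪ', 'ɾ', 'ɪ', 'ŋ'}

candidate = ['ɹ', 'aɪ', 'ɾ', 'ɪ', 'ŋ']
rejected_by_phonological = invalid_symbols(candidate, phonological)
rejected_by_phonetic = invalid_symbols(candidate, phonetic)
```

The same transcription is rejected against the phonological inventory (because of [ɾ]) but accepted against the phonetic one, so the choice of inventory at each call site changes what the user is allowed to enter.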

Internally...

We already have inventory categorization functionality, so I think we can apply the same to the variant pronunciations to create the 'phonetic' table. (I'm being optimistic.) As for the 'total' table, could we combine 'phonological' and 'phonetic' and remove duplicates?

(And what is this? Seems like alternative inventories were tried in 2016?)

    def generate_alternative_inventories(self):
        for att in self.alternative_transcriptions:
            altinv = set()
            for word in self:
                transcription = getattr(word, att.name)
                for x in transcription:
                    altinv.add(x)
            self.alternative_inventories[att.name] = Inventory()
            for seg in altinv:
                #for some reason, this loop alters the contents of self.inventory, which it shouldn't
                self.alternative_inventories[att.name].segs[seg] = Segment(seg,
                        self.specifier.specify(seg, assign_defaults=True))

    @property
    def all_inventories(self):
        inventories = dict()
        if hasattr(self, 'lexicon'):
            inventories[self.default_transcription.display_name] = self.lexicon.inventory
        else:
            inventories[self.default_transcription.display_name] = self.inventory
        for name,inv in self.alternative_inventories.items():
            inventories[name] = inv
        return inventories.items()
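The code comment above notes that filling an alternative inventory unexpectedly alters `self.inventory`. Nothing in this thread establishes the cause in PCT. One common way this symptom arises in Python generally is a mutable container declared at class level, so that every instance shares it. The classes below are hypothetical and only demonstrate that general pitfall.

```python
class SharedInventory:
    # Class attribute: a single dict shared by every instance (the pitfall).
    segs = {}


class OwnInventory:
    def __init__(self):
        # Instance attribute: each inventory gets its own dict.
        self.segs = {}


main_shared = SharedInventory()
alt_shared = SharedInventory()
alt_shared.segs['ɾ'] = 'flap'
# The write through alt_shared is visible through main_shared as well.
leaked = 'ɾ' in main_shared.segs

main_own = OwnInventory()
alt_own = OwnInventory()
alt_own.segs['ɾ'] = 'flap'
# With per-instance dicts, the main inventory is unaffected.
isolated = 'ɾ' not in main_own.segs
```

If separate phonetic and phonological inventories are introduced, checking that they do not share underlying containers would rule out this class of bug.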


stannam commented Dec 7, 2021

Also see #560, especially #560 (comment):

Inventory charts in the main window will always display the default transcription's inventory, and only the default inventory's table should be open for editing. In an analysis window, users should be able to see (but not edit) the inventory of alternative transcriptions. This can be accomplished by simply feeding the alternative inventory to the default inventory table's sort function. That is, we sort the alternative inventory based on the default inventory's row/column settings. Some alternative segments (possibly all of them) will end up in the uncategorized tab, but there's not much to be done for that right now. I think this can be a relatively quick fix.
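The idea of sorting an alternative inventory by the default inventory's row/column settings can be sketched as follows. This is an illustration under assumptions, not the actual sort function: the layout is modelled as a hypothetical dict from segment to a (row, column) pair, and `sort_by_default_layout` is an invented name.

```python
from typing import Dict, Iterable, List, Tuple

Cell = Tuple[str, str]  # (row label, column label), assumed representation


def sort_by_default_layout(segments: Iterable[str],
                           layout: Dict[str, Cell]
                           ) -> Tuple[Dict[Cell, List[str]], List[str]]:
    """Place segments into the default chart's cells (hypothetical helper).

    Segments the default layout does not know about are returned
    separately, corresponding to the 'uncategorized' tab.
    """
    chart: Dict[Cell, List[str]] = {}
    uncategorized: List[str] = []
    for seg in sorted(set(segments)):
        cell = layout.get(seg)
        if cell is None:
            uncategorized.append(seg)
        else:
            chart.setdefault(cell, []).append(seg)
    return chart, uncategorized


# Toy layout and alternative inventory (assumed for illustration).
default_layout = {
    't': ('stop', 'alveolar'),
    'd': ('stop', 'alveolar'),
    'ŋ': ('nasal', 'velar'),
}
alternative_segments = ['t', 'ɾ', 'ŋ', 'kʰ']
chart, uncategorized = sort_by_default_layout(alternative_segments, default_layout)
```

Here [ɾ] and [kʰ] land in the uncategorized list because the default layout has no cell for them, which matches the caveat in the quoted comment.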

So it seems intentional that the main inventory chart contains only the default transcription's segments. Should the alternative inventory chart be non-editable but accessible from the analysis functions?

stannam added a commit that referenced this issue Dec 7, 2021
For now, this just adds all variant segments. Eventually, need to have different tables for phonetic, phonological and total inventory (but after the release).
stannam added a commit that referenced this issue Dec 9, 2021

stannam commented Dec 17, 2021

Two .txt files have been added to Dropbox. Find them in Phonological_CorpusTools_Public/example_files/variants/variants in inventory

  • variant_inventory_csv.txt: two distinct columns for (canonical) transcription and (alternative) phonetic tiers.
  • variant_inventory_ilg.txt: three lines for orthography, transcription, and variant forms, respectively. This file can be imported as a 'pronunciation variants' corpus.

stannam referenced this issue Dec 17, 2021
This lets the user know that phonological search does not work with segments that are only found in pronunciation variants.

kchall commented Dec 17, 2021

Our current interim solution does show all symbols (phonetic or phonological) in a 'master' inventory table. Searches based on these symbols return 0, but there is a note to that effect in the search dialogue box.

However, there are two further problems: (1) All analyses have the same issue as searches. For example, if you try to calculate functional load based on minimal pairs for [t] / [ɾ] in the above corpora ('writing' vs. 'riding'), the result is 0. This is likely to be true of ALL analyses. (2) If you pull up the 'corpus summary' inventory and click on a symbol that happens to occur only in phonetic variants (e.g. [ɾ] or [kʰ] in the above corpora), PCT crashes outright with no error message (instead of giving either the actual type / token count or 0).
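For problem (2), the desired behaviour (a real count, or 0 for a symbol that never occurs in canonical forms) can be sketched with `collections.Counter`, which returns 0 for missing keys instead of raising. The cause of the actual crash is not identified in this thread, so this is only a sketch of the expected result. `segment_counts`, the word list, and the frequencies are hypothetical, and the token count is assumed here to mean the summed frequency of the words containing the segment.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def segment_counts(words: Iterable[Tuple[List[str], int]]) -> Tuple[Counter, Counter]:
    """Count segments over canonical forms (hypothetical helper).

    Each word is a (canonical transcription, token frequency) pair.
    """
    type_counts: Counter = Counter()
    token_counts: Counter = Counter()
    for transcription, frequency in words:
        # Count each segment once per word, however often it repeats.
        for seg in set(transcription):
            type_counts[seg] += 1
            token_counts[seg] += frequency
    return type_counts, token_counts


# Toy canonical forms with made-up frequencies.
words = [
    (['ɹ', 'aɪ', 't', 'ɪ', 'ŋ'], 10),  # 'writing'
    (['ɹ', 'aɪ', 'd', 'ɪ', 'ŋ'], 4),   # 'riding'
]
type_counts, token_counts = segment_counts(words)

# A symbol found only in variants is simply absent, so lookup gives 0.
flap_types = type_counts['ɾ']
flap_tokens = token_counts['ɾ']
```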

Given these issues, I actually think we should 'roll back' the commit that added symbols that appear only in pronunciation variants to the total inventory, and simply clarify in the documentation that currently, only canonical pronunciations are used to populate the inventory and hence can be used in searches and analyses.

@stannam maybe we could add instead a note on the 'corpus summary' dialogue box that says "Note that this inventory is based on only the symbols that occur in canonical pronunciations. PCT does not include symbols from pronunciation variants in the inventory, and such symbols cannot currently be directly searched for or used in analyses."

Thanks, and sorry for the hassle! :(

stannam added a commit that referenced this issue Dec 21, 2021
reverting because the changes raise problems. re: #792

kchall commented Jan 12, 2022

Hmm. I see that the corpus summary window is updated with the suggested note. But, it looks like PCT is still pulling in the inventory from pronunciation variants, so it's not quite rolled back. E.g.:

Load 'variant_inventory_ilg' corpus.
The symbol [ɾ] appears only in the pronunciation variants of 'riding' and 'writing,' not in their canonical forms.
Go to Corpus > Summary.
[ɾ] appears in the inventory (and it shouldn't). Clicking on it causes PCT to crash.

If instead you go to Corpus > Phonological search, again, [ɾ] occurs in the inventory; searching for it returns a count of 0.

(And basically the same thing happens in variant_inventory_csv, though of course there the [ɾ] is in the phonetic transcription column, not stored as a pronunciation variant.)


stannam commented Jan 12, 2022

That is strange. On my end, the chart only contains canonical segments in the summary window and other places including analysis functions and Features > Manage inventory chart.

Can you try to load the .txt files again? I used the two files in example_files/variants/variants in inventory.


kchall commented Jan 12, 2022

Ah! Yes...again, silly on my part. I was reloading the existing corpora instead of creating them from scratch. Looks good, thank you.

stannam mentioned this issue Mar 2, 2022
18 tasks