inventory of corpus with pronunciation variants #792

kchall · 2021-12-02T23:44:13Z

The inventory of a corpus is always based on the canonical pronunciations, not the full set of sounds in ANY pronunciation variants. So e.g. if your English canonical pronunciations contain /t/ and /d/, but your variants contain [ɾ], you can't search for [ɾ] or include it in your analyses. We need to allow inventories to be built from:

the canonical forms only ('phonological' inventory)
the variants only ('phonetic' inventory)
a merged inventory of both ('total' inventory)
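
A minimal sketch of the three inventories listed above, using plain sets. The function name `build_inventories` and the dict-based corpus layout are assumptions for illustration only, not PCT's actual data structures, and the transcriptions are toy data.

```python
def build_inventories(canonical, variants):
    """Return 'phonological', 'phonetic', and 'total' segment sets.

    canonical: dict mapping word -> list of segments (canonical form)
    variants:  dict mapping word -> list of variant transcriptions
    (hypothetical layout, not PCT's corpus classes)
    """
    phonological = {seg for trans in canonical.values() for seg in trans}
    phonetic = {seg
                for forms in variants.values()
                for trans in forms
                for seg in trans}
    return {
        'phonological': phonological,
        'phonetic': phonetic,
        'total': phonological | phonetic,  # set union drops duplicates
    }


# Toy data (illustrative only, not taken from the example files)
canonical = {'writing': ['ɹ', 'aɪ', 't', 'ɪ', 'ŋ'],
             'riding': ['ɹ', 'aɪ', 'd', 'ɪ', 'ŋ']}
variants = {'writing': [['ɹ', 'aɪ', 'ɾ', 'ɪ', 'ŋ']],
            'riding': [['ɹ', 'aɪ', 'ɾ', 'ɪ', 'ŋ']]}

inventories = build_inventories(canonical, variants)
```

With this toy data, [ɾ] ends up in the 'phonetic' and 'total' sets but not in the 'phonological' one, which is the situation described above.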

stannam · 2021-12-07T05:21:10Z

Suggested UI below (if there are no variants, grey out the 'phonetic' and 'total' tabs).

Internally there should be 9 (=3*3) tables, one of which becomes visible according to the tab setting.
Question: is 'total' needed? (Or is the 'phonological' / 'phonetic' separation needed? If not, are the vertical tabs unnecessary?)

We need to revisit all instances where the 'inventory check' is done, and decide whether the phonetic or the phonological inventory should be used.

e.g., self.inventory.segs in

CorpusTools/corpustools/gui/corpusgui.py

Lines 324 to 330 in 83bdf7e

    
    for i in list(dict.fromkeys(kwargs[a.name])):
        if i not in self.inventory.segs:
            reply = QMessageBox.critical(self,
                                         'Invalid information',
                                         'The transcription can only contain only symbols '
                                         'from the corpus\' inventory.'.format(str(a)))
            return
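
One way such checks could be made explicit about which inventory they consult is sketched below. The helper `invalid_symbols` and the `inventories` dict are hypothetical names for illustration; this is not code from PCT.

```python
def invalid_symbols(transcription, inventories, which='phonological'):
    """Return the symbols in `transcription` missing from the chosen inventory.

    inventories: dict with 'phonological', 'phonetic' and 'total' sets
                 (hypothetical structure).
    Duplicates are dropped while keeping first-seen order, as the
    dict.fromkeys call in the quoted check does.
    """
    allowed = inventories[which]
    return [seg for seg in dict.fromkeys(transcription) if seg not in allowed]


# Toy inventories (illustrative only)
inventories = {
    'phonological': {'t', 'd'},
    'phonetic': {'ɾ'},
    'total': {'t', 'd', 'ɾ'},
}

bad_phonological = invalid_symbols(['t', 'ɾ', 'ɾ'], inventories)
bad_total = invalid_symbols(['t', 'ɾ', 'ɾ'], inventories, which='total')
```

Each call site would then have to state whether it wants the 'phonological', 'phonetic' or 'total' inventory, rather than implicitly using `self.inventory.segs`.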

Other issues?

Internally...

We already have inventory categorization functionality, so I think we can run the same on the variant pronunciations to create the 'phonetic' table (I'm being optimistic). As for the 'total' table, combine 'phonological' and 'phonetic' and remove duplicates?

(And what is this? Seems like alternative inventories were tried in 2016?)

CorpusTools/corpustools/corpus/classes/lexicon.py

Lines 2759 to 2782 in f1ba665

    
    def generate_alternative_inventories(self):
        for att in self.alternative_transcriptions:
            altinv = set()
            for word in self:
                transcription = getattr(word, att.name)
                for x in transcription:
                    altinv.add(x)
            self.alternative_inventories[att.name] = Inventory()
            for seg in altinv:
                #for some reason, this loop alters the contents of self.inventory, which it shouldn't
                self.alternative_inventories[att.name].segs[seg] = Segment(seg,
                                                                   self.specifier.specify(seg, assign_defaults=True))

    @property
    def all_inventories(self):
        inventories = dict()
        if hasattr(self, 'lexicon'):
            inventories[self.default_transcription.display_name] = self.lexicon.inventory
        else:
            inventories[self.default_transcription.display_name] = self.inventory
        for name,inv in self.alternative_inventories.items():
            inventories[name] = inv
        return inventories.items()

stannam · 2021-12-07T07:37:16Z

Also see #560, especially #560 (comment):

Inventory charts in the main window will always display the default transcription's inventory, and only the default inventory's table should be open for editing. In an analysis window, users should be able to see (but not edit) the inventory of alternative transcriptions. This can be accomplished by simply feeding the alternative inventory to the default inventory table's sort function. That is, we sort the alternative inventory based on the default inventory's row/column settings. Some alternative segments (possibly all of them) will end up in the uncategorized tab, but there's not much to be done for that right now. I think this can be a relatively quick fix.
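
A rough sketch of the idea in that comment: sort an alternative inventory using the default inventory's row/column settings, and send anything that fits no cell to 'uncategorized'. PCT's actual sort function is not shown in this thread, so `sort_into_chart`, `matches`, and the feature-based row/column dicts are all assumptions made for illustration, and the feature values are toy data.

```python
def matches(spec, required):
    """True if a segment's feature spec satisfies every required value."""
    return all(spec.get(feat) == val for feat, val in required.items())


def sort_into_chart(segments, features, rows, columns):
    """Place segments into (row, column) cells defined by feature settings.

    features: dict mapping segment -> dict of feature -> value
    rows, columns: dict mapping label -> dict of required feature values
    Returns (chart, uncategorized).
    """
    chart = {}
    uncategorized = []
    for seg in segments:
        spec = features.get(seg, {})
        row = next((r for r, req in rows.items() if matches(spec, req)), None)
        col = next((c for c, req in columns.items() if matches(spec, req)), None)
        if row is None or col is None:
            uncategorized.append(seg)  # no matching cell in the default chart
        else:
            chart.setdefault((row, col), []).append(seg)
    return chart, uncategorized


# Toy row/column settings and feature values (illustrative only)
rows = {'stop': {'sonorant': '-', 'continuant': '-'}}
columns = {'alveolar': {'coronal': '+', 'anterior': '+'}}
features = {
    't': {'sonorant': '-', 'continuant': '-', 'coronal': '+', 'anterior': '+'},
    'ɾ': {'sonorant': '+', 'continuant': '+', 'coronal': '+', 'anterior': '+'},
}

chart, uncategorized = sort_into_chart(['t', 'ɾ'], features, rows, columns)
```

Here the default chart has no row that fits the toy [ɾ], so it lands in `uncategorized`, mirroring the caveat in the quoted comment.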

Seems like it is intentional that the main inventory chart contains only default segments. The alternative inventory chart should not be editable, but could be accessible in analysis functions?

For now, this just adds all variant segments. Eventually, need to have different tables for phonetic, phonological and total inventory (but after the release).

Also see commit de22852

stannam · 2021-12-17T00:44:46Z

Two txt files have been added to Dropbox. Find them in Phonological_CorpusTools_Public/example_files/variants/variants in inventory

variant_inventory_csv.txt: two distinct columns for (canonical) transcription and (alternative) phonetic tiers.
variant_inventory_ilg.txt: three lines for orthography, transcription, and variant forms, respectively. This file can be imported as a 'pronunciation variants' corpus.

The temporary notes on the PhonoSearch dialog (commit c81409d) let the user know that phonological search does not work with segments that are only found in pronunciation variants.

kchall · 2021-12-17T23:09:19Z

Our current interim solution does show all symbols (phonetic or phonological) in a 'master' inventory table. Searches based on these symbols return 0, but there is a note to that effect in the search dialogue box.

However, there are two further problems: (1) All analyses have the same issue as searches: e.g., if you try to calculate functional load based on minimal pairs for [t] / [ɾ] in the above corpora ('writing' vs. 'riding'), the result is 0. This is likely to be true of ALL analyses. (2) If you pull up the 'corpus summary' inventory and click on a symbol that happens to occur only in phonetic variants (e.g. [ɾ] or [kʰ] in the above corpora), PCT crashes outright with no error message (instead of giving either the actual type / token count or 0).
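
On problem (2), the cause of the crash is not identified in this thread. As a sketch of the "give the actual count or 0" behaviour only, a `collections.Counter` built from canonical forms returns 0 for a symbol it has never seen instead of raising. The function `segment_counts`, the `(transcription, frequency)` layout, and the way type and token counts are defined here are assumptions for illustration, not PCT's implementation.

```python
from collections import Counter


def segment_counts(words):
    """Count segments over canonical forms.

    words: list of (transcription, token_frequency) pairs (hypothetical layout).
    Type counts add 1 per occurrence; token counts add the word's frequency.
    """
    type_counts = Counter()
    token_counts = Counter()
    for transcription, freq in words:
        for seg in transcription:
            type_counts[seg] += 1
            token_counts[seg] += freq
    return type_counts, token_counts


# Toy canonical forms and frequencies (illustrative only)
words = [(['ɹ', 'aɪ', 't', 'ɪ', 'ŋ'], 10),
         (['ɹ', 'aɪ', 'd', 'ɪ', 'ŋ'], 5)]

type_counts, token_counts = segment_counts(words)

# A variant-only symbol is simply absent, so the lookup yields 0
flap_types = type_counts['ɾ']
flap_tokens = token_counts['ɾ']
```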

Given these issues, I actually think we should 'roll back' the commit that added symbols that appear only in pronunciation variants to the total inventory, and simply clarify in the documentation that currently, only canonical pronunciations are used to populate the inventory and hence can be used in searches and analyses.

@stannam maybe we could add instead a note on the 'corpus summary' dialogue box that says "Note that this inventory is based on only the symbols that occur in canonical pronunciations. PCT does not include symbols from pronunciation variants in the inventory, and such symbols cannot currently be directly searched for or used in analyses."

Thanks, and sorry for the hassle! :(

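Rolled back to an inventory with only canonical symbols (commit 7ce8666), reverting because the changes raise problems, and moved the variant symbol notes from the PhonoSearch dialog to the corpus summary (commit 68a12d6). re: #792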

kchall · 2022-01-12T00:28:34Z

Hmm. I see that the corpus summary window is updated with the suggested note. But, it looks like PCT is still pulling in the inventory from pronunciation variants, so it's not quite rolled back. E.g.:

Load 'variant_inventory_ilg' corpus.
The symbol [ɾ] appears only in the pronunciation variants of 'riding' and 'writing,' not in their canonical forms.
Go to Corpus > Summary.
[ɾ] appears in the inventory (and it shouldn't). Clicking on it causes PCT to crash.

If instead you go to Corpus > Phonological search, again, [ɾ] occurs in the inventory; searching for it returns a count of 0.

(And basically the same thing happens in variant_inventory_csv, though of course there the [ɾ] is in the phonetic transcription column, not stored as a pronunciation variant.)

stannam · 2022-01-12T04:44:22Z

That is strange. On my end, the chart only contains canonical segments in the summary window and other places including analysis functions and Features > Manage inventory chart.

Can you try to load the .txt files again? I used the two files in example_files/variants/variants in inventory.

kchall · 2022-01-12T05:36:41Z

Ah! Yes...again, silly on my part. I was reloading the existing corpora instead of creating them from scratch. Looks good, thank you.

stannam added the enhancement label Dec 7, 2021

stannam added a commit that referenced this issue Dec 7, 2021

working on #792 -- merged inv table for now.

de22852

stannam mentioned this issue Dec 8, 2021

add surface-only segments to Buckeye2hayes.feature #793

Closed

stannam added a commit that referenced this issue Dec 9, 2021

continue working on #792

0fbf65c

stannam referenced this issue Dec 17, 2021

Temporary notes on PhonoSearch dialog

c81409d

stannam added a commit that referenced this issue Dec 21, 2021

rolled back to inventory with only canonical symbols

7ce8666

stannam added a commit that referenced this issue Dec 21, 2021

moved variant symbol notes from PS to corpus summary

68a12d6

stannam mentioned this issue Mar 2, 2022

After v.1.5.1 #800

Open

18 tasks
