Unique ids for speeches? #487

Lauler · 2024-04-04T14:14:39Z

Lauler
Apr 4, 2024

I need to keep track of ids on the speech level for my work with riksdagen-corpus. I was hoping you could clarify some things for me:

What tag in the xml functions best as an id for a speech? I'm thinking it's the <note> tags that also have a "speaker" attribute. Is this assumption correct?
Will these ids for speeches change as you update the corpus from one version to another? I.e., do I need to refer to a specific version of riksdagen-corpus when working with the data and referring to specific speeches?

I am assuming your utterance ids will remain stable as long as you use the same OCR segmentation to create the corpus, but that speech segmentations will probably change over time as you improve them.

The reason I'm asking is that I misunderstood the format and erroneously extracted a speech id from the u tags.

ninpnin · 2024-04-04T16:01:07Z

ninpnin
Apr 4, 2024
Maintainer

Hi!

As the speeches do not consist of single XML elements, there will not be an XML ID for a specific speech in the files themselves. As you mention, the <note type="speaker"> IDs will then probably be the most stable identifier of the true underlying speech. Note that even though our intro detection is 99%+ accurate, these IDs are not always available.

Depending on your use case, you might not want to keep the same ID for a speech whose content has changed, in which case you could use the <u> IDs instead.
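To make the two options concrete, here is a minimal sketch that groups <u> IDs under the preceding <note type="speaker"> ID. The XML fragment, all ID values, and the helper name ids_per_speech are invented for illustration; only the element names and the type, who, next, and xml:id attributes come from this thread.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Synthetic fragment: the IDs, names, and text are made up for this example.
SAMPLE = """
<div xmlns="http://www.tei-c.org/ns/1.0">
  <note type="speaker" xml:id="i-intro-1">Herr ANDERSSON:</note>
  <u who="person-a" xml:id="i-u-1" next="i-u-2"><seg>First part.</seg></u>
  <u who="person-a" xml:id="i-u-2"><seg>Second part.</seg></u>
  <note type="speaker" xml:id="i-intro-2">Fru BERG:</note>
  <u who="person-b" xml:id="i-u-3"><seg>A reply.</seg></u>
</div>
"""


def ids_per_speech(div):
    """Map each speaker-introduction ID to the IDs of the <u> elements after it."""
    speeches = {}
    current = None
    for element in div:
        if element.tag == TEI + "note" and element.get("type") == "speaker":
            # A new introduction starts a new speech.
            current = element.get(XML_ID)
            speeches[current] = []
        elif element.tag == TEI + "u" and current is not None:
            # Utterances before any introduction are skipped in this sketch.
            speeches[current].append(element.get(XML_ID))
    return speeches


print(ids_per_speech(ET.fromstring(SAMPLE)))
```

The dictionary keys are the candidate speech-level IDs, and the values are the utterance-level IDs to fall back on if you want an identifier that changes whenever the content does.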

We try to keep the IDs as static as possible. Currently, new IDs are generated deterministically from the following information:

Record ID
Page number
Cumulative text on the page including the paragraph

So even if we use a new OCR engine, the same content on the same page will yield the same IDs. However, if there are some errors in the paragraph segmentation and we fix it, the IDs of some elements may change.
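As a rough illustration of why this holds, here is a sketch of deterministic ID generation from those three inputs. This is not the corpus's actual implementation: the hash function, the "i-" prefix, the separator, the record ID "example-record", and the helper names make_element_id and page_ids are all assumptions made for the example.

```python
import hashlib


def make_element_id(record_id: str, page: int, cumulative_text: str) -> str:
    """Hypothetical helper: hash the three inputs into a short, stable ID."""
    # A separator keeps the fields from running together before hashing.
    key = "\x1f".join([record_id, str(page), cumulative_text])
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return "i-" + digest[:16]


def page_ids(record_id: str, page: int, paragraphs: list[str]) -> list[str]:
    """Assign one ID per paragraph, using the text on the page up to and including it."""
    ids = []
    cumulative = ""
    for paragraph in paragraphs:
        cumulative += paragraph
        ids.append(make_element_id(record_id, page, cumulative))
    return ids


original = page_ids("example-record", 3, ["Herr talman!", "Jag yrkar bifall.", "Tack."])
rerun = page_ids("example-record", 3, ["Herr talman!", "Jag yrkar bifall.", "Tack."])
resegmented = page_ids("example-record", 3, ["Herr talman!", "Jag yrkar bifall. Tack."])

print(original == rerun)        # same content on the same page
print(original == resegmented)  # paragraph segmentation was changed
```

In this sketch the first comparison is true and the second is false: rerunning on identical content reproduces every ID, while merging two paragraphs changes the set of IDs on the page.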

If you want specific examples, you can look into the git history of a file; most of the IDs should have stayed the same for quite some time now.


MansMeg · 2024-04-04T16:10:26Z

MansMeg
Apr 4, 2024
Maintainer

Yes. For now, I think I used the speaker/introduction XML tag's ID as the speaker ID in the R package. I think we should formalize this because we will need a speech_id sooner or later, and the introduction is probably the best way to do this.

The question is whether there is a way to formalize this in the Parla-Clarin schema.


Lauler · 2024-04-09T15:58:02Z

Lauler
Apr 9, 2024
Author

I ended up settling on this to parse speeches:

def speech_iterator(root):
    """
    Convert Parla-Clarin XML to an iterator of concatenated speeches and speaker ids.
    Speech segments are concatenated for utterances until no "next" attribute is found.

    Args:
        root: Parla-Clarin document root, as an lxml tree root.
    """
    us = root.findall(".//{http://www.tei-c.org/ns/1.0}u")
    divs = root.findall(".//{http://www.tei-c.org/ns/1.0}div")
    protocol_id = root.findall(".//{http://www.tei-c.org/ns/1.0}head")[0].text
    docdates = root.findall(".//{http://www.tei-c.org/ns/1.0}docDate")

    if len(us) == 0:
        return None

    dates = [docdate.get("when", None) for docdate in docdates]
    speaker_id = None
    speech_id = None

    for div in divs:
        speech_text = []
        for element in div:
            if (
                element.tag == "{http://www.tei-c.org/ns/1.0}note"
                and element.get("type") == "speaker"
            ):
                # Extract speech_id from note
                speech_id = element.attrib.get("{http://www.w3.org/XML/1998/namespace}id", None)
            elif element.tag == "{http://www.tei-c.org/ns/1.0}u":
                # Collect utterances
                speaker_id = element.attrib.get("who", None)
                text = [t.split() for t in element.itertext()]

                for t in text:
                    speech_text.extend(t)

                if "next" not in element.attrib:
                    # Yield the speech segment if it's the last utterance
                    yield {
                        "text": " ".join(speech_text),
                        "speaker_id": speaker_id,
                        "speech_id": speech_id,
                        "protocol_id": protocol_id,
                        "date": dates,
                    }
                    speech_text = []
                    speaker_id = None
                    speech_id = None


MansMeg · 2024-04-09T16:02:57Z

MansMeg
Apr 9, 2024
Maintainer

I'm reopening this issue, since I think we need a more formal solution, i.e. an actual speech_id.
