Replies: 4 comments
-
Hi! As the speeches do not consist of single XML elements, there will not be an XML ID for a specific speech in the files themselves. As you mention, the Depending on your use case, you might also not want the same ID for a speech if the content changes, and use the We try to keep the IDs as static as possible. Currently, new IDs are generated deterministically from the following information
So even if we use a new OCR engine, the same content on the same page will yield the same IDs. However, if there are errors in the paragraph segmentation and we fix them, the IDs of some elements may change. If you want specific examples, you can look into the git history of a file; most of the IDs should have stayed the same for quite some time now.
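For illustration only, a deterministic ID scheme of this kind could look like the sketch below. The actual inputs and encoding used by the corpus are not shown in this thread, so the function `make_element_id`, its fields `protocol_id`, `page` and `content`, and the `i-` prefix are all assumptions.

```python
import hashlib


def make_element_id(protocol_id: str, page: int, content: str) -> str:
    """Return a stable ID derived only from its inputs (hypothetical scheme)."""
    # Normalise whitespace so that re-flowed but otherwise identical text
    # produces the same ID.
    normalised = " ".join(content.split())
    key = f"{protocol_id}\n{page}\n{normalised}".encode("utf-8")
    # Truncate the hex digest for readability; the "i-" prefix keeps the
    # value a valid XML name even when the digest starts with a digit.
    return "i-" + hashlib.sha256(key).hexdigest()[:16]


# Hypothetical inputs: the same content on the same page gives the same ID.
first = make_element_id("prot-example", 3, "Herr talman!  Jag vill tacka.")
second = make_element_id("prot-example", 3, "Herr talman! Jag vill tacka.")
changed = make_element_id("prot-example", 3, "Herr talman! Jag vill inte tacka.")
```

With a scheme like this, `first` and `second` are equal because only the whitespace differs, while `changed` gets a new ID, which matches the behaviour described above when a segmentation fix alters the content of an element.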
-
Yes. For now, I think I used the speaker/introduction XML tag's ID as the speaker ID in the R package. I think we should formalize this because we will need The question is whether there is a way to formalize this in the ParlaClarin schema.
-
I ended up settling on this to parse speeches:

    def speech_iterator(root):
        """
        Convert Parla-Clarin XML to an iterator of concatenated speeches and speaker ids.
        Speech segments are concatenated for utterances until no "next" attribute is found.

        Args:
            root: Parla-Clarin document root, as an lxml tree root.
        """
        us = root.findall(".//{http://www.tei-c.org/ns/1.0}u")
        divs = root.findall(".//{http://www.tei-c.org/ns/1.0}div")
        protocol_id = root.findall(".//{http://www.tei-c.org/ns/1.0}head")[0].text
        docdates = root.findall(".//{http://www.tei-c.org/ns/1.0}docDate")
        if len(us) == 0:
            return None
        dates = [docdate.get("when", None) for docdate in docdates]
        speaker_id = None
        speech_id = None
        for div in divs:
            speech_text = []
            for element in div:
                if (
                    element.tag == "{http://www.tei-c.org/ns/1.0}note"
                    and "speaker" in element.attrib.values()
                ):
                    # Extract speech_id from note
                    speech_id = element.attrib.get("{http://www.w3.org/XML/1998/namespace}id", None)
                elif element.tag == "{http://www.tei-c.org/ns/1.0}u":
                    # Collect utterances
                    speaker_id = element.attrib.get("who", None)
                    text = [t.split() for t in element.itertext()]
                    for t in text:
                        speech_text.extend(t)
                    if "next" not in element.attrib:
                        # Yield the speech segment if it's the last utterance
                        yield {
                            "text": " ".join(speech_text),
                            "speaker_id": speaker_id,
                            "speech_id": speech_id,
                            "protocol_id": protocol_id,
                            "date": dates,
                        }
                        speech_text = []
                        speaker_id = None
                        speech_id = None
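To make the grouping rule concrete, here is a minimal, self-contained sketch of the same idea run on a toy document, using only the standard library. The sample XML, the IDs in it, and the helper name `group_speeches` are made up for illustration and are not taken from the corpus; the sketch also assumes that speaker introductions are marked as `type="speaker"` on the note element.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Toy document (hypothetical IDs): one speech split over two chained
# utterances, followed by a second speech with a single utterance.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><div>
    <note type="speaker" xml:id="n1">Herr A:</note>
    <u who="person_a" xml:id="u1" next="u2"><seg>First part</seg></u>
    <u who="person_a" xml:id="u2" prev="u1"><seg>second part.</seg></u>
    <note type="speaker" xml:id="n2">Fru B:</note>
    <u who="person_b" xml:id="u3"><seg>A reply.</seg></u>
  </div></body></text></TEI>"""


def group_speeches(root):
    """Group chained utterances into speeches keyed by the introducing note's ID."""
    speeches = []
    for div in root.iter(TEI + "div"):
        intro_id, words = None, []
        for el in div:
            if el.tag == TEI + "note" and el.get("type") == "speaker":
                # The speaker introduction doubles as the speech-level ID.
                intro_id = el.get(XML_ID)
            elif el.tag == TEI + "u":
                words.extend(" ".join(el.itertext()).split())
                if "next" not in el.attrib:
                    # No "next" attribute: this utterance closes the speech.
                    speeches.append(
                        {
                            "speech_id": intro_id,
                            "speaker_id": el.get("who"),
                            "text": " ".join(words),
                        }
                    )
                    words = []
    return speeches


speeches = group_speeches(ET.fromstring(SAMPLE))
```

Here the two chained utterances end up in one speech with `speech_id` `n1`, and the reply becomes a second speech with `speech_id` `n2`.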
-
I am reopening this issue since I think we need a more formal solution, i.e. an actual speech_id.
-
I need to keep track of ids on the speech level for my work with riksdagen-corpus. I was hoping you could clarify some things for me:
<note> tags that also have a "speaker" attribute. Is this assumption correct?
I am assuming your utterance IDs will remain the same as long as you use the same OCR segmentation to create the corpus. But speech segmentations will probably change over time as you improve the segmentations.
The reason I'm asking is that I misunderstood the format and erroneously extracted a speech ID from the u tags.