-
Hi! I've made a first pass on matching text from speeches in The premise is as follows:
I made a table below with some statistics of the total number of speeches, and the number of matched speeches grouped by year. I have a couple of questions regarding this which I hope you could provide some input on:
-
Great work! I am not really sure about the dates. Some records (protokoll) could also cover a couple of days. @ninpnin do you know about the quality of our coverage of the dates? Regarding the second question, I am not sure I understand what you mean. But one important thing is that there is a difference between the oral speeches and the printed speeches in the records. This is because the parliamentary stenographers turn verbatim speech into edited, readable text. Sometimes the difference is small, but sometimes it is large.
-
The second question was referring to the fact that the total number of speeches in the dataset drops to 5459 in 1976. This seems anomalous, as there is a higher number of speeches in the preceding and following years. @BobBorges informed me in an e-mail conversation that v.1.0.0 of The Swedish Parliament Corpus is out, and that you've switched repositories to https://github.com/swerik-project/the-swedish-parliament-corpus. For reference, in v.1.0.0 there are a lot more segmented speeches in the mid-70s. I think I will try running this all again with the new version of the parliament corpus, and additionally relax the date ranges for the speeches I am not able to match in the first pass of the data.
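A rough sketch of what that second pass with relaxed date ranges could look like, assuming each speech and each record is a dict with an `id` and a `date` field. These field names, the function names, and the window sizes are made up for illustration and are not taken from the corpus or from the matching code discussed here. The date window only narrows down the candidate records; the actual text comparison is left out.

```python
from datetime import date, timedelta


def candidate_records(speech_date, records, window_days=0):
    """Return records dated within +/- window_days of speech_date."""
    window = timedelta(days=window_days)
    return [r for r in records if abs(r["date"] - speech_date) <= window]


def match_in_passes(speeches, records, windows=(0, 1, 3)):
    """Try an exact-date pass first, then widen the window for the leftovers.

    `windows` is a hypothetical schedule of tolerances in days.
    Returns (matched, unmatched), where matched maps a speech id to the
    window that produced candidates and the ids of those candidates.
    """
    matched = {}
    unmatched = list(speeches)
    for window_days in windows:
        still_unmatched = []
        for speech in unmatched:
            candidates = candidate_records(speech["date"], records, window_days)
            if candidates:
                matched[speech["id"]] = (window_days, [r["id"] for r in candidates])
            else:
                still_unmatched.append(speech)
        unmatched = still_unmatched
    return matched, unmatched
```

Recording which window produced each match makes it possible to check afterwards whether the relaxed passes cluster in particular years, which might point to date problems rather than missing speeches.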
-
By the way, I think you should advertise the fact that v.1.0.0 is out and is hosted in a different repository in the README of this repo.
-
Ah, got it. We have seen some variation in the speech data, but not over as many years as your data indicate. There are some strange things going on around 1976/1977, when the parliament switched from a full-year to an autumn-to-summer parliamentary year. I can also say that we are working on improving the segmentation of speeches right now. Hence, in the next releases, there might be an even better mapping between speech recordings and printed speech records. And yes, it is on my to-do list to do so!
-
Excellent!
-
@BobBorges and @ninpnin, do you have a take on question 1? Is it worthwhile to file this as an issue?
-
I think the dates are pretty reliable. One challenge, though, is that some records contain multiple dates, but it sounds like @Lauler already has that covered to some extent.
-
The newer docDates (from 1971 onward, maybe?) come directly from riksdagen's metadata, while for the data before that they are scraped from margin notes. If there are OCR errors in the dates, some dates may be incorrect, but I think that is rare. We are not aware of any issues with this as of now. If this is an important concern, we can perform statistical quality control. It would probably even be pretty fast: sample 1 protocol per year/chamber and check the date manually.
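The sampling step for such a check could be sketched roughly as follows, assuming each protocol is available as a dict with `id`, `year` and `chamber` fields. These field names and the function name are hypothetical and not the corpus's actual schema. A fixed seed keeps the draw reproducible, so that the same protocols can be re-checked later.

```python
import random
from collections import defaultdict


def sample_one_per_stratum(protocols, seed=0):
    """Draw one protocol at random from every (year, chamber) group."""
    rng = random.Random(seed)  # seeded so the sample can be reproduced
    strata = defaultdict(list)
    for protocol in protocols:
        strata[(protocol["year"], protocol["chamber"])].append(protocol)
    # Iterate in sorted key order so the draw does not depend on input order
    # of the groups, only on the seed and each group's contents.
    return {key: rng.choice(group) for key, group in sorted(strata.items())}
```

The sampled protocols would then be checked by hand against the date on the scanned page, and the share of wrong dates gives a rough error estimate per period.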