This repository has been archived by the owner on May 25, 2018. It is now read-only.

Infrastructure for removing duplicates #2

Open
dokterbob opened this issue Dec 11, 2012 · 3 comments

Comments

@dokterbob
Collaborator

Sometimes, feed items are duplicated across streams. To avoid sending new notifications for information that has already been seen, some infrastructure for deduplication has to be worked out.

@rejozenger Examples, please. ^^

@rejozenger
Member

Unfortunately, I can't include attachments other than images here. Here is a link to three different representations of essentially the same information: https://rejo.zenger.nl/files/newspeak-example-duplication.zip. At least one of the two representations from officielebekendmakingen.nl appeared in the feed https://zoek.officielebekendmakingen.nl/kamervragen_aanhangsel/rss; the publication at rijksoverheid.nl appeared in http://feeds.rijksoverheid.nl/kamerstukken.rss.

@dokterbob
Collaborator Author

@rejozenger Since RSS feeds change over time, please attach the actual RSS resource (or a dump of it) in the future. The current state of the RSS feed does not contain any reference to the mentioned documents.

Note about the documents: the versions on 'officiëlebekendmakingen' have corresponding 'Aanhangselnummer' values. This could be a lead for preliminary deduplication.
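
To make the 'Aanhangselnummer' idea concrete, here is a minimal sketch of what keyed deduplication could look like. It is an illustration, not existing project code. Everything in it is an assumption: the names `AANHANGSEL_RE`, `dedup_key` and `deduplicate` are hypothetical, items are assumed to be plain dicts with `title` and `summary` fields, and the number is assumed to appear in the item text as the word "Aanhangselnummer" followed by digits. The real feeds may expose it differently, or only in the linked HTML or PDF.

```python
import re
from typing import Iterable, Iterator, Optional

# Hypothetical pattern: assumes the identifier shows up in the item text as
# "Aanhangselnummer" followed (within a few characters) by a number.
AANHANGSEL_RE = re.compile(r"aanhangselnummer\D{0,5}(\d+)", re.IGNORECASE)


def dedup_key(item: dict) -> Optional[str]:
    """Return the Aanhangselnummer found in the item text, or None."""
    # Assumed item fields; missing fields are treated as empty strings.
    text = " ".join(str(item.get(field, "")) for field in ("title", "summary"))
    match = AANHANGSEL_RE.search(text)
    return match.group(1) if match else None


def deduplicate(items: Iterable[dict]) -> Iterator[dict]:
    """Yield items, skipping those whose key was already seen.

    Items without a recognisable key are passed through unchanged,
    since nothing can be concluded about them.
    """
    seen = set()
    for item in items:
        key = dedup_key(item)
        if key is None:
            yield item
            continue
        if key in seen:
            continue  # same Aanhangselnummer already seen: treat as duplicate
        seen.add(key)
        yield item


# Made-up example items, merged from two streams.
items = [
    {"title": "Antwoord op kamervragen", "summary": "Aanhangselnummer: 123", "link": "a"},
    {"title": "Kamerstuk", "summary": "aanhangselnummer 123", "link": "b"},
    {"title": "Overig", "summary": "geen nummer", "link": "c"},
    {"title": "Kamervragen", "summary": "Aanhangselnummer: 456", "link": "d"},
]
unique_links = [item["link"] for item in deduplicate(items)]
```

In this sketch the `seen` set only lives for one call; a real implementation would need to persist it between feed fetches. If the numbers can repeat over time (for example per parliamentary year), the key would have to include more than the number alone.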

@rejozenger
Member

I have uploaded a compressed directory to https://rejo.zenger.nl/files/newspeak.zip. It contains three directories, each with a dump of the RSS feed in which the item appeared, a saved version of the HTML file the RSS feed pointed to, and a saved copy of the PDF file the HTML file pointed to. It also includes "items-seen.txt", which shows all of the occurrences over time. I haven't investigated, but I am pretty sure that not all nine versions appeared in the RSS feed (using the legacy code).
