TAB Pre Annotation Phase #7

golankai · 2024-02-08T15:05:40Z

Dear researcher, thank you very much for this amazing work!

I would like to extend the dataset to other domains while preserving the annotation guidelines.
To minimize differences and follow your work precisely, I would like to run the same pre-annotation procedure you used. Would you be willing to share more detail on this step, or the code?

I am doing this as part of a research project of the TrustHLT Group, and it would be extremely beneficial to us!

Thank you very much,
Kai.

plison · 2024-02-09T10:40:35Z

Hi Kai,

Sure, here is the Python code we used to pre-annotate the documents from the ECHR. As you can see, it essentially boils down to:

* Running spaCy to get named entities, plus a few simple regular expressions to detect codes and dates
* Correcting those entities with a few heuristics, and mapping the 18 OntoNotes categories to the privacy-oriented categories we had defined

The code is really tailored to ECHR documents and their formatting though, so I'm not sure how useful it would be to other domains, apart perhaps from the mapping between OntoNotes named-entity categories and the more privacy-oriented categories from TAB.

Pierre

plison · 2024-02-09T10:46:37Z

The file is here: https://github.com/NorskRegnesentral/text-anonymization-benchmark/blob/master/scripts/annotate.py
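
For readers who want a feel for the two steps described above before opening the script, here is a minimal sketch. It is not the code from annotate.py: the mapping table, the two regular expressions, the overlap rule, and the function names are all assumptions made for illustration, and the NER output is passed in as plain (start, end, label) tuples so the example runs without spaCy installed.

```python
import re

# Hypothetical mapping from the 18 OntoNotes labels to TAB-style categories.
# This is an illustrative guess, NOT the mapping used in annotate.py.
# A value of None means "drop this entity type" (also an assumption).
ONTONOTES_TO_TAB = {
    "PERSON": "PERSON", "NORP": "DEM", "FAC": "LOC", "ORG": "ORG",
    "GPE": "LOC", "LOC": "LOC", "PRODUCT": "MISC", "EVENT": "MISC",
    "WORK_OF_ART": "MISC", "LAW": "MISC", "LANGUAGE": "DEM",
    "DATE": "DATETIME", "TIME": "DATETIME", "PERCENT": "QUANTITY",
    "MONEY": "QUANTITY", "QUANTITY": "QUANTITY",
    "ORDINAL": None, "CARDINAL": None,
}

# Assumed patterns: application-number-like codes ("12345/67")
# and dates written as "3 May 2001".
MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
CODE_PATTERN = re.compile(r"\b\d+/\d{2}\b")
DATE_PATTERN = re.compile(r"\b\d{1,2} (?:" + MONTHS + r") \d{4}\b")


def regex_spans(text):
    """Return (start, end, label) spans found by the regular expressions."""
    spans = []
    for label, pattern in (("CODE", CODE_PATTERN), ("DATETIME", DATE_PATTERN)):
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return spans


def map_entities(entities):
    """Map (start, end, ontonotes_label) spans to TAB-style labels."""
    mapped = []
    for start, end, label in entities:
        tab_label = ONTONOTES_TO_TAB.get(label)
        if tab_label is not None:
            mapped.append((start, end, tab_label))
    return mapped


def pre_annotate(text, ner_entities):
    """Combine regex spans with mapped NER spans.

    Simple heuristic (an assumption): a mapped NER span is kept only if it
    does not overlap a span that has already been accepted.
    """
    spans = regex_spans(text)
    for start, end, label in map_entities(ner_entities):
        overlaps = any(start < e and s < end for s, e, _ in spans)
        if not overlaps:
            spans.append((start, end, label))
    return sorted(spans)
```

With spaCy available, the `ner_entities` argument could be built from a processed document as `[(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]`. For a new domain, the regular expressions and the overlap heuristic are the parts most likely to need replacing, while a label mapping of this kind may carry over more directly.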
