Fuzzy matches should exclude interpunction and upper/lower casing switching #67

Open
bramiozo opened this issue Mar 18, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@bramiozo
Collaborator

First this span

Geen longembolieën. Beiderzijds

is tagged as positive, then this span

Geen longembolieën

is tagged as negative.

The relevant phrases in the concept dictionary are:
positive: Longembolieën beiderzijds
negative: Geen longembolieën

with the following clinlp settings:

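import spacy
import clinlp  # provides the "clinlp" language and the clinlp_* pipeline components

# Blank clinlp pipeline
nlp = spacy.blank("clinlp")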
# Sentences
nlp.add_pipe("clinlp_sentencizer")
# Entity matcher
entity_matcher = nlp.add_pipe(
    "clinlp_entity_matcher",
    config={"fuzzy": 2, "fuzzy_min_len": 5, "proximity": 1},
)
# clinlp_concepts is the concept dictionary containing the phrases listed above
entity_matcher.load_concepts(clinlp_concepts)
# Qualifiers
nlp.add_pipe("clinlp_context_algorithm")

Expected behavior

  • I expect both spans to be tagged as negative.
  • I expect the casing in Longembolieën beiderzijds to be respected.
  • I expect punctuation to be excluded from the fuzzy search and the proximity matching.
  • The sentencizer should function as a hard delimiter for phrase matching.

Direction of solution:

  • For the punctuation we can use the IS_PUNCT option as a pattern attribute.
  • We have to exclude punctuation characters from the fuzzy matcher, as well as changes in upper/lower casing.
bramiozo added the bug (Something isn't working) label Mar 18, 2024
bramiozo self-assigned this Mar 18, 2024
@bramiozo
Collaborator Author

negative
positive

@bramiozo
Collaborator Author

bramiozo commented Apr 24, 2024

@vmenger, I'm working on this issue. By the way, I saw that clinlp has been moved to src/clinlp :)

P.S. Or is it this issue? Oh well.

vmenger moved this from Next up to In Progress in Clinlp development roadmap Apr 24, 2024
@vmenger
Collaborator

vmenger commented Apr 24, 2024

Cool, good to know! The whole roadmap now lives here: https://github.com/orgs/umcu/projects/3, which is also a bit more inviting to external people who want to contribute. Feel free to add to it (by creating issues

vmenger added this to the Entity matching improvements milestone May 16, 2024
vmenger moved this from In Progress to Later in Clinlp development roadmap May 30, 2024
vmenger removed this from the Entity matching improvements milestone May 30, 2024
@bramiozo
Collaborator Author

bramiozo commented Jun 3, 2024

Thinking out loud / note to self:
Excluding punctuation from fuzzy/proximity matching can be done by temporarily replacing each punctuation mark with a token sequence of sufficient length, e.g. "SEP SEP SEP SEP SEP" :D. A somewhat ridiculous option, but it does work...
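
A minimal, standard-library sketch of that placeholder idea, not clinlp code. The tokenizer, the helper names and the simplified notion of proximity (at most `proximity` tokens between consecutive phrase words) are all assumptions made for illustration.

```python
import re

PLACEHOLDER = "SEP"  # filler token from the note above; assumed not to occur in real text


def tokenize(text):
    # Simplified stand-in for a real tokenizer: words and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


def pad_punctuation(tokens, proximity):
    # Replace each punctuation token by proximity + 1 fillers, so that no
    # proximity window is wide enough to bridge the original punctuation.
    padded = []
    for token in tokens:
        if re.fullmatch(r"[^\w\s]", token):
            padded.extend([PLACEHOLDER] * (proximity + 1))
        else:
            padded.append(token)
    return padded


def proximity_match(tokens, phrase, proximity):
    # True if the phrase words occur in order, case-insensitively, with at
    # most `proximity` other tokens between each pair of consecutive words.
    lowered = [t.lower() for t in tokens]
    words = [w.lower() for w in phrase]
    for start, token in enumerate(lowered):
        if token != words[0]:
            continue
        pos = start
        for word in words[1:]:
            window = lowered[pos + 1 : pos + 2 + proximity]
            if word not in window:
                break
            pos = pos + 1 + window.index(word)
        else:
            return True
    return False


tokens = tokenize("Geen longembolieën. Beiderzijds")
padded = pad_punctuation(tokens, proximity=1)
positive = ["Longembolieën", "beiderzijds"]
bridged_before = proximity_match(tokens, positive, proximity=1)
bridged_after = proximity_match(padded, positive, proximity=1)
```

With `proximity=1` the full stop fits in the gap of the unpadded token list, while the two fillers that replace it do not, so the positive phrase no longer spans the sentence end. The negative phrase "Geen longembolieën" is unaffected by the padding.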

Not accepting case switching in the fuzzy matching requires a newly compiled Levenshtein.
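
As a rough alternative to recompiling, the check could perhaps be approximated in plain Python. The sketch below is a heuristic and an assumption, not how clinlp or its Levenshtein backend works: it accepts a candidate only if it is within the fuzzy budget and its edit distance does not shrink after lowercasing, meaning none of the distance is attributable to casing. The function names are hypothetical.

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance (insert, delete, substitute).
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute x by y
            ))
        previous = current
    return previous[-1]


def fuzzy_match_case_strict(candidate, term, fuzzy):
    # Hypothetical helper: reject candidates whose distance to the term is
    # (partly) explained by a difference in upper/lower casing.
    cased = levenshtein(candidate, term)
    folded = levenshtein(candidate.lower(), term.lower())
    return cased <= fuzzy and cased == folded


typo_ok = fuzzy_match_case_strict("longembolien", "longembolieën", fuzzy=2)
case_flip = fuzzy_match_case_strict("Beiderzijds", "beiderzijds", fuzzy=2)
```

Here a genuine spelling variant still matches, while "Beiderzijds" is rejected for the dictionary term "beiderzijds" because the only difference is the capital.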

@vmenger
Collaborator

vmenger commented Jun 3, 2024

If everything works correctly, the fuzzy matching only looks within a single token. Wasn't it the proximity matching that matched the punctuation here? In that case it is easier to fix, by replacing {"OP": "?"} with something that does not match punctuation (or perhaps sentence boundaries either?).
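
A small sketch of what that replacement could look like with a plain spaCy `Matcher`. It assumes the proximity gap is expressed as an optional wildcard token, as described above, and uses a blank Dutch pipeline and hand-written patterns rather than clinlp's actual pattern construction.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("nl")  # plain Dutch tokenizer, assumed sufficient for this demo
doc = nlp("Geen longembolieën. Beiderzijds")


def build_pattern(gap):
    # Two phrase tokens with one optional gap token in between (proximity of 1).
    return [{"LOWER": "longembolieën"}, gap, {"LOWER": "beiderzijds"}]


# Current behaviour as described: the gap token may be anything, including ".".
loose = Matcher(nlp.vocab)
loose.add("loose", [build_pattern({"OP": "?"})])

# Suggested change: the gap token may not be punctuation.
strict = Matcher(nlp.vocab)
strict.add("strict", [build_pattern({"IS_PUNCT": False, "OP": "?"})])

loose_spans = [doc[start:end].text for _, start, end in loose(doc)]
strict_spans = [doc[start:end].text for _, start, end in strict(doc)]
```

The loose pattern matches across the full stop, whereas the strict pattern finds nothing in this text. Whether sentence boundaries should be excluded in the same way is left open, as in the comment above.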
