Measuring Detection Quality #20
First of all: great point and a very good conversation starter! Here are my thoughts about it:

Section Splitting
In conclusion, I would use a very conservative n with the first option. But maybe there are other possibilities.

Judicial Person Extraction and Judgment Outcome Extraction

This project will most likely continue at least for the remainder of my PhD, so around 2 more years. So I am very happy to discuss any ideas with you guys :)
Oh wow, that looks very promising! I would love to check it out! Maybe we can make it easy for everyone to use, so that annotating a few examples per spider can be done quickly.
Yes, sure, we can do that. I will contact you to find time for a meeting.
I agree that some quality measures of the section splitting are missing. The length of a section is one point, sure. I did the section splitting as follows: consider all the positions of the section markers and, based on those, split the text. This offers many easy possibilities to check for reasonable splitting, e.g. a sorted list of matches, where the numbers are the positions in the decision where the match starts. After sorting these, I know for each section its position in the text, and automatically its length as well. The first quality check I did is that I need at least some text in each of my defined sections.
I have debug output as follows for the splitting:
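The marker-position approach described above could be sketched roughly like this. The marker names and regexes below are purely illustrative, not the project's actual patterns (the real spiders define their own per-court markers):

```python
import re

# Illustrative section markers; the real project defines per-spider patterns.
SECTION_MARKERS = {
    "header": re.compile(r"Urteil vom"),
    "considerations": re.compile(r"Erwägungen:"),
    "rulings": re.compile(r"Demnach erkennt"),
}

def split_sections(decision: str, min_chars: int = 100) -> dict:
    """Locate the start of each section marker, sort the positions,
    and slice the decision text between consecutive markers."""
    positions = []
    for name, pattern in SECTION_MARKERS.items():
        match = pattern.search(decision)
        if match:
            positions.append((match.start(), name))
    positions.sort()  # order the sections by where they occur in the text

    sections = {}
    for i, (start, name) in enumerate(positions):
        end = positions[i + 1][0] if i + 1 < len(positions) else len(decision)
        sections[name] = decision[start:end]
        # first quality check: every detected section needs at least some text
        if len(sections[name]) < min_chars:
            raise ValueError(f"section '{name}' has only {end - start} characters")
    return sections
```

Because the slices are derived from sorted marker positions, each section's start offset and length fall out of the same data structure for free.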
I like the idea, as it's faster to keep the markers than to compute the length of each section afterwards. However, I would prefer to build a separate module to do the validations, which allows us to keep the checks reusable across spiders. This does require the length of each section to be computed afterwards.
I like the separate module, for the reasons stated. Also, computing the length is very fast, and as you said, this application is not time-critical. I think we should prioritize re-usability and clean code.
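A separate validation module along these lines might look roughly as follows. The function name and thresholds are illustrative, not the project's actual API; returning a list of problems instead of raising lets one debug run report every suspicious section at once:

```python
def validate_sections(sections: dict, total_length: int,
                      min_chars: int = 100, min_fraction: float = 0.01) -> list:
    """Check each extracted section against simple length heuristics and
    return human-readable descriptions of everything that looks wrong."""
    problems = []
    for name, text in sections.items():
        if len(text) < min_chars:
            problems.append(f"{name}: only {len(text)} chars (< {min_chars})")
        elif total_length and len(text) / total_length < min_fraction:
            problems.append(f"{name}: below {min_fraction:.0%} of the decision")
    return problems
```

The splitting code then stays free of validation logic, and every spider can share the same checks.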
Problem
So far it seems we do not really have a way of measuring the quality of extractions, only a quantitative measure. This means we can, for example, have a 100% score in section splitting but actually detect totally wrong sections (especially for the spiders working with PDFs as input). I've encountered this with the section splitting, but it is surely a concern with the other tasks as well.
Goal
I would like to find some metrics and maybe even implement them or the tools required. Therefore I'd be happy to hear suggestions on how we can tackle this problem, and will try to implement the good ones or at least evaluate them.
Some ideas on what I'm thinking of:
- section splitting
  - section length > n characters
  - section length > x % of total_decision_length
- judicial person extraction
  - I'm afraid that for most metrics we would have to manually annotate ~5% of the decisions, or 30-50 decisions.
- judgement outcome
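Given a handful of manually annotated decisions, a first quality metric for the splitting could compare the predicted section start offsets to the gold ones. This is only a sketch; the function name, dict format, and tolerance value are assumptions, not project code:

```python
def boundary_accuracy(predicted: dict, gold: dict, tolerance: int = 20) -> float:
    """Fraction of gold sections whose predicted start offset lies within
    `tolerance` characters of the manually annotated start offset."""
    if not gold:
        return 0.0
    hits = sum(
        1 for name, gold_start in gold.items()
        if name in predicted and abs(predicted[name] - gold_start) <= tolerance
    )
    return hits / len(gold)
```

Averaged over the 30-50 annotated decisions, a score like this would complement the pure detection-rate number with an actual correctness check.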
How to help?
Let me know what you think and bring in your ideas for your tasks. Bad ideas are always better than none 😉 You can write me directly on Telegram, but preferably comment here.
By the way, even if it's too late to implement all ideas, I'm sure the project does not end with us finishing the seminar!