
Measuring Detection Quality #20

Open
Skatinger opened this issue Dec 3, 2021 · 7 comments


@Skatinger (Contributor)

Problem

So far it seems we do not really have a way of measuring the quality of extractions, only their quantity. This means we can, for example, have a 100% score in section splitting but actually detect totally wrong sections (especially for the spiders working with PDFs as input). I've encountered this with section splitting, but it is surely a concern for the other tasks as well.

Goal

I would like to find some metrics and maybe even implement them or the tools required. Therefore I'd be happy to hear suggestions on how we can tackle this problem, and I will try to implement the good ones, or at least evaluate them.

Some ideas on what I'm thinking of:

section splitting

  • check if a detected section contains > n characters
  • check if a detected section contains > x % of total_decision_length
  • check if # of paragraphs seems reasonable for the amount of text in the given section
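
A minimal sketch of what such checks could look like (the `sections` dict, the threshold values, and the function name are assumptions for illustration, not the actual spider output format):

```python
# Minimal sketch of per-section sanity checks. The `sections` dict,
# the thresholds, and the names are assumptions for illustration.

MIN_SECTION_CHARS = 50        # hypothetical lower bound per section
MIN_SECTION_FRACTION = 0.01   # hypothetical: section >= 1% of the decision

def validate_sections(sections):
    """Return human-readable warnings for suspicious sections."""
    warnings = []
    total_length = sum(len(text) for text in sections.values())
    for name, text in sections.items():
        if len(text) < MIN_SECTION_CHARS:
            warnings.append(f"{name}: only {len(text)} characters")
        if total_length and len(text) / total_length < MIN_SECTION_FRACTION:
            warnings.append(f"{name}: only {len(text) / total_length:.1%} of decision length")
    return warnings
```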

judicial person extraction

  • maybe provide tools to easily create some ground truth and run the extraction against it?
  • without ground truth: check the format of extracted persons (e.g. do they have a first and a last name?)
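
Without ground truth, the format check could be a simple heuristic like the following sketch (the regex and the function name are made up; it only encodes the "first and last name" expectation from the bullet above):

```python
import re

# Hypothetical plausibility check: does an extracted "person" string look
# like at least two capitalized tokens ("Firstname Lastname")? A heuristic
# only -- no ground truth involved.
NAME_PATTERN = re.compile(r"^[A-ZÄÖÜ][\w\-']+(\s+[A-ZÄÖÜ][\w\-']+)+$")

def looks_like_person(extracted):
    return bool(NAME_PATTERN.match(extracted.strip()))
```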

I'm afraid that for most metrics we would have to create ground truth for roughly 5% of decisions, or some 30-50 decisions.

judgement outcome

  • most likely needs some ground truth, but this should be doable: we can build a tool that shows a decision, lets a human press "positive" or "negative", and shows the next one. Checking some 50 decisions by hand should be quick.
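
A minimal sketch of such a hand-labeling loop (the JSON-lines output format, the prompts, and the function name are all invented for illustration):

```python
import json

# Hypothetical minimal CLI for hand-labeling judgment outcomes.
# Assumes `decisions` is an iterable of decision texts.
def label_outcomes(decisions, out_path="labels.jsonl"):
    with open(out_path, "w", encoding="utf-8") as out:
        for text in decisions:
            print(text[:2000])  # show (a prefix of) the decision
            answer = input("positive (p) / negative (n) / skip (anything else)? ").strip()
            if answer in ("p", "n"):
                label = "positive" if answer == "p" else "negative"
                out.write(json.dumps({"text": text, "label": label}) + "\n")
```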

How to help?

Let me know what you think and bring in your ideas for your tasks. Bad ideas are always better than none 😉 You can write me directly on Telegram, but preferably comment here.

By the way, even if it's too late to implement all ideas, I'm sure the project does not end with us finishing the seminar!

@JoelNiklaus (Owner)

First of all: Great point and very good conversation starter!

Here are my thoughts about it:

Section Splitting

The "> n characters" check is a good metric in my opinion. We just need to make sure that n is low enough that we don't punish genuinely small sections (which definitely exist! e.g. the facts in some Federal Supreme Court decisions are very short). I would propose a ballpark of 50 < n < 100.
The "> x % of total_decision_length" check works too. In my opinion it is similar to the one above, though a bit harder to implement. And it can happen that a decision contains mostly just the considerations and not much else.
The same goes for the # of paragraphs check. We need to make sure not to punish decisions that are simply non-standard.

In conclusion, I would use a very conservative n with the first option. But maybe there are other possibilities.

Judicial Person Extraction and Judgment Outcome Extraction
We actually already paid for the premium version of Prodigy (https://prodi.gy/), which is a very good annotation tool. Just let me know if you want to use it, and I can give you the file to run it.

This project will most likely continue at least for the remainder of my PhD, so around two more years. So I am very happy to discuss any ideas with you guys :)

@Skatinger (Contributor, Author)

> We actually already paid for the premium version of Prodigy

Oh wow, that looks very promising! Would love to check it out!

Maybe we can make it easy to use for everyone so that annotating a few examples per spider can quickly be done.

@JoelNiklaus (Owner)

Yes, sure, we can do that. I will contact you to find a time for a meeting.

@susifohn (Contributor) commented Dec 6, 2021

I agree that some quality measures for the section splitting are missing. The length of a section is one point, sure. I did section splitting as follows: consider all the positions of the section markers and, based on those, split the text. This offers many easy possibilities to check for reasonable splitting, e.g.

Language.DE
    Section.CONSIDERATIONS    509
    Section.CONSIDERATIONS   3034
    Section.CONSIDERATIONS   2981
    Section.RULINGS          3758

where the numbers are the positions in the decision at which each match starts. After sorting these, I know the position of each section in the text, and its length follows automatically. The first quality check I did is that I need at least some text in every one of my defined sections, thus > 0 characters. And if, for example, sections overlap, we know we have a problem with the splitting.
I implemented this for UR_Gerichte, and it works quite simply. See Section splitting 10b54fc
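
A sketch of how the sorted marker positions yield section lengths and a basic emptiness/overlap check (the `(name, start)` pairs mirror the example above; everything else is an assumption for illustration):

```python
# Sketch: derive section spans from sorted marker start positions, as
# described above. Each section runs from its marker to the next marker
# (or to the end of the decision).
def sections_from_markers(markers, decision_length):
    ordered = sorted(markers, key=lambda m: m[1])
    starts = [start for _, start in ordered] + [decision_length]
    result = []
    for (name, start), next_start in zip(ordered, starts[1:]):
        length = next_start - start
        assert length > 0, f"{name} is empty or overlaps the next section"
        result.append((name, start, length))
    return result

markers = [("Section.RULINGS", 3758), ("Section.CONSIDERATIONS", 509)]
print(sections_from_markers(markers, decision_length=4200))
# [('Section.CONSIDERATIONS', 509, 3249), ('Section.RULINGS', 3758, 442)]
```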

@susifohn (Contributor) commented Dec 6, 2021

I have debug output as follows. For the splitting:
    Section.HEADER length is 18.7% of whole decision
    Section.CONSIDERATIONS length is 81.3% of whole decision
and for the judgment output:
    Section.HEADER contains Judgment.APPROVAL ("Beschwerde gutzuheissen") at 58.9% of the section.
    Section.CONSIDERATIONS contains Judgment.APPROVAL ("Beschwerde gutzuheissen") at 98.3% of the section.
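
Such output is cheap to produce once the section texts are known; a sketch (the section dict and function names are assumptions mirroring the printed lines above):

```python
# Sketch of producing debug output like the above.
def report_section_shares(sections):
    total = sum(len(text) for text in sections.values())
    for name, text in sections.items():
        print(f"{name} length is {len(text) / total:.1%} of whole decision")

def report_judgment_position(section_name, text, judgment, keyword):
    pos = text.find(keyword)
    if pos >= 0:
        print(f'{section_name} contains {judgment} ("{keyword}") '
              f"at {pos / len(text):.1%} of the section.")
```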

@Skatinger (Contributor, Author)

I like the idea, as it's faster to keep the markers than to compute the length of each section afterwards. However, I would prefer to build a separate module for validations, which allows us to

  1. extend functionality easily as we go
  2. create a simple API so everyone can use it for their spiders
  3. reduce code duplication
  4. split responsibilities (every module should do one thing)

This does require the length of each section to be computed with len(), but I think we can bear with that overhead, as it will only run once anyway, or just on a small dataset during development. A sketch of what such a module could look like follows below.
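
The module name, API, and checks here are made up; the point is only the shape: one small entry point, with checks that are easy to append:

```python
# section_validation.py -- hypothetical stand-alone validation module.
# The names and structure are assumptions; the idea is a single small
# API that every spider can call, with checks that are easy to extend.
from typing import Callable, Dict, List, Optional

Check = Callable[[str, str, int], Optional[str]]

def check_min_length(name: str, text: str, total: int) -> Optional[str]:
    # conservative lower bound, per the discussion above
    return f"{name}: only {len(text)} characters" if len(text) < 50 else None

CHECKS: List[Check] = [check_min_length]  # append new checks here

def validate(sections: Dict[str, str]) -> List[str]:
    """Run every registered check over {section_name: text}."""
    total = sum(len(t) for t in sections.values())
    warnings = []
    for name, text in sections.items():
        for check in CHECKS:
            warning = check(name, text, total)
            if warning:
                warnings.append(warning)
    return warnings
```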

@JoelNiklaus (Owner)

I like the separate module for the reasons stated. Also, computing the length is very fast and as you said, this application is not time-critical. I think we should prioritize re-usability and clean code.
