
Measuring Detection Quality #20

Open
Skatinger opened this issue Dec 3, 2021 · 7 comments


@Skatinger (Contributor)

Problem

So far it seems we do not really have a way of measuring the quality of extractions, only their quantity. This means we can, for example, have a 100% score in section splitting but actually detect totally wrong sections (especially for the spiders working with PDFs as input). I've encountered this with section splitting, but it is surely a concern for the other tasks as well.

Goal

I would like to find some metrics and maybe even implement them or the tools required. Therefore I'd be happy to hear suggestions on how we can tackle this problem, and I will try to implement the good ones, or at least evaluate them.

Some ideas on what I'm thinking of:

section splitting

  • check if a detected section contains > n characters
  • check if a detected section contains > x % of total_decision_length
  • check if # of paragraphs seems reasonable for the amount of text in the given section
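
A minimal sketch of what such checks could look like (the `sections` dict, the threshold values, and the function name are assumptions for illustration, not the actual spider output format):

```python
# Minimal sketch of per-section sanity checks. The `sections` dict,
# the thresholds, and the names are assumptions for illustration.

MIN_SECTION_CHARS = 50        # hypothetical lower bound per section
MIN_SECTION_FRACTION = 0.01   # hypothetical: section >= 1% of the decision

def validate_sections(sections):
    """Return human-readable warnings for suspicious sections."""
    warnings = []
    total_length = sum(len(text) for text in sections.values())
    for name, text in sections.items():
        if len(text) < MIN_SECTION_CHARS:
            warnings.append(f"{name}: only {len(text)} characters")
        if total_length and len(text) / total_length < MIN_SECTION_FRACTION:
            warnings.append(f"{name}: only {len(text) / total_length:.1%} of decision length")
    return warnings
```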

judicial person extraction

  • maybe provide tools to easily create some ground truth and run the extraction against it?
  • without ground truth: check the format of extracted persons (e.g. do they have a first and a last name?)
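
Without ground truth, the format check could be a simple heuristic like the following sketch (the regex and the function name are made up; it only encodes the "first and last name" expectation from the bullet above):

```python
import re

# Hypothetical plausibility check: does an extracted "person" string look
# like at least two capitalized tokens ("Firstname Lastname")? A heuristic
# only -- no ground truth involved.
NAME_PATTERN = re.compile(r"^[A-ZÄÖÜ][\w\-']+(\s+[A-ZÄÖÜ][\w\-']+)+$")

def looks_like_person(extracted):
    return bool(NAME_PATTERN.match(extracted.strip()))
```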

I'm afraid that for most metrics we would have to create ground truth for roughly 5% of decisions, or some 30-50 decisions.

judgement outcome

  • most likely needs some ground truth, but this should be doable: we can build a tool that shows a decision, lets a human press "positive" or "negative", and shows the next one. Checking some 50 decisions by hand should be quick.
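
A minimal sketch of such a hand-labeling loop (the JSON-lines output format, the prompts, and the function name are all invented for illustration):

```python
import json

# Hypothetical minimal CLI for hand-labeling judgment outcomes.
# Assumes `decisions` is an iterable of decision texts.
def label_outcomes(decisions, out_path="labels.jsonl"):
    with open(out_path, "w", encoding="utf-8") as out:
        for text in decisions:
            print(text[:2000])  # show (a prefix of) the decision
            answer = input("positive (p) / negative (n) / skip (anything else)? ").strip()
            if answer in ("p", "n"):
                label = "positive" if answer == "p" else "negative"
                out.write(json.dumps({"text": text, "label": label}) + "\n")
```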

How to help?

Let me know what you think and bring in your ideas for your tasks. Bad ideas are always better than none 😉 You can write me directly on Telegram, but preferably comment here.

By the way, even if it's too late to implement all ideas, I'm sure the project does not end with us finishing the seminar!

@JoelNiklaus (Owner)

First of all: Great point and very good conversation starter!

Here are my thoughts about it:

Section Splitting

The "> n characters" check is a good metric in my opinion. We just need to make sure that n is low enough that we don't punish genuinely small sections (which definitely exist! e.g. the facts in some Federal Supreme Court decisions are very short). I would propose a ballpark of 50 < n < 100.
The "> x % of total_decision_length" check works too. In my opinion it is similar to the one above, though a bit harder to implement. And it can happen that a decision contains mostly just the considerations and not much else.
The same goes for the # of paragraphs check. We need to make sure not to punish decisions that are simply non-standard.

In conclusion, I would use a very conservative n with the first option. But maybe there are other possibilities.

Judicial Person Extraction and Judgment Outcome Extraction
We actually already paid for the premium version of Prodigy (https://prodi.gy/), which is a very good annotation tool. Just let me know if you want to use it, and I can give you the file to run it.

This project will most likely continue at least for the remainder of my PhD, so around two more years. So I am very happy to discuss any ideas with you guys :)

@Skatinger (Contributor, Author)

> We actually already paid for the premium version of Prodigy

Oh wow, that looks very promising! Would love to check it out!

Maybe we can make it easy to use for everyone so that annotating a few examples per spider can quickly be done.

@JoelNiklaus (Owner)

Yes, sure, we can do that. I will contact you to find a time for a meeting.

@susifohn (Contributor) commented Dec 6, 2021

I agree that some quality measures for the section splitting are missing. The length of a section is one point, sure. I did section splitting as follows: consider all the positions of the section markers and, based on those, split the text. This offers many easy possibilities to check for reasonable splitting, e.g.

Language.DE
    Section.CONSIDERATIONS    509
    Section.CONSIDERATIONS   3034
    Section.CONSIDERATIONS   2981
    Section.RULINGS          3758

where the numbers are the positions in the decision at which each match starts. After sorting these, I know the position of each section in the text, and its length follows automatically. The first quality check I did is that I need at least some text in every one of my defined sections, thus > 0 characters. And if, for example, sections overlap, we know we have a problem with the splitting.
I implemented this for UR_Gerichte, and it works quite simply. See Section splitting 10b54fc
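
A sketch of how the sorted marker positions yield section lengths and a basic emptiness/overlap check (the `(name, start)` pairs mirror the example above; everything else is an assumption for illustration):

```python
# Sketch: derive section spans from sorted marker start positions, as
# described above. Each section runs from its marker to the next marker
# (or to the end of the decision).
def sections_from_markers(markers, decision_length):
    ordered = sorted(markers, key=lambda m: m[1])
    starts = [start for _, start in ordered] + [decision_length]
    result = []
    for (name, start), next_start in zip(ordered, starts[1:]):
        length = next_start - start
        assert length > 0, f"{name} is empty or overlaps the next section"
        result.append((name, start, length))
    return result

markers = [("Section.RULINGS", 3758), ("Section.CONSIDERATIONS", 509)]
print(sections_from_markers(markers, decision_length=4200))
# [('Section.CONSIDERATIONS', 509, 3249), ('Section.RULINGS', 3758, 442)]
```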

@susifohn (Contributor) commented Dec 6, 2021

I have debug output as follows. For the splitting:
    Section.HEADER length is 18.7% of whole decision
    Section.CONSIDERATIONS length is 81.3% of whole decision
and for the judgment output:
    Section.HEADER contains Judgment.APPROVAL ("Beschwerde gutzuheissen") at 58.9% of the section.
    Section.CONSIDERATIONS contains Judgment.APPROVAL ("Beschwerde gutzuheissen") at 98.3% of the section.
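
Such output is cheap to produce once the section texts are known; a sketch (the section dict and function names are assumptions mirroring the printed lines above):

```python
# Sketch of producing debug output like the above.
def report_section_shares(sections):
    total = sum(len(text) for text in sections.values())
    for name, text in sections.items():
        print(f"{name} length is {len(text) / total:.1%} of whole decision")

def report_judgment_position(section_name, text, judgment, keyword):
    pos = text.find(keyword)
    if pos >= 0:
        print(f'{section_name} contains {judgment} ("{keyword}") '
              f"at {pos / len(text):.1%} of the section.")
```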

@Skatinger (Contributor, Author)

I like the idea, as it's faster to keep the markers than to compute the length of each section afterwards. However, I would prefer to build a separate module for validations, which allows us to

  1. extend functionality easily as we go
  2. create a simple API so everyone can use it for their spiders
  3. reduce code duplication
  4. split responsibilities (every module should do one thing)

This does require the length of each section to be computed with len(), but I think we can bear with that overhead, as it will only run once anyway, or just on a small dataset during development. A sketch of what such a module could look like follows below.
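
The module name, API, and checks here are made up; the point is only the shape: one small entry point, with checks that are easy to append:

```python
# section_validation.py -- hypothetical stand-alone validation module.
# The names and structure are assumptions; the idea is a single small
# API that every spider can call, with checks that are easy to extend.
from typing import Callable, Dict, List, Optional

Check = Callable[[str, str, int], Optional[str]]

def check_min_length(name: str, text: str, total: int) -> Optional[str]:
    # conservative lower bound, per the discussion above
    return f"{name}: only {len(text)} characters" if len(text) < 50 else None

CHECKS: List[Check] = [check_min_length]  # append new checks here

def validate(sections: Dict[str, str]) -> List[str]:
    """Run every registered check over {section_name: text}."""
    total = sum(len(t) for t in sections.values())
    warnings = []
    for name, text in sections.items():
        for check in CHECKS:
            warning = check(name, text, total)
            if warning:
                warnings.append(warning)
    return warnings
```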

@JoelNiklaus (Owner)

I like the separate module for the reasons stated. Also, computing the length is very fast and as you said, this application is not time-critical. I think we should prioritize re-usability and clean code.
