Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

What should we do with former judges? #34

Closed
Jonakemon opened this issue May 6, 2022 · 9 comments · Fixed by #56
Closed

What should we do with former judges? #34

Jonakemon opened this issue May 6, 2022 · 9 comments · Fixed by #56

Comments

@Jonakemon
Copy link
Collaborator

Former judges are judges whose profile is not available anymore on https://namenlijst.rechtspraak.nl. @siccovansas and I discussed the following solution:

  • Datetime column called removed_from_rechtspraak_at on the Person model that indicates when the person was removed from the Namenlijst.
  • If a Person is removed (removed_from_rechtspraak_at is set), we shouldn't include the person by default in the search api. If a query param is set, then former judges can be included.
  • In the person enrich scraper, removed_from_rechtspraak_at should be set if the namenlijst web page returns an invalid status code.
  • If removed_from_rechtspraak_at is set, but the page becomes accessible again during a future scrape, removed_from_rechtspraak_at should be reset to None. Persons that have removed_from_rechtspraak_at set should be rescraped periodically.
  • In Searchfield text update #23 a toggle should be added to be able to filter on former judges.
@ThijmenDam
Copy link

ThijmenDam commented May 17, 2023

On topic: I'd prefer a combination of the first two points, i.e. you give the ability to include former judges via a query parameter, and if you do so, the datetime column provides information on when exactly the person was removed.

Slightly off-topic:

Does the 'former_judges' parameter still work as intended in your latest API release? I cannot seem to find any differences in the API response if I make a request with 'true' or 'false'.

To illustrate, I created a Python function to request all judges at this moment in time using your API. Since I'm experimenting with the data, I figured it would be better if I request all judges once and then store them locally.

This is my code:

def query(search_query: str, limit: int = 100, page: int = 0) -> OpenRechtspraakResponse:

    res = requests.get("http://openrechtspraak.nl/api/v1/person", params={
        'offset': limit * page,
        'limit': limit,
        'former_judges': 'true',   <--- the parameter
        'q': search_query
    })

    res.raise_for_status()
    json: OpenRechtspraakResponse = res.json()

    return json


def all_judges():
    _all_judges: list[Judge] = []
    judge_count = query('%')['count']

    # [0, 1, 2, ..., 45]
    pages_to_fetch = [limit // 100 for limit in range(0, judge_count, 100)]

    for page in pages_to_fetch:
        subset = query('%', 100, page)
        for judge in subset['data']:
            _all_judges.append(judge)

    return _all_judges

In the code snippet the parameter is set to 'true', though if I set it to 'false' I get the same count of judges in the response.

@Jonakemon
Copy link
Collaborator Author

That's weird. I tried it myself, and I did get a different count. Additionally, we've got multiple unit tests for this query parameter that test this.

My hypothesis is that there is an ordering issue: we order by last name, but there are several judges with the same last name. In my script below, you'll see that I increment the offset by 75 instead of the limit. This prevents the ordering bug from happening. I checked whether your script has a bug by changing _all_judges to a set, adding the id of each judge to the set and printing the length of the set after every page. If you do this, you'll see an unexpected length after a few pages: 100, 200, 300, 400, 497, 596, 695, ....

I'll open up a PR to apply a definitive fix (order by last name AND id). I hope this will be deployed next week. I'll ping you once it's up; could you then check whether this fixes your problem?

My script:

import requests

offset = 0
limit = 100
ids = set()

while True:
    r = requests.get("https://openrechtspraak.nl/api/v1/person",
                     params={"limit": limit, "offset": offset, "former_judges": "true"})
    r.raise_for_status()
    data = r.json().get("data")

    if len(data) == 0:
        break

    ids |= {i.get("id") for i in data}
    print(len(ids))
    offset += 75

print(ids, len(ids))

@Jonakemon
Copy link
Collaborator Author

I've included the field removed_from_rechtspraak_at in the API response. Will hopefully also be deployed next week.

@ThijmenDam
Copy link

I'll open up a PR to apply a definitive fix (order by last name AND id). I hope this will be deployed next week. I'll ping you once it's up; could you then check whether this fixes your problem?

For sure. Looking forward to the ping!

@Jonakemon
Copy link
Collaborator Author

@ThijmenDam this fix is now online. Can you verify whether this fixes your problem?

@ThijmenDam
Copy link

@Jonakemon yes will do! I'll let you know as soon as I have time, probably tonight or tomorrow night.

@ThijmenDam
Copy link

ThijmenDam commented May 26, 2023

Hi @Jonakemon! Thanks for the effort you put in this week.

Using the code snippet I posted above, these are my results when fetching all judges:

Parameter former_judges set to true: fetched 4591 judges.
Parameter former_judges set to false: fetched 4591 judges.

Thus, the parameter does not seem to make a difference.

Within the set of 4591 judges, 24 judges now contain the new property removed_from_rechtspraak_at. That's great :-)

Random example:

 {
  'first_name': None,
  'gender': 'male',
  'id': '56d1821d-1a2f-491b-81c0-6d0ea89c287b',
  'initials': 'K.',
  'last_name': 'Meijde',
  'professional_details': [{'function': 'Raadsheer',
    'id': 'af0e7be2-7631-442e-9b40-2370312a936b',
    'organisation': "Gerechtshof 's-Hertogenbosch"}],
  'rechtspraak_id': 'uluiy2DjE4JH29e9O7nH5rxmzOsyxQIz',
  'removed_from_rechtspraak_at': '2023-05-26T01:53:05.230291',
  'titles': 'dhr. mr.',
  'toon_naam': 'dhr. mr. K. van der Meijde',
  'toon_naam_kort': 'K. van der Meijde'
}

However, despite the removed_from_rechtspraak_at property being set for this particular employee, it seems that he is still part of the namenlijst: https://namenlijst.rechtspraak.nl/#!/details/Y1ALw9lal-qFDMwDZ8oN_bxmzOsyxQIz

@Jonakemon
Copy link
Collaborator Author

Hi @ThijmenDam, using my own code snippet, I do get a difference in judges: 4591 vs 4573 (that's 18, not 24). I did notice that we broke the count field which causes your script to be confused. This will be deployed somewhere next week. I also added quite some extra tests to prevent something like this from breaking again.

Regarding finding the example you posted, I don't think this judge is removed. When I query the API, I don't get the same result: https://openrechtspraak.nl/api/v1/person?q=Meijde.

When exactly did you query the API? We rescrape the dataset every night. You might have queried the API at the same time when we did the rescrape. It's an edge case, but I can't think of another reason why that could have happened.

@Jonakemon
Copy link
Collaborator Author

@ThijmenDam this is now (definitively) fixed! 🎉 If you visit the API with and without query parameter, you'll also see a difference in count!

Please open up a new issue if it's not resolved or if you encounter other bugs. Thanks for helping us out by reporting it! :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants