add MK names in english #220

Open
OriHoch opened this issue May 3, 2023 · 20 comments

Comments

@OriHoch
Contributor

OriHoch commented May 3, 2023

Currently, the main table with MK details, members_mk_individual, doesn't have any names in English. The relevant fields are there but empty. We should see if it's possible to get them from the Knesset API and, if not, scrape them from the HTML: https://knesset.gov.il/mk/eng/mkindex_current_eng.asp

@mattip

mattip commented May 5, 2023

The OData schema has a KNS_Person table, which does not include English names. The English member pages by id, like https://main.knesset.gov.il/en/MK/APPS/mk/mk-personal-details/909, have the English name in the <h2 _ngcontent-dpb-c237="" class="lobby-mk-name-prev">Abdullah Abu Maaruf</h2> element, but not split into first name and last name. I guess we could do something like

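import requests
from bs4 import BeautifulSoup

BASE = "https://main.knesset.gov.il/en/MK/APPS/mk/mk-personal-details/"

# members_mk_individual and store() stand in for the real table and writer
for mk_id, first_name, last_name in members_mk_individual:
    html = requests.get(BASE + str(mk_id)).text
    name = BeautifulSoup(html, "html.parser").find(class_="lobby-mk-name-prev").get_text()
    n_first = len(first_name.split())
    n_last = len(last_name.split())
    words = name.split()
    # the English name must have as many words as the Hebrew first + last name
    if len(words) != n_first + n_last:
        raise ValueError(f"cannot split {name!r} for id {mk_id}")
    first_name_eng = " ".join(words[:n_first])
    last_name_eng = " ".join(words[n_first:])
    store(mk_id, first_name_eng, last_name_eng)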

@OriHoch
Contributor Author

OriHoch commented May 5, 2023

I suggest as first step to add a new pipeline for this scraping which adds a table/package with id and english name per mk. Then we could combine this table into the mk_individual package and add a new english_full_name field.
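
A minimal sketch of the combine step, assuming both tables are available as rows of dicts. The column names mk_individual_id and NameEng are assumptions for illustration, and add_english_names is a hypothetical helper, not something that exists in the repo:

```python
def add_english_names(mk_rows, english_rows):
    """Yield copies of mk_rows with an english_full_name field added.

    mk_rows: iterable of dicts with an "mk_individual_id" key (assumed name).
    english_rows: iterable of dicts with "mk_individual_id" and "NameEng"
    keys (assumed names for the scraped table).
    """
    # Build a lookup from member id to English name.
    names = {row["mk_individual_id"]: row["NameEng"] for row in english_rows}
    for row in mk_rows:
        # Members without a scraped English name get None.
        yield {**row, "english_full_name": names.get(row["mk_individual_id"])}
```

Members that have no row in the scraped table keep an empty english_full_name, so the combine step would not fail on them.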

@mattip

mattip commented Feb 8, 2024

Getting back to this.

I suggest as first step to add a new pipeline for this scraping which adds a table/package with id and english name per mk.

Could you suggest a model pipeline I could use as a basis for this new pipeline?

@OriHoch
Contributor Author

OriHoch commented Feb 11, 2024

follow the steps here to set up a local development environment:

https://github.com/hasadna/knesset-data-pipelines/blob/master/airflow/README.md#local-development

then, add a command which runs your pipeline to the knesset-data-pipelines CLI; you can add a sub-group named mks (like we have there for committees)

https://github.com/hasadna/knesset-data-pipelines/blob/master/airflow/knesset_data_pipelines/cli.py

we don't have a lot of pipelines there yet so feel free to decide how to implement the pipeline itself, but you can see an example of another pipeline here - https://github.com/hasadna/knesset-data-pipelines/blob/master/airflow/knesset_data_pipelines/committees/background_material_titles.py

@mattip

mattip commented Feb 16, 2024

When I try to execute

knesset-data-pipelines committees background-material-titles

I get an error because the database is not populated

dataflows.base.exceptions.ProcessorError: Errored in processor iterable_loader in position #1: (psycopg2.errors.UndefinedTable) relation "committees_kns_committee" does not exist
LINE 7:                     from committees_kns_committee
                                 ^

[SQL: 
                    select
                        "CommitteeID" as committee_id,
                        "ParentCommitteeID" as parent_committee_id,
                        "Name" as name,
                        "CategoryDesc" as category_desc
                    from committees_kns_committee
                ]
(Background on this error at: https://sqlalche.me/e/14/f405)

Where in the docker-compose do the tables get initialized?

@OriHoch
Contributor Author

OriHoch commented Feb 18, 2024

each table gets populated by its relevant pipeline; in this case you would need to run knesset-data-pipelines run committees/kns_committee

But I suggest just copying over the relevant data. In this case I can send you read-only DB credentials privately, and you can copy this table over to your DB. Some pipelines depend on local files; in that case you can copy them from here - https://production.oknesset.org/pipelines/data/

@OriHoch
Contributor Author

OriHoch commented Feb 18, 2024

I sent you the db credentials in slack

@mattip

mattip commented Feb 18, 2024

Thanks!

@mattip

mattip commented Feb 20, 2024

It turns out the interesting page I want to scrape uses javascript, i.e. wget https://main.knesset.gov.il/en/MK/APPS/mk/mk-personal-details/909 does not activate the javascript code to fill in the data:
<!DOCTYPE html><html><head><meta charset="utf-8"><script type="text/javascript" src="/kramericaindustries.ac.lib.js"></script><script type="text/javascript"> ;;window.rbzns={"bereshit":"1","seed":"PvaZs3eWk1OKKUCVNFFgZG9U60fnFqQA6pa9o4LXr3Ax4ttM\/MRG\/tRml9TKG3chfVXn4QDs6GxnTF7xz7T+elJOjCksK1U3tvGS8Ldijwk=","location_host":"main.knesset.gov.il","storage":3,"protocol":"https:"};winsocks();</script></head><body></body></html>

I could use something like requests-html which does support javascript, at the cost of

Note, the first time you ever run the render() method, it will download Chromium into your home directory (e.g. ~/.pyppeteer/). This only happens once.

Is there already something in this repo that does render javascript pages?

@OriHoch
Contributor Author

OriHoch commented Feb 20, 2024

you can see in chrome developer tools that it makes a request to this url which returns the data in xml:

https://knesset.gov.il/WebSiteApi/knessetapi/MKs/GetMkdetailsHeader?mkId=909&languageKey=en

so you can just skip the html page and get the data from there
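
A rough sketch of reading one field from that response using only the standard library. The element name "Name" and the sample document are hypothetical, since the real response layout is not shown in this thread; find_text and fetch_header are illustrative helpers, not existing code:

```python
import urllib.request
import xml.etree.ElementTree as ET

API_URL = (
    "https://knesset.gov.il/WebSiteApi/knessetapi/MKs/"
    "GetMkdetailsHeader?mkId={mk_id}&languageKey=en"
)


def local_name(tag):
    """Strip an XML namespace, e.g. '{http://...}Name' -> 'Name'."""
    return tag.rsplit("}", 1)[-1]


def find_text(xml_data, field):
    """Return the text of the first element whose local name is `field`, or None."""
    root = ET.fromstring(xml_data)
    for element in root.iter():
        if local_name(element.tag) == field:
            return (element.text or "").strip() or None
    return None


def fetch_header(mk_id, timeout=30):
    """Download the raw XML bytes for one member (performs a network request)."""
    url = API_URL.format(mk_id=mk_id)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


# Hypothetical sample document; the real element names need to be checked.
SAMPLE = (
    '<MkDetails xmlns="http://example.invalid/ns">'
    "<Name>Abdullah Abu Maaruf</Name>"
    "</MkDetails>"
)
print(find_text(SAMPLE, "Name"))
```

Matching on the local name keeps the lookup working whether or not the response declares an XML namespace.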

@mattip

mattip commented Feb 20, 2024

Perfect, thanks. It even includes a URL for an image, which could be fed into the DB for display. For instance, https://oknesset.org/members/knesset-25.html does not have images. But that is a separate topic.

@OriHoch
Contributor Author

OriHoch commented Feb 20, 2024

we had some copyright problems with the images, so we don't display them

@mattip

mattip commented Feb 22, 2024

I have made some progress, the heart is in my fork here. It can be used as

knesset-data-pipelines members-eng

The URL fetch fails after 100 requests. Something is off with the timeout backoff? It seems to take a minute or two to reset.

@OriHoch
Contributor Author

OriHoch commented Feb 25, 2024

not sure what you mean by timeout backoff, we don't have such an option
you should add a sleep between iterations and set to higher seconds_between_retries

anyway, our servers are whitelisted on gov security so we usually don't get blocked

@mattip

mattip commented Feb 25, 2024

anyway, our servers are whitelisted on gov security so we usually don't get blocked

Ahh, so maybe it is only a problem running locally.

you should add a sleep between iterations and set to higher seconds_between_retries

I will add a command line option --slow to do this.
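
Roughly what the slow mode could do, as a sketch only: pause between requests and wait longer before retrying a failed one. fetch_all and its parameters are hypothetical names for illustration (seconds_between_retries here is just a local argument, not the repo's own setting), and the fetch and sleep functions are passed in so the loop can be tried without network access:

```python
import time


def fetch_all(ids, fetch, delay=1.0, retries=3,
              seconds_between_retries=60.0, sleep=time.sleep):
    """Call fetch(id) for every id, pausing between calls and retrying failures.

    Returns a dict of id -> result. Ids that still fail after all retries
    are left out of the result.
    """
    results = {}
    for item_id in ids:
        for attempt in range(retries + 1):
            try:
                results[item_id] = fetch(item_id)
                break
            except OSError:
                # Network errors from urllib and requests derive from OSError.
                if attempt < retries:
                    sleep(seconds_between_retries)
        # Pause between members so the remote server is not hit in a burst.
        sleep(delay)
    return results
```

A --slow flag could then simply choose larger values for delay and seconds_between_retries when running locally.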

@mattip

mattip commented Mar 20, 2024

The --slow option works: running knesset-data-pipelines members-eng --slow added a member_english_names table, with columns NameEng and mk_individual_id, to the local DB and saved it to a CSV file

member_english_names.csv

Still TODO: I hardcode the IDs here

def get_members_id():
    """Return an iterable of all valid mk_individual_id
    """
    return range(1, 1000)

What would be a better way to get the actual list of mk_individual_id from the DB? I couldn't find one that has the mapping in the CSV file

@OriHoch
Contributor Author

OriHoch commented Mar 20, 2024

you can use our API
to get all mk_individual_ids you need to make 2 calls to https://backend.oknesset.org/docs#/user%20friendly/get_friendly_members_list_members_get

one with is_current=false and one with is_current=true

@bobiboMC FYI
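
The two calls could be combined along these lines. This is a sketch under assumptions: fetch_members is a hypothetical caller-supplied function that performs the actual API request for a given is_current value, and each returned item is assumed to be a dict with an mk_individual_id key:

```python
def collect_member_ids(fetch_members):
    """Return the sorted, de-duplicated mk_individual_id values of both queries.

    fetch_members(is_current) is a hypothetical callable that performs the
    API request and returns a list of dicts, one per member.
    """
    ids = set()
    for is_current in (False, True):
        for member in fetch_members(is_current):
            # Assumed field name; adjust to the actual response schema.
            ids.add(member["mk_individual_id"])
    return sorted(ids)
```

Using a set means a member who appears in both responses is only scraped once.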

@mattip

mattip commented Mar 21, 2024

Cool. It seems to work. Not all the members have English names, for instance משה צ'יקו אדרי comes up in the query for is_current=false, with mk_individual_id=30869, but he doesn't have an English page.

It is a bit strange that the query with is_current=true comes up with 143 items ...

@mattip

mattip commented Mar 21, 2024

I suggest as first step to add a new pipeline for this scraping which adds a table/package with id and english name per mk.

See PR #352, knesset-data-pipelines members-eng --slow works for me locally (the actual workflow should not need the slow argument).

Then we could combine this table into the mk_individual package and add a new english_full_name field.

Should this be a separate step or an additional click task?

@bobiboMC

Cool. It seems to work. Not all the members have English names, for instance משה צ'יקו אדרי comes up in the query for is_current=false, with mk_individual_id=30869, but he doesn't have an English page.

It is a bit strange that the query with is_current=true comes up with 143 items ...

It contains people who serve in the Knesset in other positions, in addition to Knesset members. For example, משה אדרי currently serves as Director General of the Knesset (מנכ"ל הכנסת).
