Add MK names in English #220
Comments
The OData schema has a KNS_Person table, which does not include English names. The English member pages by ID, such as https://main.knesset.gov.il/en/MK/APPS/mk/mk-personal-details/909, have English names in the
As a first step, I suggest adding a new pipeline for this scraping, which produces a table/package with the ID and English name of each MK. We could then combine this table into the mk_individual package and add a new english_full_name field.
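The combining step could look roughly like the sketch below. It assumes the MK rows are plain dicts keyed by a field called `mk_individual_id` and that the scraped names arrive as an `{id: name}` mapping; both the field name and the helper `add_english_names` are illustrative assumptions, not taken from the existing package.

```python
def add_english_names(mk_rows, english_names, id_field="mk_individual_id"):
    """Return copies of mk_rows with an english_full_name field added.

    id_field is an assumed column name; adjust it to the real schema.
    Rows without a scraped English name get None instead of being dropped.
    """
    merged = []
    for row in mk_rows:
        new_row = dict(row)  # copy, so the input rows are left untouched
        new_row["english_full_name"] = english_names.get(row[id_field])
        merged.append(new_row)
    return merged


# Placeholder data for illustration only (not real MK records).
example_rows = [{"mk_individual_id": 1}, {"mk_individual_id": 2}]
example_names = {1: "Example Name"}
example_merged = add_english_names(example_rows, example_names)
```

Keeping rows that have no English name (with `None`) means the merged table has the same number of rows as the original one.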
Getting back to this.
Could you suggest an existing pipeline I could use as a model for this new one?
Follow the steps here to set up a local development environment: https://github.com/hasadna/knesset-data-pipelines/blob/master/airflow/README.md#local-development Then add a command that runs your pipeline to the knesset-data-pipelines CLI; you can add a sub-group (see https://github.com/hasadna/knesset-data-pipelines/blob/master/airflow/knesset_data_pipelines/cli.py). We don't have a lot of pipelines there yet, so feel free to decide how to implement the pipeline itself, but you can see an example of another pipeline here: https://github.com/hasadna/knesset-data-pipelines/blob/master/airflow/knesset_data_pipelines/committees/background_material_titles.py
When I try to execute
I get an error because the database is not populated.
Where in the docker-compose setup do the tables get initialized?
Each table is populated by its relevant pipeline, which you would need to run in this case. But I suggest just copying over the relevant data: I can send you read-only DB credentials privately, and you can copy this table over to your DB. Some pipelines depend on local files; in that case you can copy them from here: https://production.oknesset.org/pipelines/data/
I sent you the DB credentials on Slack.
Thanks!
It turns out the page I want to scrape uses JavaScript, so I could use something like requests-html, which does support JavaScript, at the cost of
Is there already something in this repo that renders JavaScript pages?
You can see in Chrome developer tools that it makes a request to this URL, which returns the data as XML: https://knesset.gov.il/WebSiteApi/knessetapi/MKs/GetMkdetailsHeader?mkId=909&languageKey=en So you can skip the HTML page and get the data from there.
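A minimal sketch of reading that endpoint directly, using only the standard library. The endpoint and its `mkId` and `languageKey` parameters come from the URL above; the element names inside the XML are not known here, so the parser simply collects every leaf element instead of assuming specific tags. The helper names `build_url`, `parse_leaf_fields` and `fetch_mk_header` are hypothetical.

```python
import urllib.request
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

BASE_URL = "https://knesset.gov.il/WebSiteApi/knessetapi/MKs/GetMkdetailsHeader"


def build_url(mk_id, language_key="en"):
    """Build the request URL for one MK and one language."""
    query = urlencode({"mkId": mk_id, "languageKey": language_key})
    return f"{BASE_URL}?{query}"


def parse_leaf_fields(xml_data):
    """Map each non-empty leaf element's tag (namespace stripped) to its text.

    If a tag occurs more than once, the last occurrence wins.
    """
    root = ET.fromstring(xml_data)
    fields = {}
    for element in root.iter():
        if len(element) == 0 and element.text and element.text.strip():
            tag = element.tag.rsplit("}", 1)[-1]  # drop "{namespace}" prefix
            fields[tag] = element.text.strip()
    return fields


def fetch_mk_header(mk_id, language_key="en", timeout=30):
    """Fetch and parse the header XML for one MK (performs a network request)."""
    with urllib.request.urlopen(build_url(mk_id, language_key), timeout=timeout) as response:
        return parse_leaf_fields(response.read())
```

Printing the result of `fetch_mk_header(909)` once would show which tag actually holds the English name, and the pipeline could then pick just that field.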
Perfect, thanks. It even includes a URL for an image, which could be fed into the DB for display. For instance, https://oknesset.org/members/knesset-25.html does not have images. But that is a separate topic.
We had some copyright problems with the images, so we don't display them.
I have made some progress; the core of it is in my fork here. It can be used as
The URL fetch fails after 100 requests. Is something off with the timeout backoff? It seems to take a minute or two to reset.
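For reference, a generic retry loop with exponential backoff might look like the sketch below. This is not the code from the fork; `fetch_with_backoff` and its parameters are hypothetical, and it assumes the fetch function raises `OSError` (which covers `urllib.error.URLError` and socket timeouts) when a request fails.

```python
import time


def fetch_with_backoff(fetch, url, max_attempts=5, initial_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on OSError with a doubling delay.

    The sleep function is injectable so the loop can be tested without waiting.
    The last failure is re-raised once max_attempts is reached.
    """
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except OSError:
            if attempt == max_attempts:
                raise
            sleep(delay)
            delay *= 2  # wait twice as long before the next attempt
```

Since the block seems to last a minute or two, a larger `initial_delay` (for example 30 seconds) may suit this case better than the one-second default.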
Not sure what you mean by that. Anyway, our servers are whitelisted by gov security, so we usually don't get blocked.
Ahh, so maybe it is only a problem when running locally.
I will add a command-line option
Still TODO: I hardcode the IDs here
What would be a better way to get the actual list of
You can use our API: one query with is_current=false and one with is_current=true. @bobiboMC FYI
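Combining the two query results into one list of IDs could be sketched as follows. The API URL and response shape are not given in this thread, so this assumes each query has already been decoded into a list of dicts with an ID field; `collect_mk_ids` and the `mk_individual_id` field name are assumptions for illustration.

```python
def collect_mk_ids(*result_sets, id_field="mk_individual_id"):
    """Return the sorted, de-duplicated IDs found across all result sets."""
    ids = set()
    for rows in result_sets:
        for row in rows:
            ids.add(row[id_field])
    return sorted(ids)


# Placeholder rows standing in for the is_current=true and is_current=false results.
current_rows = [{"mk_individual_id": 3}, {"mk_individual_id": 1}]
former_rows = [{"mk_individual_id": 2}, {"mk_individual_id": 3}]
all_ids = collect_mk_ids(current_rows, former_rows)
```

The resulting list could then replace the hardcoded IDs.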
Cool. It seems to work. Not all the members have English names; for instance, משה צ'יקו אדרי comes up in the query for It is a bit strange that the query with
See PR #352,
Should this be a separate step or an additional click task?
It contains people who serve in the Knesset in other positions, in addition to Knesset members. For example, משה אדרי currently serves as Director General of the Knesset (מנכ"ל הכנסת).
Currently, the main table with MK details, members_mk_individual, doesn't have any names in English. The relevant fields are there but empty. We should see if it's possible to get them from the Knesset API and, if not, scrape them from the HTML: https://knesset.gov.il/mk/eng/mkindex_current_eng.asp