HOWTO: Tag a new version of the corpus and create a speech corpus #15

Open
10 tasks
roger-mahler opened this issue Sep 26, 2022 · 0 comments
roger-mahler commented Sep 26, 2022

Update riksprot tagger system

If the pyriksprot_tagger repository folder already exists:

% cd "pyriksprot-tagger-folder"
% git pull

If the repository folder doesn't exist:

% cd "some-folder"
% git clone git@github.com:welfare-state-analytics/pyriksprot_tagger.git
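
The two cases above can be folded into one helper. This is only a sketch: the function name `sync_tagger_repo` and the `GIT` override (handy for a dry run with `GIT=echo`) are hypothetical, and the SSH form of the repository address on github.com is assumed.

```shell
# Clone the tagger repository if the target folder has no checkout yet,
# otherwise pull the latest changes into the existing checkout.
# NOTE: function name and GIT override are illustrative, not part of pyriksprot_tagger.
sync_tagger_repo() {
    target_dir="$1"
    repo_url="git@github.com:welfare-state-analytics/pyriksprot_tagger.git"  # assumed SSH URL
    git_cmd="${GIT:-git}"   # set GIT=echo to print the command instead of running it

    if [ -d "$target_dir/.git" ]; then
        "$git_cmd" -C "$target_dir" pull
    else
        "$git_cmd" clone "$repo_url" "$target_dir"
    fi
}
```

For example, `GIT=echo sync_tagger_repo "some-folder/pyriksprot_tagger"` shows which of the two commands would run without touching the network.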

Update configuration

Update the configuration settings in "pyriksprot-tagger-folder"/.env:

Environment variable      Description
RIKSPROT_DATA_FOLDER      Parent folder (location) of the Riksdagens corpus data folder
RIKSPROT_REPOSITORY_URL   https://github.com/welfare-state-analytics/riksdagen-corpus.git
RIKSPROT_REPOSITORY_TAG   Target corpus version. Must be a valid GitHub tag
SPARV_DATADIR             Sparv data folder
STANZA_DATADIR            Stanza data folder
OMP_NUM_THREADS           Number of threads to use

Example:

RIKSPROT_DATA_FOLDER="/data/riksdagen_corpus_data"
RIKSPROT_REPOSITORY_URL="https://github.com/welfare-state-analytics/riksdagen-corpus.git"
RIKSPROT_REPOSITORY_TAG="v0.4.5"
SPARV_DATADIR="/data/sparv"
STANZA_DATADIR="/data/sparv/models/stanza"
OMP_NUM_THREADS=10
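
Before starting a long run it can help to confirm that every variable is actually set. The following is a minimal sketch; `check_env_file` and `REQUIRED_VARS` are hypothetical names, and the approach assumes the .env file is shell-compatible (plain `NAME=value` lines, as in the example above).

```shell
# Variables the tagger configuration is expected to define (taken from the list above).
REQUIRED_VARS="RIKSPROT_DATA_FOLDER RIKSPROT_REPOSITORY_URL RIKSPROT_REPOSITORY_TAG SPARV_DATADIR STANZA_DATADIR OMP_NUM_THREADS"

# Report every required variable that is missing or empty in the given .env file.
# Returns 0 if all are set, 1 if any is missing, 2 if the file cannot be read.
check_env_file() {
    (
        # Run in a subshell so the caller's environment stays untouched,
        # and drop inherited values so only the file's content is checked.
        unset $REQUIRED_VARS
        . "$1" || exit 2
        status=0
        for name in $REQUIRED_VARS; do
            eval "value=\${$name:-}"
            if [ -z "$value" ]; then
                echo "missing or empty: $name"
                status=1
            fi
        done
        exit "$status"
    )
}
```

Usage: `check_env_file ./.env && echo "configuration looks complete"`. Pass a path containing a slash so that the shell does not search `PATH` for the file.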

Create or update Riksdagens Corpus data repository

% cd "pyriksprot-tagger-folder"
# If you want to create a new clone of the repository:
% make full-clone-repository
# If you want to update existing repository: 
% make full-pull-repository
# If you want to save space and do a shallow clone:
% make shallow-update-repository
# Update timestamp of repository work folder files to match last commit timestamp (important!):
% make update-repository-timestamps
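
Git does not store file modification times, so a fresh clone or pull leaves files stamped with the checkout time. The sketch below illustrates what a timestamp-update step of this kind typically does; it is not taken from the project's Makefile, whose `update-repository-timestamps` target may work differently. The function name is hypothetical, and GNU `touch` is assumed.

```shell
# Set the mtime of every tracked file to the committer date of the last
# commit that touched it.
# ASSUMPTIONS: GNU touch (for the "@<epoch>" syntax), file names without
# newlines. Illustrative only; one `git log` call per file is slow on large repos.
restore_commit_mtimes() {
    repo_dir="$1"
    git -C "$repo_dir" -c core.quotePath=off ls-files |
    while IFS= read -r path; do
        epoch=$(git -C "$repo_dir" log -1 --format=%ct -- "$path")
        if [ -n "$epoch" ]; then
            touch -d "@$epoch" "$repo_dir/$path"
        fi
    done
}
```

Usage: `restore_commit_mtimes "$RIKSPROT_DATA_FOLDER/riksdagen-corpus"`, where the subfolder name is an assumption about where the clone lives.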

Update / tag a new version of RIKSPROT:

Prerequisites:

  • Pull the latest version of welfare-state-analytics/pyriksprot_tagger
  • Update configuration (see above)

If you want to use snakemake:

  • Edit options (target name) in workflow/config/config.yml
  • Run make annotate (approx. 10 hours run time)

If you want to use the tag-it script (preferred, faster):

  • Run PYTHONPATH=. nohup ./tag-it.sh > tag-it.version.log &
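
The background launch above can be wrapped so that the log file name carries the corpus tag and the process ID is kept for later checks. This is a sketch under assumptions: the helper name is hypothetical, naming the log after the tag is a guess at what "version" in the log name stands for, and the explicit `2>&1` is added here so that errors land in the same log.

```shell
# Start a long-running command in the background, immune to hangups,
# logging stdout and stderr to tag-it.<tag>.log in the current folder.
# Prints the PID of the background job.
# NOTE: helper name and log naming scheme are illustrative assumptions.
run_tagging_in_background() {
    tag="$1"
    shift
    log_file="tag-it.${tag}.log"
    PYTHONPATH=. nohup "$@" > "$log_file" 2>&1 &
    echo "$!"
}
```

Usage: `run_tagging_in_background v0.4.5 ./tag-it.sh`, then follow progress with `tail -f tag-it.v0.4.5.log`.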

Create metadata database:

  • Pull or clone the latest version of welfare-state-analytics/pyriksprot
  • Update the configuration in pyriksprot/.env (specify the tag to use)
  • Run make metadata

Create speech corpus

  • Pull or clone the latest version of welfare-state-analytics/pyriksprot
  • Update the configuration in pyriksprot/.env (specify the tag to use)
  • Run make extract-speeches-to-feather
@roger-mahler roger-mahler self-assigned this Sep 26, 2022
@roger-mahler roger-mahler pinned this issue Sep 26, 2022