
Add multiprocessing to Datahandler, unify Processors

@tanaysoni tanaysoni released this 19 Aug 08:57
· 538 commits to master since this release

Besides fixing various smaller bugs, this release focuses on two major changes:

1. Speeding things up 🚀 :

  • By adding multiprocessing to the data preprocessing, we reduced the execution time for many tasks from hours to minutes. Since the functionality is mostly hidden in the parent class, users don't have to implement anything themselves. However, this required a slight change to the processor interface: _dict_to_samples and _sample_to_features must now be classmethods, and all objects they access must be class attributes.
  • Multi-GPU support is now also available for the "building blocks mode"
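
To illustrate the interface change described above, here is a minimal, self-contained sketch of a processor whose per-record work runs in a worker pool. It is not the library's actual implementation: ToyProcessor, _featurize, dataset_from_dicts, and num_workers are hypothetical names chosen for this example, and only _dict_to_samples and _sample_to_features are taken from the notes. One plausible reason for the classmethod requirement is that a classmethod can be sent to worker processes by reference to its class, without pickling a processor instance.

```python
import multiprocessing as mp


class ToyProcessor:
    # Hypothetical stand-in for a processor; not the library's real class.
    # Everything the classmethods read lives on the class itself, so worker
    # processes can reach it without receiving a pickled instance.
    max_seq_len = 8
    label_list = ["OTHER", "OFFENSE"]

    @classmethod
    def _dict_to_samples(cls, record):
        # Turn one raw record into a list of samples (here: exactly one).
        tokens = record["text"].lower().split()[: cls.max_seq_len]
        return [{"tokens": tokens, "label": record["label"]}]

    @classmethod
    def _sample_to_features(cls, sample):
        # Map a sample to toy numeric features: token count and label id.
        return {
            "n_tokens": len(sample["tokens"]),
            "label_id": cls.label_list.index(sample["label"]),
        }

    @classmethod
    def _featurize(cls, record):
        # Unit of work executed once per record (hypothetical helper).
        return [cls._sample_to_features(s) for s in cls._dict_to_samples(record)]

    @classmethod
    def dataset_from_dicts(cls, records, num_workers=1):
        # Serial fallback for num_workers <= 1, worker pool otherwise.
        if num_workers <= 1:
            per_record = [cls._featurize(r) for r in records]
        else:
            with mp.Pool(processes=num_workers) as pool:
                per_record = pool.map(cls._featurize, records, chunksize=2)
        # Flatten the per-record feature lists, preserving input order.
        return [feats for chunk in per_record for feats in chunk]
```

When using more than one worker, the call would typically sit under an `if __name__ == "__main__":` guard so that worker processes can import the module safely on platforms that spawn rather than fork.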

2. Making the processor more user friendly 😊 :

  • Instead of having one individual processor per dataset, we have implemented a more generic TextClassificationProcessor that you can instantiate easily for various predefined tasks (GNAD, GermEval, ...) or for your own dataset in CSV/TSV format:
processor = TextClassificationProcessor(tokenizer=tokenizer,
                                        max_seq_len=128,
                                        data_dir="../data/germeval18",
                                        columns=["text", "label", "unused"],
                                        label_list=["OTHER", "OFFENSE"],
                                        metrics=["f1_macro"]
                                        ) 
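
For a custom dataset, the columns argument above suggests that each row of the file carries its fields in that order. The following sketch shows how such a tab-separated file could be read into records and checked against the label list. It only uses the standard library and makes no claim about how TextClassificationProcessor parses files internally; read_tsv is a hypothetical helper, and the headerless file layout is an assumption.

```python
import csv
import io

# Column order and labels mirror the arguments in the snippet above.
COLUMNS = ["text", "label", "unused"]
LABEL_LIST = ["OTHER", "OFFENSE"]


def read_tsv(handle, columns=COLUMNS, label_list=LABEL_LIST):
    # Hypothetical helper: assumes a headerless, tab-separated file whose
    # fields appear in the order given by `columns`.
    reader = csv.DictReader(
        handle, fieldnames=columns, delimiter="\t", quoting=csv.QUOTE_NONE
    )
    rows = []
    for line_no, row in enumerate(reader, start=1):
        if row["label"] not in label_list:
            raise ValueError(f"line {line_no}: unknown label {row['label']!r}")
        # Keep only the fields a classifier needs; drop the unused column.
        rows.append({"text": row["text"], "label": row["label"]})
    return rows
```

Rows with a label outside the list raise an error early, which can be easier to debug than a failure deep inside preprocessing.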

Thanks for contributing @brandenchan @tanaysoni @tholor @Timoeller @tripl3a @Seb0 @waldemarhahn !


Modeling:

  • [bug] Accuracy metric in LM finetuning always zero #30
  • [enhancement] Multi-GPU only enabled in experiment mode #57
  • [bug] Wrong number of total steps for linear warmup schedule #46

Data Handling:

  • [enhancement] Unify redundant Processor; add new NERProcessor and TextClassificationProcessor
  • [enhancement] Add parallel dataprocessing #45
  • [bug] dev_size param in run-by-config is being ignored #49
  • [bug] output_dir parameter in run by config is being ignored #39
  • [bug] Error when running by config with a list of batch sizes #38

Documentation:

  • [bug] LM finetuning example missing data #47
  • [bug] Colab Notebook referenced in readme does not work #27

Other:

  • [enhancement] Proposition: improve dependency management with pipenv #35