# Logical and Abstract Reasoning

A repository for evaluating Large Language Models on logical and abstract reasoning tasks.

## Installation

To clone the repository, use the following command:

```bash
git clone https://github.com/Strong-AI-Lab/Logical-and-abstract-reasoning.git
```

To install the dependencies in a virtual environment, use the following:

```bash
cd Logical-and-abstract-reasoning
python -m venv env/
source env/bin/activate
pip install -r requirements.txt
```

You may need to install transformers directly from its GitHub repository:

```bash
pip install git+https://github.com/huggingface/transformers
```

## Use

### Evaluation

To evaluate a model in the repository, use the following command:

```bash
python run_evaluation.py config/model/<model_config.yaml> config/data/<data_config.yaml> --<kwarg_name> <kwarg>
```

You can choose the model to evaluate by changing the `<model_config.yaml>` file, and the dataset to evaluate it on by changing the `<data_config.yaml>` file. Any additional arguments can be passed as keyword arguments (e.g. a private API key for GPT models), as in the sketch below.
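As a minimal sketch, a run might look like the following. The config file names `gpt-4.yaml` and `reclor.yaml` and the `--api_key` argument are illustrative assumptions; check `config/model/` and `config/data/` for the actual file and argument names.

```bash
# Hypothetical invocation: evaluate a GPT-4 config on a ReClor config.
# The file names and the --api_key flag are assumptions, not verified names.
python run_evaluation.py config/model/gpt-4.yaml config/data/reclor.yaml --api_key <YOUR_API_KEY>
```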

By default, all results are saved to a CSV file in the `logs/` folder. You can re-compute the metrics of an evaluation run from this file by running the following:

```bash
python src/evaluate/evaluator.py logs/<results_file.csv>
```
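For instance, assuming a previous run produced `logs/gpt-4_reclor.csv` (an illustrative name; the evaluation script determines the actual file name):

```bash
# Hypothetical results file; substitute the CSV produced by your own run.
python src/evaluate/evaluator.py logs/gpt-4_reclor.csv
```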

### Fine-tuning

To fine-tune a model on a given dataset, run the following:

```bash
python run_finetuning.py config/model/<model_config.yaml> config/data/<data_config.yaml> config/trainer/<trainer_config.yaml>
```

The configuration files work the same way as for evaluation. The `<model_config.yaml>` file contains additional configuration for training. Logs are saved in `fine-tuning-output/` and model weights in `fine-tuning-saves/`. A sketch of a full invocation is shown after the note below.

Currently, only HuggingFace models can be fine-tuned.
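As a minimal sketch, assuming config files named `llama.yaml`, `logiqa.yaml`, and `default.yaml` exist (illustrative names, not verified contents of the `config/` subfolders):

```bash
# Hypothetical invocation: fine-tune a HuggingFace LLaMA config on a LogiQA config.
# All three config file names are assumptions; check the config/ subfolders.
python run_finetuning.py config/model/llama.yaml config/data/logiqa.yaml config/trainer/default.yaml
```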

### LLaMA-based model instruction fine-tuning

We use the Stanford Alpaca training script for LLaMA-based model fine-tuning. If you want to perform instruction fine-tuning on a LLaMA-based model, you can follow this link.

## Models

| Inference Type | Model | Size | Task | Link | Remark |
| --- | --- | --- | --- | --- | --- |
| Logical Reasoning on Reading Comprehension | MERIt | - | Reading Comprehension | paper, project | #3 on the ReClor leaderboard |
| | LReasoner | - | Reading Comprehension | paper, project | #6 on the ReClor leaderboard |
| | AMR-LE | - | Reading Comprehension | project | #2 and #5 on the ReClor leaderboard |
| | LLaMA | - | Reading Comprehension | paper, code | Open-source very large language model |
| | LLaMA2 | - | Reading Comprehension | paper, code | Open-source very large language model |
| | TinyLLaMA | - | Reading Comprehension | paper, code | Open-source very large language model |
| | Alpaca | - | Reading Comprehension | code | Fine-tuned LLaMA |
| | Vicuna | - | Reading Comprehension | project, code | Fine-tuned LLaMA |
| | ChatGPT | - | Reading Comprehension | paper, project | Uses the API for prompt tuning |
| | GPT-4 | - | Reading Comprehension | paper, project | Uses the API for prompt tuning |
| | Zephyr-7b-beta | - | Reading Comprehension | code | Fine-tuned Mistral-7b |

## Datasets & Benchmarks

| Inference Type | Dataset | Size | Task | Link | Remark |
| --- | --- | --- | --- | --- | --- |
| Logical Reasoning on Reading Comprehension | ReClor | - | Reading Comprehension | paper, project | Logical reasoning reading comprehension |
| | LogiQA | - | Reading Comprehension | paper, project | Logical reasoning reading comprehension |
| | LogiQA V2 | - | Reading Comprehension | project | Logical reasoning reading comprehension |
| | LogiQA Logical Reasoning Plus | - | Reading Comprehension | project | Logical reasoning reading comprehension for out-of-distribution evaluation |
| Abstract Reasoning | ARC | - | Abstract Reasoning | paper, code | Text version of a Visual Abstract Reasoning task |
| | ACRE | - | Abstract Reasoning | paper, code | Text version of a Visual Abstract Reasoning task |
| | PVR | - | Abstract Reasoning | paper | Abstract Reasoning task |
| | RAVEN | - | Abstract Reasoning | paper, project | Text version of a Visual Abstract Reasoning task |
| | Diagrammatic Logic | - | Abstract Reasoning | code | Extracted from OpenAI Evals |
| | Logic | - | Abstract Reasoning | code | Extracted from OpenAI Evals |
| | Logic Statements | - | Abstract Reasoning | code | Extracted from OpenAI Evals |
| | Pattern Identification | - | Abstract Reasoning | code | Extracted from OpenAI Evals |
| | String Patterns | - | Abstract Reasoning | code | Extracted from OpenAI Evals |
| | List Functions | - | Abstract Reasoning | code | Extracted from Google BIG-bench |

## Acknowledgement

Our proposed new dataset, `logiqa-logical-reasoning-plus`, has been merged into OpenAI/Evals.