This repository contains the code for our arXiv paper:

**DOCE: Finding the Sweet Spot for Execution-Based Code Generation**

Haau-Sing Li, Patrick Fernandes, Iryna Gurevych, André F. T. Martins

Contact person: Haau-Sing Li
- Installing packages from `requirements*.txt` (see the example below).
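For reference, a minimal install step, assuming a standard `pip` environment (the glob simply picks up every `requirements*.txt` file in the repository root):

```bash
# Install every requirements*.txt file in one pass (pip assumed)
for f in requirements*.txt; do
    pip install -r "$f"
done
```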
- Inference on the HumanEval/MBPP task:

```bash
python3 codegen/generate.py \
    --model ${model} \
    --bs ${batch_size} \
    --temperature ${temperature} \
    --n_samples ${num_of_samples_for_reranking} \
    --dataset ${humaneval/mbpp} \
    --resume \
    --root ${path_to_store_output}
```
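For example, a concrete invocation might look as follows; the model identifier, batch size, temperature, sample count, and output path are illustrative placeholders, not values prescribed by the paper:

```bash
# Illustrative values only: pick a model key supported by codegen/generate.py
python3 codegen/generate.py \
    --model codellama-7b-instruct \
    --bs 16 \
    --temperature 0.8 \
    --n_samples 40 \
    --dataset humaneval \
    --resume \
    --root ./results
```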
- Evaluation:

```bash
evalplus.evaluate \
    --dataset ${humaneval/mbpp} \
    --samples ${path_to_generated_samples} \
    --parallel 30 \
    --test-details
```
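For example (the `--samples` path is illustrative and should point to wherever the generation step wrote its outputs):

```bash
# Illustrative: --samples points at the samples produced by the generation step
evalplus.evaluate \
    --dataset humaneval \
    --samples ./results/humaneval/codellama-7b-instruct_temp_0.8 \
    --parallel 30 \
    --test-details
```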
- Get execution outputs of generated samples (for MBR-Exec):

```bash
python3 evalplus/gen_outputs.py \
    --gen_dir ${model_name_plus_temperature} \
    --dataset ${humaneval/mbpp} \
    --gen_fast
```
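For instance (the `--gen_dir` value is illustrative; it should match the model-name-plus-temperature directory created during generation):

```bash
# Illustrative gen_dir value; use your own model-plus-temperature directory name
python3 evalplus/gen_outputs.py \
    --gen_dir codellama-7b-instruct_temp_0.8 \
    --dataset humaneval \
    --gen_fast
```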
- Self-Debugging. You should get execution feedback first:

```bash
python3 evalplus/error_feedback.py \
    --gen_dir ${model_name_plus_temperature} \
    --dataset ${humaneval/mbpp}
```
Then we can do self-debugging (see the looped example below):

```bash
python3 codegen/ape_sd_ut.py \
    --model ${model} \
    --bs ${batch_size} \
    --temperature ${temperature} \
    --n_samples ${num_of_samples_for_reranking} \
    --dataset ${humaneval/mbpp} \
    --resume \
    --root ${path_to_store_output} \
    --debugging_turn ${ith_debugging_turn}
```
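As referenced above, the two steps can be chained over multiple debugging turns. The sketch below is only an assumption about the workflow (regenerating execution feedback before every turn); the model, directory name, sampling settings, and number of turns are illustrative:

```bash
# Hedged sketch: alternate execution feedback and self-debugging for a few turns.
# gen_dir, model, sampling settings, and the turn count are illustrative values.
gen_dir=codellama-7b-instruct_temp_0.8
for turn in 1 2 3; do
    # Collect execution feedback for the current set of candidates
    python3 evalplus/error_feedback.py \
        --gen_dir "$gen_dir" \
        --dataset humaneval
    # Run one self-debugging turn conditioned on that feedback
    python3 codegen/ape_sd_ut.py \
        --model codellama-7b-instruct \
        --bs 16 \
        --temperature 0.8 \
        --n_samples 40 \
        --dataset humaneval \
        --resume \
        --root ./results \
        --debugging_turn "$turn"
done
```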
- For MBR and N-Best Reranking, please refer to our notebooks for now. We will release our generated candidates soon, in case you want to save compute.
Our code is built upon EvalPlus.