
APPS

Dataset Description

APPS is a code-generation benchmark with 10,000 problems. It can be used to evaluate the ability of language models to generate code from natural-language specifications.

Dataset Structure

DatasetDict({
    train: Dataset({
        features: ['problem_id', 'question', 'solutions', 'input_output', 'difficulty', 'url', 'starter_code'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['problem_id', 'question', 'solutions', 'input_output', 'difficulty', 'url', 'starter_code'],
        num_rows: 5000
    })
})
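The structure above can be reproduced by loading the dataset from the Hugging Face Hub; a minimal sketch (APPS ships a loading script, so recent versions of the datasets library may require trust_remote_code=True):

from datasets import load_dataset

# Load both splits and print the DatasetDict summary shown above.
ds = load_dataset("codeparrot/apps", trust_remote_code=True)
print(ds)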

We also offer an apps_mini subset of 1,500 questions drawn evenly from the introductory, interview, and competition difficulty categories, in a 1:1:1 ratio (500 questions each).
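A balanced subset of this kind can be reconstructed from the full dataset; a minimal sketch assuming uniform random sampling (the actual apps_mini selection procedure may differ):

import random
from datasets import load_dataset

random.seed(0)
test = load_dataset("codeparrot/apps", split="test", trust_remote_code=True)
mini = []
for level in ["introductory", "interview", "competition"]:
    # Collect all problems of this difficulty, then draw 500 of them.
    pool = [ex for ex in test if ex["difficulty"] == level]
    mini.extend(random.sample(pool, 500))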

How to Use

You can filter the dataset by difficulty level: introductory, interview, and competition. Pass a list of difficulty levels to the loading call. For example, to keep only the most challenging questions, select the competition level:

from datasets import load_dataset

ds = load_dataset("codeparrot/apps", split="train", difficulties=["competition"])
print(next(iter(ds))["question"])
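Note that the solutions and input_output fields are serialized as JSON strings and may be empty for some problems; a minimal decoding sketch:

import json

sample = next(iter(ds))
# Decode the JSON-encoded fields before use.
solutions = json.loads(sample["solutions"]) if sample["solutions"] else []
io_pairs = json.loads(sample["input_output"]) if sample["input_output"] else {}
print(len(solutions), list(io_pairs))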

Evaluation Results

| dataset   | metric | CodeLlama-7b-Python | internlm2-chat-1.8b-sft-hf | internlm2-chat-7b-sft-hf | internlm2-chat-20b-sft-hf |
| --------- | ------ | ------------------- | -------------------------- | ------------------------ | ------------------------- |
| apps_mini | pass@1 | 1.3                 | 0.7                        | 7.1                      | 9.3                       |
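Here pass@1 is the fraction of problems for which a generated solution passes all of that problem's test cases. When more than one sample is drawn per problem, the unbiased pass@k estimator from Chen et al. (2021) is commonly used; a minimal sketch:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # n: samples generated per problem; c: samples passing all tests.
    # Returns the unbiased estimate of pass@k (Chen et al., 2021).
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=3, k=1))  # 0.3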

Please refer to Table 3 of the Code Llama paper for the original results if needed.

Citation

@article{hendrycksapps2021,
  title={Measuring Coding Challenge Competence With APPS},
  author={Dan Hendrycks and Steven Basart and Saurav Kadavath and Mantas Mazeika and Akul Arora and Ethan Guo and Collin Burns and Samir Puranik and Horace He and Dawn Song and Jacob Steinhardt},
  journal={NeurIPS},
  year={2021}
}