Guideline for evaluating HelloBench on Diverse LLMs
HelloBench is a comprehensive, in-the-wild, and open-ended benchmark for evaluating LLMs' long-text generation capabilities. More details can be found in the 🌐Github Repo and 📖Paper.
Detailed instructions for evaluating HelloBench in OpenCompass
- Git clone OpenCompass
```bash
cd ~
git clone git@github.com:open-compass/opencompass.git
cd opencompass
```
- Download the HelloBench data from the Google Drive Url, unzip it, and place it under OPENCOMPASS_PATH/data/HelloBench, so that the directory structure looks like this:
```
~/opencompass/data/
└── HelloBench
    ├── chat.jsonl
    ├── heuristic_text_generation.jsonl
    ├── length_constrained_data
    │   ├── heuristic_text_generation_16k.jsonl
    │   ├── heuristic_text_generation_2k.jsonl
    │   ├── heuristic_text_generation_4k.jsonl
    │   └── heuristic_text_generation_8k.jsonl
    ├── open_ended_qa.jsonl
    ├── summarization.jsonl
    └── text_completion.jsonl
```
- Set up OpenCompass
```bash
cd ~/opencompass
pip install -e .
```
- Configure your launch in configs/eval_hellobench.py:
  - set the models to be evaluated
  - set the judge model (we recommend gpt-4o-mini); see the sketch below
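As a rough illustration, the two pieces you typically edit look something like the following. This is a minimal sketch, not the actual contents of configs/eval_hellobench.py: the exact variable names (e.g. `judge_models`), surrounding boilerplate, model path, and API key are placeholders or assumptions and should be adapted to the real config file.

```python
# Illustrative sketch only -- adapt to the actual structure of configs/eval_hellobench.py.
from opencompass.models import HuggingFaceCausalLM, OpenAI

# Models to be evaluated (example: a local HuggingFace model; the path is a placeholder).
models = [
    dict(
        type=HuggingFaceCausalLM,
        abbr='my-chat-model',
        path='path/to/your/model',
        max_out_len=16384,   # HelloBench targets long generations
        batch_size=1,
        run_cfg=dict(num_gpus=1),
    ),
]

# Judge model (we recommend gpt-4o-mini); the variable name may differ in the actual config.
judge_models = [
    dict(
        type=OpenAI,
        abbr='gpt-4o-mini',
        path='gpt-4o-mini',
        key='YOUR_OPENAI_API_KEY',  # placeholder; or read the key from your environment
        max_out_len=4096,
        batch_size=8,
    ),
]
```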
- Launch it!
```bash
python run.py configs/eval_hellobench.py
```
- After that, you can find the results in outputs/hellobench/xxx/summary
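If you want to locate the most recent summary programmatically, a minimal sketch (assuming the default outputs/hellobench/<timestamp>/summary layout mentioned above):

```python
from pathlib import Path

# Find and print the most recent summary file produced by the run above.
summaries = sorted(Path('outputs/hellobench').glob('*/summary/*'),
                   key=lambda p: p.stat().st_mtime)
if summaries:
    print(f'Latest summary: {summaries[-1]}')
    print(summaries[-1].read_text())
else:
    print('No summary files found yet.')
```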