# Guideline for Evaluating HelloBench on Diverse LLMs

HelloBench is a comprehensive, in-the-wild, and open-ended benchmark for evaluating LLMs' performance in long text generation. More details can be found in the [🌐Github Repo](https://github.com/Quehry/HelloBench) and [📖Paper](https://arxiv.org/abs/2409.16191).

## Detailed instructions for evaluating HelloBench in OpenCompass

1. Git clone OpenCompass

   ```shell
   cd ~
   git clone git@github.com:open-compass/opencompass.git
   cd opencompass
   ```

2. Download the HelloBench data from the [Google Drive Url](https://drive.google.com/file/d/1EJTmMFgCs2pDy9l0wB5idvp3XzjYEsi9/view?usp=sharing), unzip it, and place it under OPENCOMPASS_PATH/data/HelloBench, so that the directory layout looks like this:

   ```
   ~/opencompass/data/
   └── HelloBench
       ├── chat.jsonl
       ├── heuristic_text_generation.jsonl
       ├── length_constrained_data
       │   ├── heuristic_text_generation_16k.jsonl
       │   ├── heuristic_text_generation_2k.jsonl
       │   ├── heuristic_text_generation_4k.jsonl
       │   └── heuristic_text_generation_8k.jsonl
       ├── open_ended_qa.jsonl
       ├── summarization.jsonl
       └── text_completion.jsonl
   ```

3. Set up OpenCompass

   ```
   cd ~/opencompass
   pip install -e .
   ```

4. Configure your launch in configs/eval_hellobench.py (a sketch of these edits is shown at the end of this guide):

   - set the models to be evaluated
   - set the judge model (we recommend using gpt-4o-mini)

5. Launch it!

   ```
   python run.py configs/eval_hellobench.py
   ```

6. After that, you can find the results in outputs/hellobench/xxx/summary
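
For reference, below is a minimal sketch of the two edits described in step 4. The class names, fields, and values are illustrative assumptions based on common OpenCompass config patterns (e.g., `HuggingFacewithChatTemplate` for the evaluated model and `OpenAI` for the judge); they are not the exact contents of the shipped configs/eval_hellobench.py, so adapt them to the structure you find in that file.

```python
# Sketch only: adjust to match the shipped configs/eval_hellobench.py.
from opencompass.models import HuggingFacewithChatTemplate, OpenAI

# Models to be evaluated (example: a local HuggingFace chat model).
models = [
    dict(
        type=HuggingFacewithChatTemplate,
        abbr='my-chat-model',          # label shown in the output summary
        path='path/to/your/model',     # HF repo id or local checkpoint (placeholder)
        max_out_len=16384,             # HelloBench targets long generations
        batch_size=1,
        run_cfg=dict(num_gpus=1),
    ),
]

# Judge model used for LLM-as-a-judge scoring (gpt-4o-mini recommended).
# Whether the config expects a `judge_models` list or a judge entry inside the
# evaluator may differ across OpenCompass versions; check the shipped file.
judge_models = [
    dict(
        type=OpenAI,
        abbr='gpt-4o-mini',
        path='gpt-4o-mini',
        key='ENV',                     # read the API key from OPENAI_API_KEY
        max_out_len=2048,
        batch_size=8,
    ),
]
```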