[Doc] Update NeedleBench Docs (#1330)
* update needlebench docs
* update model_name_mapping dict
* update README
* Update README_zh-CN.md

Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com>
parent 0a1c89e618
commit 104bddf647
@@ -70,6 +70,7 @@ Just like a compass guides us on our journey, OpenCompass will guide you through
## 🚀 What's New <a><img width="35" height="20" src="https://user-images.githubusercontent.com/12782558/212848161-5e783dd6-11e8-4fe0-bbba-39ffb77730be.png"></a>
- **\[2024.07.17\]** We are excited to announce the release of NeedleBench's [technical report](http://arxiv.org/abs/2407.11963). We invite you to visit our [support documentation](https://opencompass.readthedocs.io/en/latest/advanced_guides/needleinahaystack_eval.html) for detailed evaluation guidelines. 🔥🔥🔥
- **\[2024.07.04\]** OpenCompass now supports InternLM2.5, which has **outstanding reasoning capability**, a **1M context window**, and **stronger tool use**; you can try the models in [OpenCompass Config](https://github.com/open-compass/opencompass/tree/main/configs/models/hf_internlm) and [InternLM](https://github.com/InternLM/InternLM). 🔥🔥🔥
- **\[2024.06.20\]** OpenCompass now supports one-click switching between inference acceleration backends, enhancing the efficiency of the evaluation process. In addition to the default HuggingFace inference backend, it now also supports the popular backends [LMDeploy](https://github.com/InternLM/lmdeploy) and [vLLM](https://github.com/vllm-project/vllm). This feature is available via a simple command-line switch and through deployment APIs. For detailed usage, see the [documentation](docs/en/advanced_guides/accelerator_intro.md). 🔥🔥🔥
- **\[2024.05.08\]** We supported the evaluation of 4 MoE models: [Mixtral-8x22B-v0.1](configs/models/mixtral/hf_mixtral_8x22b_v0_1.py), [Mixtral-8x22B-Instruct-v0.1](configs/models/mixtral/hf_mixtral_8x22b_instruct_v0_1.py), [Qwen1.5-MoE-A2.7B](configs/models/qwen/hf_qwen1_5_moe_a2_7b.py), [Qwen1.5-MoE-A2.7B-Chat](configs/models/qwen/hf_qwen1_5_moe_a2_7b_chat.py). Try them out now!
@@ -69,6 +69,7 @@
## 🚀 What's New <a><img width="35" height="20" src="https://user-images.githubusercontent.com/12782558/212848161-5e783dd6-11e8-4fe0-bbba-39ffb77730be.png"></a>
- **\[2024.07.17\]** We have officially released the NeedleBench [technical report](http://arxiv.org/abs/2407.11963). You are welcome to visit our [documentation](https://opencompass.readthedocs.io/zh-cn/latest/advanced_guides/needleinahaystack_eval.html) for evaluation guidelines. 🔥🔥🔥
- **\[2024.07.04\]** OpenCompass now supports InternLM2.5, which offers outstanding reasoning performance, effective support for million-character ultra-long context, and an overall upgrade in tool-calling capability. You are welcome to visit [OpenCompass Config](https://github.com/open-compass/opencompass/tree/main/configs/models/hf_internlm) and [InternLM](https://github.com/InternLM/InternLM). 🔥🔥🔥
- **\[2024.06.20\]** OpenCompass now supports one-click switching of inference acceleration backends, making the evaluation process more efficient. In addition to the default HuggingFace inference backend, it also supports the popular [LMDeploy](https://github.com/InternLM/lmdeploy) and [vLLM](https://github.com/vllm-project/vllm), available both via a one-click command-line switch and via deployed API acceleration services. See the [documentation](docs/zh_cn/advanced_guides/accelerator_intro.md) for detailed usage. Feel free to try it out! 🔥🔥🔥
@@ -6,7 +6,7 @@ The Needle In A Haystack test (inspired by [NeedleInAHaystack](https://github.co
## Task Overview
Within the `NeedleBench` framework of `OpenCompass`, we have designed a series of increasingly challenging test scenarios to comprehensively evaluate the models' abilities in long text information extraction and reasoning:
Within the `NeedleBench` framework of `OpenCompass`, we have designed a series of increasingly challenging test scenarios to comprehensively evaluate the models' abilities in long text information extraction and reasoning. For a complete introduction, refer to our [technical report](https://arxiv.org/abs/2407.11963):
- **Single-Needle Retrieval Task (S-RT)**: Assesses an LLM's ability to extract a single key piece of information from a long text, testing its precision in recalling specific details within broad narratives. This corresponds to the **original Needle In A Haystack test** setup.
@@ -77,11 +77,11 @@ python run.py --dataset needlebench_4k --models lmdeploy_internlm2_chat_7b --su
##### Evaluation on a Slurm Cluster
If using `Slurm`, you can add parameters such as `--slurm -p partition_name -q reserved --max-num-workers 32 --max-partition-size 8000`, as shown below:
If using `Slurm`, you can add parameters such as `--slurm -p partition_name -q reserved --max-num-workers 16`, as shown below:
```bash
# Slurm Evaluation
python run.py --dataset needlebench_4k --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 32 --max-partition-size 8000
python run.py --dataset needlebench_4k --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
```
##### Evaluating a Subdataset Only
@@ -89,13 +89,13 @@ python run.py --dataset needlebench_4k --models lmdeploy_internlm2_chat_7b --su
If you only want to test the original NeedleInAHaystack task setup, you could change the dataset parameter to `needlebench_single_4k`, which corresponds to the single needle version of the NeedleInAHaystack test at 4k length:
```bash
python run.py --dataset needlebench_single_4k --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 32 --max-partition-size 8000
python run.py --dataset needlebench_single_4k --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
```
You can also choose to evaluate a specific subdataset, for example by changing the `--datasets` parameter to `needlebench_single_4k/needlebench_zh_datasets` to test only the Chinese version of the single-needle 4k NeedleInAHaystack task. The parameter after `/` represents the subdataset, which can be found among the dataset variables of `configs/datasets/needlebench/needlebench_4k/needlebench_single_4k.py` (an illustrative sketch follows the command below):
```bash
python run.py --dataset needlebench_single_4k/needlebench_zh_datasets --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 32 --max-partition-size 8000
python run.py --dataset needlebench_single_4k/needlebench_zh_datasets --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
```
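For orientation, below is a minimal sketch of how such subdataset variables are typically laid out; it is illustrative only and not the actual contents of `needlebench_single_4k.py` (the `needlebench_en_datasets` name and the empty lists are placeholders).

```python
# Illustrative sketch only -- the real file at
# configs/datasets/needlebench/needlebench_4k/needlebench_single_4k.py defines its own
# dataset configs. The point is that the name after "/" on the command line refers to a
# dataset-list variable defined at module level in that file.
needlebench_zh_datasets = []  # Chinese single-needle 4k dataset configs (placeholder)
needlebench_en_datasets = []  # English single-needle 4k dataset configs (placeholder, assumed name)

# Evaluation configs elsewhere in this guide gather every such list into one, e.g.:
datasets = needlebench_zh_datasets + needlebench_en_datasets
```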
Be sure to install the [LMDeploy](https://github.com/InternLM/lmdeploy) tool before starting the evaluation:
@@ -104,7 +104,7 @@ Be sure to install the [LMDeploy](https://github.com/InternLM/lmdeploy) tool bef
```bash
pip install lmdeploy
```
This command initiates the evaluation process, with parameters `-p partition_name -q auto` and `--max-num-workers 32` used to specify the Slurm partition name and the maximum number of worker processes.
This command initiates the evaluation process, with parameters `-p partition_name -q auto` and `--max-num-workers 16` used to specify the Slurm partition name and the maximum number of worker processes.
#### Evaluating Other `Huggingface` Models
@@ -112,8 +112,10 @@ For other models, we recommend writing an additional configuration file to modif
```python
from mmengine.config import read_base
# We use mmengine.config to import variables from other configuration files
with read_base():
    from .models.hf_internlm.lmdeploy_internlm2_chat_7b import models as internlm2_chat_7b_200k
    # from .models.hf_internlm.lmdeploy_internlm2_chat_7b import models as internlm2_chat_7b_200k
    from .models.hf_internlm.hf_internlm2_chat_7b import models as internlm2_chat_7b
    # Evaluate needlebench_4k, adjust the configuration to use 8k, 32k, 128k, 200k, or 1000k if necessary.
@@ -131,7 +133,7 @@ with read_base():
datasets = sum([v for k, v in locals().items() if ('datasets' in k)], [])
for m in internlm2_chat_7b:
    m['max_seq_len'] = 32768  # Ensure the InternLM2-7B model can receive the complete long text; adjust for other models according to their maximum supported sequence length.
    m['max_seq_len'] = 30768  # Ensure the InternLM2-7B model can receive the complete long text while leaving room for the 2000-token max_out_len below; adjust for other models according to their maximum supported sequence length.
    m['max_out_len'] = 2000  # Ensure the model's complete response can be received in the multi-needle recall task
models = internlm2_chat_7b
@@ -142,10 +144,10 @@ work_dir = './outputs/needlebench'
Once the test `config` file is written, we can pass the corresponding config file path to `run.py` on the command line, for example:
```bash
python run.py configs/eval_needlebench.py --slurm -p partition_name -q reserved --max-num-workers 128 --max-partition-size 8000
python run.py configs/eval_needlebench.py --slurm -p partition_name -q reserved --max-num-workers 16
```
Note, at this point, we do not need to pass in the `--dataset, --models, --summarizer` parameters, as we have already defined these configurations in the config file. You can manually adjust the `--max-partition-size` setting to achieve the best task slicing strategy to improve evaluation efficiency.
Note, at this point, we do not need to pass in the `--dataset, --models, --summarizer` parameters, as we have already defined these configurations in the config file. You can manually adjust the `--max-num-workers` setting to control the number of parallel workers.
### Visualization
@@ -155,6 +157,16 @@ If you use this method, please add a reference:
```bibtex
@misc{li2024needlebenchllmsretrievalreasoning,
    title={NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window?},
    author={Mo Li and Songyang Zhang and Yunxin Liu and Kai Chen},
    year={2024},
    eprint={2407.11963},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2407.11963},
}

@misc{2023opencompass,
    title={OpenCompass: A Universal Evaluation Platform for Foundation Models},
    author={OpenCompass Contributors},
@@ -6,7 +6,7 @@
## Task Overview
Within the `NeedleBench` framework of `OpenCompass`, we have designed a series of progressively more difficult test scenarios to comprehensively evaluate models' abilities in long-text information extraction and reasoning.
Within the `NeedleBench` framework of `OpenCompass`, we have designed a series of progressively more difficult test scenarios to comprehensively evaluate models' abilities in long-text information extraction and reasoning. For a complete introduction, see our [technical report](https://arxiv.org/abs/2407.11963).
- **Single-Needle Retrieval Task (S-RT)**: Evaluates an LLM's ability to extract a single key piece of information from a long text, testing its precise recall of specific details within broad narratives. This corresponds to the **original Needle In A Haystack test** setup.
@@ -77,11 +77,11 @@ python run.py --dataset needlebench_4k --models lmdeploy_internlm2_chat_7b --su
##### Evaluation on a Slurm Cluster
If using `Slurm`, you can add parameters such as `--slurm -p partition_name -q reserved --max-num-workers 32 --max-partition-size 8000`, for example:
If using `Slurm`, you can add parameters such as `--slurm -p partition_name -q reserved --max-num-workers 16`, for example:
```bash
# Slurm Evaluation
python run.py --dataset needlebench_4k --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 32 --max-partition-size 8000
python run.py --dataset needlebench_4k --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
```
##### Evaluating a Subdataset Only
@@ -89,13 +89,13 @@ python run.py --dataset needlebench_4k --models lmdeploy_internlm2_chat_7b --su
If you only want to test the original Needle In A Haystack task setup, you can change the dataset parameter to `needlebench_single_4k`, which corresponds to the single-needle version of the Needle In A Haystack test at 4k length:
```bash
python run.py --dataset needlebench_single_4k --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 32 --max-partition-size 8000
python run.py --dataset needlebench_single_4k --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
```
You can also narrow it down to a specific subdataset, for example by changing the `--datasets` parameter to `needlebench_single_4k/needlebench_zh_datasets` to test only the Chinese version of the single-needle 4k Needle In A Haystack task. The parameter after `/` denotes the subdataset; the available subdataset variables can be found in `configs/datasets/needlebench/needlebench_4k/needlebench_single_4k.py`, for example:
```bash
python run.py --dataset needlebench_single_4k/needlebench_zh_datasets --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 32 --max-partition-size 8000
python run.py --dataset needlebench_single_4k/needlebench_zh_datasets --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
```
Be sure to install the [LMDeploy](https://github.com/InternLM/lmdeploy) tool before running the evaluation.
@@ -112,8 +112,10 @@ pip install lmdeploy
```python
from mmengine.config import read_base
# We use mmengine.config to import variables from other configuration files
with read_base():
    from .models.hf_internlm.lmdeploy_internlm2_chat_7b import models as internlm2_chat_7b_200k
    # from .models.hf_internlm.lmdeploy_internlm2_chat_7b import models as internlm2_chat_7b_200k
    from .models.hf_internlm.hf_internlm2_chat_7b import models as internlm2_chat_7b
    # Evaluate needlebench_4k, adjust the configuration to use 8k, 32k, 128k, 200k, or 1000k if necessary.
@@ -131,7 +133,7 @@ with read_base():
datasets = sum([v for k, v in locals().items() if ('datasets' in k)], [])
for m in internlm2_chat_7b:
    m['max_seq_len'] = 32768  # Ensure the InternLM2-7B model can receive the complete long text; adjust for other models according to their maximum supported sequence length.
    m['max_seq_len'] = 30768  # Ensure the InternLM2-7B model can receive the complete long text while leaving room for the 2000-token max_out_len below; adjust for other models according to their maximum supported sequence length.
    m['max_out_len'] = 2000  # Ensure the model's complete response can be received in the multi-needle recall task
models = internlm2_chat_7b
@@ -142,10 +144,10 @@ work_dir = './outputs/needlebench'
Once the test `config` file is written, we can pass the corresponding config file path to `run.py` on the command line, for example:
```bash
python run.py configs/eval_needlebench.py --slurm -p partition_name -q reserved --max-num-workers 128 --max-partition-size 8000
python run.py configs/eval_needlebench.py --slurm -p partition_name -q reserved --max-num-workers 16
```
Note that at this point we do not need to pass in the `--dataset, --models, --summarizer` parameters, since these configurations are already defined in the config file. You can manually adjust the `--max-partition-size` setting to achieve the best task-slicing strategy and improve evaluation efficiency.
Note that at this point we do not need to pass in the `--dataset, --models, --summarizer` parameters, since these configurations are already defined in the config file. You can manually adjust the `--max-num-workers` setting to control the number of parallel workers.
### Visualization
@@ -155,6 +157,16 @@ python run.py configs/eval_needlebench.py --slurm -p partition_name -q reserved
```bibtex
@misc{li2024needlebenchllmsretrievalreasoning,
    title={NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window?},
    author={Mo Li and Songyang Zhang and Yunxin Liu and Kai Chen},
    year={2024},
    eprint={2407.11963},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2407.11963},
}

@misc{2023opencompass,
    title={OpenCompass: A Universal Evaluation Platform for Foundation Models},
    author={OpenCompass Contributors},
@@ -65,6 +65,8 @@ model_name_mapping = {
    'qwen1.5-72b-chat-vllm': 'Qwen-1.5-72B-vLLM',
    'glm4_notools': 'GLM-4',
    'claude-3-opus': 'Claude-3-Opus',
    'glm-4-9b-chat-1m-vllm': 'GLM4-9B-Chat-1M',
    'internlm2_5-7b-chat-1m-turbomind': 'InternLM2.5-7B-Chat-1M',
    # Add more mappings as necessary
}
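For context, here is a minimal sketch of how a display-name mapping like this is typically consumed when rendering summary tables; the `display_name` helper below is hypothetical (not an OpenCompass API), and the lookup-with-fallback pattern is the only point being illustrated.

```python
# Hypothetical helper (not part of OpenCompass): map an internal model
# abbreviation to its display name, falling back to the raw abbreviation
# when no entry exists in model_name_mapping above.
def display_name(model_abbr: str) -> str:
    return model_name_mapping.get(model_abbr, model_abbr)

print(display_name('glm-4-9b-chat-1m-vllm'))             # GLM4-9B-Chat-1M
print(display_name('internlm2_5-7b-chat-1m-turbomind'))  # InternLM2.5-7B-Chat-1M
print(display_name('some-unmapped-model'))                # some-unmapped-model
```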