diff --git a/README.md b/README.md index 7a1902c9..31deb70c 100644 --- a/README.md +++ b/README.md @@ -70,6 +70,7 @@ Just like a compass guides us on our journey, OpenCompass will guide you through ## 🚀 What's New +- **\[2024.07.17\]** We are excited to announce the release of NeedleBench's [technical report](http://arxiv.org/abs/2407.11963). We invite you to visit our [support documentation](https://opencompass.readthedocs.io/en/latest/advanced_guides/needleinahaystack_eval.html) for detailed evaluation guidelines. 🔥🔥🔥 - **\[2024.07.04\]** OpenCompass now supports InternLM2.5, which has **outstanding reasoning capability**, **1M Context window and** and **stronger tool use**, you can try the models in [OpenCompass Config](https://github.com/open-compass/opencompass/tree/main/configs/models/hf_internlm) and [InternLM](https://github.com/InternLM/InternLM) .🔥🔥🔥. - **\[2024.06.20\]** OpenCompass now supports one-click switching between inference acceleration backends, enhancing the efficiency of the evaluation process. In addition to the default HuggingFace inference backend, it now also supports popular backends [LMDeploy](https://github.com/InternLM/lmdeploy) and [vLLM](https://github.com/vllm-project/vllm). This feature is available via a simple command-line switch and through deployment APIs. For detailed usage, see the [documentation](docs/en/advanced_guides/accelerator_intro.md).🔥🔥🔥. - **\[2024.05.08\]** We supported the evaluation of 4 MoE models: [Mixtral-8x22B-v0.1](configs/models/mixtral/hf_mixtral_8x22b_v0_1.py), [Mixtral-8x22B-Instruct-v0.1](configs/models/mixtral/hf_mixtral_8x22b_instruct_v0_1.py), [Qwen1.5-MoE-A2.7B](configs/models/qwen/hf_qwen1_5_moe_a2_7b.py), [Qwen1.5-MoE-A2.7B-Chat](configs/models/qwen/hf_qwen1_5_moe_a2_7b_chat.py). Try them out now! diff --git a/README_zh-CN.md b/README_zh-CN.md index 5c5e65a8..4b677963 100644 --- a/README_zh-CN.md +++ b/README_zh-CN.md @@ -69,6 +69,7 @@ ## 🚀 最新进展 +- **\[2024.07.17\]** 我们正式发布 NeedleBench 的[技术报告](http://arxiv.org/abs/2407.11963)。诚邀您访问我们的[帮助文档](https://opencompass.readthedocs.io/zh-cn/latest/advanced_guides/needleinahaystack_eval.html)进行评估。🔥🔥🔥 - **\[2024.07.04\]** OpenCompass 现已支持 InternLM2.5, 它拥有卓越的推理性能、有效支持百万字超长上下文以及工具调用能力整体升级,欢迎访问[OpenCompass Config](https://github.com/open-compass/opencompass/tree/main/configs/models/hf_internlm) 和 [InternLM](https://github.com/InternLM/InternLM) .🔥🔥🔥. - **\[2024.06.20\]** OpenCompass 现已支持一键切换推理加速后端,助力评测过程更加高效。除了默认的HuggingFace推理后端外,还支持了常用的 [LMDeploy](https://github.com/InternLM/lmdeploy) 和 [vLLM](https://github.com/vllm-project/vllm) ,支持命令行一键切换和部署 API 加速服务两种方式,详细使用方法见[文档](docs/zh_cn/advanced_guides/accelerator_intro.md)。 欢迎试用!🔥🔥🔥. diff --git a/docs/en/advanced_guides/needleinahaystack_eval.md b/docs/en/advanced_guides/needleinahaystack_eval.md index b75f6394..7ad4997a 100644 --- a/docs/en/advanced_guides/needleinahaystack_eval.md +++ b/docs/en/advanced_guides/needleinahaystack_eval.md @@ -6,7 +6,7 @@ The Needle In A Haystack test (inspired by [NeedleInAHaystack](https://github.co ## Task Overview -Within the `NeedleBench` framework of `OpenCompass`, we have designed a series of increasingly challenging test scenarios to comprehensively evaluate the models' abilities in long text information extraction and reasoning: +Within the `NeedleBench` framework of `OpenCompass`, we have designed a series of increasingly challenging test scenarios to comprehensively evaluate the models' abilities in long text information extraction and reasoning. 
For a complete introduction, refer to our [technical report](https://arxiv.org/abs/2407.11963): - **Single-Needle Retrieval Task (S-RT)**: Assesses an LLM's ability to extract a single key piece of information from a long text, testing its precision in recalling specific details within broad narratives. This corresponds to the **original Needle In A Haystack test** setup. @@ -77,11 +77,11 @@ python run.py --dataset needlebench_4k --models lmdeploy_internlm2_chat_7b --su ##### Evaluation on a Slurm Cluster -If using `Slurm`, you can add parameters such as `--slurm -p partition_name -q reserved --max-num-workers 32 --max-partition-size 8000`, as shown below: +If using `Slurm`, you can add parameters such as `--slurm -p partition_name -q reserved --max-num-workers 16`, as shown below: ```bash # Slurm Evaluation -python run.py --dataset needlebench_4k --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 32 --max-partition-size 8000 +python run.py --dataset needlebench_4k --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16 ``` ##### Evaluating a Subdataset Only @@ -89,13 +89,13 @@ python run.py --dataset needlebench_4k --models lmdeploy_internlm2_chat_7b --su If you only want to test the original NeedleInAHaystack task setup, you could change the dataset parameter to `needlebench_single_4k`, which corresponds to the single needle version of the NeedleInAHaystack test at 4k length: ```bash -python run.py --dataset needlebench_single_4k --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 32 --max-partition-size 8000 +python run.py --dataset needlebench_single_4k --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16 ``` You can also choose to evaluate a specific subdataset, such as changing the `--datasets` parameter to `needlebench_single_4k/needlebench_zh_datasets` for testing just the Chinese version of the single needle 4K length NeedleInAHaystack task. The parameter after `/` represents the subdataset, which can be found in the dataset variable of `configs/datasets/needlebench/needlebench_4k/needlebench_single_4k.py` : ```bash -python run.py --dataset needlebench_single_4k/needlebench_zh_datasets --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 32 --max-partition-size 8000 +python run.py --dataset needlebench_single_4k/needlebench_zh_datasets --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16 ``` Be sure to install the [LMDeploy](https://github.com/InternLM/lmdeploy) tool before starting the evaluation: @@ -104,7 +104,7 @@ Be sure to install the [LMDeploy](https://github.com/InternLM/lmdeploy) tool bef pip install lmdeploy ``` -This command initiates the evaluation process, with parameters `-p partition_name -q auto` and `--max-num-workers 32` used to specify the Slurm partition name and the maximum number of worker processes. +This command initiates the evaluation process, with parameters `-p partition_name -q auto` and `--max-num-workers 16` used to specify the Slurm partition name and the maximum number of worker processes. 
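Following the same subdataset-selection pattern described above, the English-only counterpart can be evaluated by pointing `--dataset` at the corresponding variable. We assume below that this variable is named `needlebench_en_datasets`; confirm the exact name in `configs/datasets/needlebench/needlebench_4k/needlebench_single_4k.py` before running:

```bash
python run.py --dataset needlebench_single_4k/needlebench_en_datasets --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
```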
#### Evaluating Other `Huggingface` Models

@@ -112,8 +112,10 @@ For other models, we recommend writing an additional configuration file to modif

```python
from mmengine.config import read_base
+# We use mmengine.config to import variables from other configuration files
+
with read_base():
-    from .models.hf_internlm.lmdeploy_internlm2_chat_7b import models as internlm2_chat_7b_200k
+    # from .models.hf_internlm.lmdeploy_internlm2_chat_7b import models as internlm2_chat_7b_200k
    from .models.hf_internlm.hf_internlm2_chat_7b import models as internlm2_chat_7b

    # Evaluate needlebench_4k, adjust the configuration to use 8k, 32k, 128k, 200k, or 1000k if necessary.
@@ -131,7 +133,7 @@ with read_base():
datasets = sum([v for k, v in locals().items() if ('datasets' in k)], [])

for m in internlm2_chat_7b:
-    m['max_seq_len'] = 32768 # Ensure InternLM2-7B model can receive the complete long text, other models need to adjust according to their maximum sequence length support.
+    m['max_seq_len'] = 30768 # Ensure the InternLM2-7B model can receive the complete long text; adjust this for other models according to their maximum supported sequence length.
    m['max_out_len'] = 2000 # Ensure that in the multi-needle recall task, the model can receive a complete response

models = internlm2_chat_7b
@@ -142,10 +144,10 @@ work_dir = './outputs/needlebench'

Once the test `config` file is written, we can pass the corresponding config file path through the `run.py` file in the command line, such as:

```bash
-python run.py configs/eval_needlebench.py --slurm -p partition_name -q reserved --max-num-workers 128 --max-partition-size 8000
+python run.py configs/eval_needlebench.py --slurm -p partition_name -q reserved --max-num-workers 16
```

-Note, at this point, we do not need to pass in the `--dataset, --models, --summarizer` parameters, as we have already defined these configurations in the config file. You can manually adjust the `--max-partition-size` setting to achieve the best task slicing strategy to improve evaluation efficiency.
+Note that at this point we do not need to pass in the `--dataset, --models, --summarizer` parameters, as we have already defined these configurations in the config file. You can manually adjust the `--max-num-workers` setting to change the number of parallel workers.
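A brief note on the `max_seq_len = 30768` value in the config above: the guide does not state the rationale, but a plausible reading (our assumption, not documented) is that the prompt budget plus the generation budget is kept within InternLM2-7B's 32768-token window:

```python
# Assumed rationale for the budgets above (illustrative only, not from the guide):
context_window = 32768          # InternLM2-7B's maximum sequence length
max_out_len = 2000              # tokens reserved for the model's complete answer
max_seq_len = context_window - max_out_len
assert max_seq_len == 30768     # matches the value set in the config
```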
### Visualization @@ -155,6 +157,16 @@ If you use this method, please add a reference: ```bibtex +@misc{li2024needlebenchllmsretrievalreasoning, + title={NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window?}, + author={Mo Li and Songyang Zhang and Yunxin Liu and Kai Chen}, + year={2024}, + eprint={2407.11963}, + archivePrefix={arXiv}, + primaryClass={cs.CL}, + url={https://arxiv.org/abs/2407.11963}, +} + @misc{2023opencompass, title={OpenCompass: A Universal Evaluation Platform for Foundation Models}, author={OpenCompass Contributors}, diff --git a/docs/zh_cn/advanced_guides/needleinahaystack_eval.md b/docs/zh_cn/advanced_guides/needleinahaystack_eval.md index 72d21ae9..05457958 100644 --- a/docs/zh_cn/advanced_guides/needleinahaystack_eval.md +++ b/docs/zh_cn/advanced_guides/needleinahaystack_eval.md @@ -6,7 +6,7 @@ ## 任务介绍 -在`OpenCompass`的`NeedleBench`框架中,为了全面评估模型在长文本信息提取和推理方面的能力,我们设计了一系列逐渐增加难度的测试方案。 +在`OpenCompass`的`NeedleBench`框架中,为了全面评估模型在长文本信息提取和推理方面的能力,我们设计了一系列逐渐增加难度的测试方案。完整的介绍参见我们的[技术报告](https://arxiv.org/abs/2407.11963)。 - **单一信息检索任务(Single-Needle Retrieval Task, S-RT)**:评估LLM在长文本中提取单一关键信息的能力,测试其对广泛叙述中特定细节的精确回忆能力。这对应于**原始的大海捞针测试**任务设定。 @@ -77,11 +77,11 @@ python run.py --dataset needlebench_4k --models lmdeploy_internlm2_chat_7b --su ##### 在Slurm集群上评估 -如果使用 `Slurm`,可以添加 `--slurm -p partition_name -q reserved --max-num-workers 32 --max-partition-size 8000`等参数,例如下面: +如果使用 `Slurm`,可以添加 `--slurm -p partition_name -q reserved --max-num-workers 16 `等参数,例如下面: ```bash # Slurm评估 -python run.py --dataset needlebench_4k --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 32 --max-partition-size 8000 +python run.py --dataset needlebench_4k --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16 ``` ##### 只评估子数据集 @@ -89,13 +89,13 @@ python run.py --dataset needlebench_4k --models lmdeploy_internlm2_chat_7b --su 如果只想测试原始的大海捞针任务设定,比如可以更换数据集的参数为`needlebench_single_4k`,这对应于4k长度下的单针版本的大海捞针测试: ```bash -python run.py --dataset needlebench_single_4k --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 32 --max-partition-size 8000 +python run.py --dataset needlebench_single_4k --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16 ``` 您也可以进一步选择子数据集,如更换数据集`--datasets`的参数为`needlebench_single_4k/needlebench_zh_datasets`,仅仅进行中文版本的单针4K长度下的大海捞针任务测试,其中`/`后面的参数代表子数据集,您可以在`configs/datasets/needlebench/needlebench_4k/needlebench_single_4k.py`中找到可选的子数据集变量,如: ```bash -python run.py --dataset needlebench_single_4k/needlebench_zh_datasets --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 32 --max-partition-size 8000 +python run.py --dataset needlebench_single_4k/needlebench_zh_datasets --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16 ``` 注意在评估前预先安装[LMDeploy](https://github.com/InternLM/lmdeploy)工具 @@ -112,8 +112,10 @@ pip install lmdeploy ```python from mmengine.config import read_base +# 我们使用mmengine.config来import其他的配置文件中的变量 + with read_base(): - from .models.hf_internlm.lmdeploy_internlm2_chat_7b import models as internlm2_chat_7b_200k + # from 
.models.hf_internlm.lmdeploy_internlm2_chat_7b import models as internlm2_chat_7b_200k from .models.hf_internlm.hf_internlm2_chat_7b import models as internlm2_chat_7b # Evaluate needlebench_4k, adjust the configuration to use 8k, 32k, 128k, 200k, or 1000k if necessary. @@ -131,7 +133,7 @@ with read_base(): datasets = sum([v for k, v in locals().items() if ('datasets' in k)], []) for m in internlm2_chat_7b: - m['max_seq_len'] = 32768 # 保证InternLM2-7B模型能接收到完整的长文本,其他模型需要根据各自支持的最大序列长度修改。 + m['max_seq_len'] = 30768 # 保证InternLM2-7B模型能接收到完整的长文本,其他模型需要根据各自支持的最大序列长度修改。 m['max_out_len'] = 2000 # 保证在多针召回任务中能接收到模型完整的回答 models = internlm2_chat_7b @@ -142,10 +144,10 @@ work_dir = './outputs/needlebench' 当书写好测试的`config`文件后,我们可以命令行中通过`run.py`文件传入对应的config文件路径,例如: ```bash -python run.py configs/eval_needlebench.py --slurm -p partition_name -q reserved --max-num-workers 128 --max-partition-size 8000 +python run.py configs/eval_needlebench.py --slurm -p partition_name -q reserved --max-num-workers 16 ``` -注意,此时我们不需传入`--dataset, --models, --summarizer `等参数,因为我们已经在config文件中定义了这些配置。你可以自己手动调节`--max-partition-size`的设定以实现最好的任务分片策略以提高评估效率。 +注意,此时我们不需传入`--dataset, --models, --summarizer `等参数,因为我们已经在config文件中定义了这些配置。你可以自己手动调节`--max-num-workers`的设定以调节并行工作的workers的数量。 ### 可视化 @@ -155,6 +157,16 @@ python run.py configs/eval_needlebench.py --slurm -p partition_name -q reserved ```bibtex +@misc{li2024needlebenchllmsretrievalreasoning, + title={NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window?}, + author={Mo Li and Songyang Zhang and Yunxin Liu and Kai Chen}, + year={2024}, + eprint={2407.11963}, + archivePrefix={arXiv}, + primaryClass={cs.CL}, + url={https://arxiv.org/abs/2407.11963}, +} + @misc{2023opencompass, title={OpenCompass: A Universal Evaluation Platform for Foundation Models}, author={OpenCompass Contributors}, diff --git a/opencompass/summarizers/needlebench.py b/opencompass/summarizers/needlebench.py index 93e2b909..1be32dd3 100644 --- a/opencompass/summarizers/needlebench.py +++ b/opencompass/summarizers/needlebench.py @@ -65,6 +65,8 @@ model_name_mapping = { 'qwen1.5-72b-chat-vllm': 'Qwen-1.5-72B-vLLM', 'glm4_notools': 'GLM-4', 'claude-3-opus': 'Claude-3-Opus', + 'glm-4-9b-chat-1m-vllm': 'GLM4-9B-Chat-1M', + 'internlm2_5-7b-chat-1m-turbomind': 'InternLM2.5-7B-Chat-1M', # Add more mappings as necessary }
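For context, `model_name_mapping` maps model abbreviations used in evaluation configs to display names in the NeedleBench summary. The snippet below is a minimal, hypothetical illustration of how such a mapping is typically consumed; it is not the summarizer's actual code, and the fallback helper is our own:

```python
# Hypothetical illustration of consuming a display-name mapping (not the summarizer's real code).
model_name_mapping = {
    'glm-4-9b-chat-1m-vllm': 'GLM4-9B-Chat-1M',
    'internlm2_5-7b-chat-1m-turbomind': 'InternLM2.5-7B-Chat-1M',
}

def display_name(model_abbr: str) -> str:
    """Return the pretty display name, falling back to the raw abbreviation."""
    return model_name_mapping.get(model_abbr, model_abbr)

print(display_name('internlm2_5-7b-chat-1m-turbomind'))  # InternLM2.5-7B-Chat-1M
print(display_name('unmapped-model'))                     # unmapped-model
```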