Mirror of https://github.com/open-compass/opencompass.git (synced 2025-05-30 16:03:24 +08:00)

Commit 59f02ce1a2: Merge branch 'open-compass:main' into main
@@ -70,6 +70,7 @@ Just like a compass guides us on our journey, OpenCompass will guide you through

## 🚀 What's New <a><img width="35" height="20" src="https://user-images.githubusercontent.com/12782558/212848161-5e783dd6-11e8-4fe0-bbba-39ffb77730be.png"></a>

+- **\[2024.07.17\]** We are excited to announce the release of NeedleBench's [technical report](http://arxiv.org/abs/2407.11963). We invite you to visit our [support documentation](https://opencompass.readthedocs.io/en/latest/advanced_guides/needleinahaystack_eval.html) for detailed evaluation guidelines. 🔥🔥🔥
- **\[2024.07.04\]** OpenCompass now supports InternLM2.5, which has **outstanding reasoning capability**, a **1M context window**, and **stronger tool use**. You can try the models in [OpenCompass Config](https://github.com/open-compass/opencompass/tree/main/configs/models/hf_internlm) and [InternLM](https://github.com/InternLM/InternLM). 🔥🔥🔥
- **\[2024.06.20\]** OpenCompass now supports one-click switching between inference acceleration backends, enhancing the efficiency of the evaluation process. In addition to the default HuggingFace inference backend, it now also supports the popular backends [LMDeploy](https://github.com/InternLM/lmdeploy) and [vLLM](https://github.com/vllm-project/vllm). This feature is available via a simple command-line switch and through deployment APIs. For detailed usage, see the [documentation](docs/en/advanced_guides/accelerator_intro.md). 🔥🔥🔥
- **\[2024.05.08\]** We supported the evaluation of 4 MoE models: [Mixtral-8x22B-v0.1](configs/models/mixtral/hf_mixtral_8x22b_v0_1.py), [Mixtral-8x22B-Instruct-v0.1](configs/models/mixtral/hf_mixtral_8x22b_instruct_v0_1.py), [Qwen1.5-MoE-A2.7B](configs/models/qwen/hf_qwen1_5_moe_a2_7b.py), [Qwen1.5-MoE-A2.7B-Chat](configs/models/qwen/hf_qwen1_5_moe_a2_7b_chat.py). Try them out now!
@@ -69,6 +69,7 @@

## 🚀 What's New <a><img width="35" height="20" src="https://user-images.githubusercontent.com/12782558/212848161-5e783dd6-11e8-4fe0-bbba-39ffb77730be.png"></a>

+- **\[2024.07.17\]** We have officially released the NeedleBench [technical report](http://arxiv.org/abs/2407.11963). You are welcome to visit our [documentation](https://opencompass.readthedocs.io/zh-cn/latest/advanced_guides/needleinahaystack_eval.html) for evaluation guidelines. 🔥🔥🔥
- **\[2024.07.04\]** OpenCompass now supports InternLM2.5, which offers outstanding reasoning performance, effective support for million-character ultra-long context, and an overall upgrade in tool-calling capability. Feel free to visit [OpenCompass Config](https://github.com/open-compass/opencompass/tree/main/configs/models/hf_internlm) and [InternLM](https://github.com/InternLM/InternLM). 🔥🔥🔥
- **\[2024.06.20\]** OpenCompass now supports one-click switching of inference acceleration backends, making evaluation more efficient. Besides the default HuggingFace inference backend, the popular [LMDeploy](https://github.com/InternLM/lmdeploy) and [vLLM](https://github.com/vllm-project/vllm) backends are also supported, either via a one-line command-line switch or through a deployed API acceleration service; see the [documentation](docs/zh_cn/advanced_guides/accelerator_intro.md) for detailed usage.
Welcome to try it out! 🔥🔥🔥
@@ -6,7 +6,7 @@ The Needle In A Haystack test (inspired by [NeedleInAHaystack](https://github.co

## Task Overview

-Within the `NeedleBench` framework of `OpenCompass`, we have designed a series of increasingly challenging test scenarios to comprehensively evaluate the models' abilities in long text information extraction and reasoning:
+Within the `NeedleBench` framework of `OpenCompass`, we have designed a series of increasingly challenging test scenarios to comprehensively evaluate the models' abilities in long text information extraction and reasoning. For a complete introduction, refer to our [technical report](https://arxiv.org/abs/2407.11963):

- **Single-Needle Retrieval Task (S-RT)**: Assesses an LLM's ability to extract a single key piece of information from a long text, testing its precision in recalling specific details within broad narratives. This corresponds to the **original Needle In A Haystack test** setup.
@@ -77,11 +77,11 @@ python run.py --dataset needlebench_4k --models lmdeploy_internlm2_chat_7b --su

##### Evaluation on a Slurm Cluster

-If using `Slurm`, you can add parameters such as `--slurm -p partition_name -q reserved --max-num-workers 32 --max-partition-size 8000`, as shown below:
+If using `Slurm`, you can add parameters such as `--slurm -p partition_name -q reserved --max-num-workers 16`, as shown below:

```bash
# Slurm Evaluation
-python run.py --dataset needlebench_4k --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 32 --max-partition-size 8000
+python run.py --dataset needlebench_4k --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
```

##### Evaluating a Subdataset Only
@@ -89,13 +89,13 @@ python run.py --dataset needlebench_4k --models lmdeploy_internlm2_chat_7b --su

If you only want to test the original NeedleInAHaystack task setup, you could change the dataset parameter to `needlebench_single_4k`, which corresponds to the single-needle version of the NeedleInAHaystack test at 4k length:

```bash
-python run.py --dataset needlebench_single_4k --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 32 --max-partition-size 8000
+python run.py --dataset needlebench_single_4k --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
```

You can also choose to evaluate a specific subdataset, such as changing the `--datasets` parameter to `needlebench_single_4k/needlebench_zh_datasets` to test just the Chinese version of the single-needle 4k-length NeedleInAHaystack task. The parameter after `/` represents the subdataset, whose available variables can be found in `configs/datasets/needlebench/needlebench_4k/needlebench_single_4k.py`:

```bash
-python run.py --dataset needlebench_single_4k/needlebench_zh_datasets --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 32 --max-partition-size 8000
+python run.py --dataset needlebench_single_4k/needlebench_zh_datasets --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
```

Be sure to install the [LMDeploy](https://github.com/InternLM/lmdeploy) tool before starting the evaluation:
@@ -104,7 +104,7 @@ Be sure to install the [LMDeploy](https://github.com/InternLM/lmdeploy) tool bef

pip install lmdeploy
```

-This command initiates the evaluation process, with the parameters `-p partition_name -q auto` and `--max-num-workers 32` used to specify the Slurm partition name and the maximum number of worker processes.
+This command initiates the evaluation process, with the parameters `-p partition_name -q auto` and `--max-num-workers 16` used to specify the Slurm partition name and the maximum number of worker processes.

#### Evaluating Other `Huggingface` Models
@@ -112,8 +112,10 @@ For other models, we recommend writing an additional configuration file to modif

```python
from mmengine.config import read_base
+# We use mmengine.config to import variables from other configuration files

with read_base():
-    from .models.hf_internlm.lmdeploy_internlm2_chat_7b import models as internlm2_chat_7b_200k
+    # from .models.hf_internlm.lmdeploy_internlm2_chat_7b import models as internlm2_chat_7b_200k
    from .models.hf_internlm.hf_internlm2_chat_7b import models as internlm2_chat_7b

    # Evaluate needlebench_4k, adjust the configuration to use 8k, 32k, 128k, 200k, or 1000k if necessary.
@@ -131,7 +133,7 @@ with read_base():

datasets = sum([v for k, v in locals().items() if ('datasets' in k)], [])

for m in internlm2_chat_7b:
-    m['max_seq_len'] = 32768  # Ensure the InternLM2-7B model can receive the complete long text; other models should be adjusted according to their maximum supported sequence length.
+    m['max_seq_len'] = 30768  # Ensure the InternLM2-7B model can receive the complete long text; other models should be adjusted according to their maximum supported sequence length.
    m['max_out_len'] = 2000  # Ensure that a complete response can be received in the multi-needle recall task

models = internlm2_chat_7b
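For orientation, the fragments in the two hunks above assemble into a config file along the following lines. This is a hedged sketch only: the dataset import line and its variable name are assumptions based on the subdataset path mentioned earlier, not necessarily the exact contents of `configs/eval_needlebench.py`.

```python
# Sketch of a NeedleBench eval config (assumed layout; adjust paths to your checkout).
from mmengine.config import read_base

with read_base():
    # Model to evaluate (HuggingFace backend, as in the hunk above).
    from .models.hf_internlm.hf_internlm2_chat_7b import models as internlm2_chat_7b
    # Dataset import is an assumption: the guide points at
    # configs/datasets/needlebench/needlebench_4k/needlebench_single_4k.py,
    # which defines variables such as needlebench_zh_datasets.
    from .datasets.needlebench.needlebench_4k.needlebench_single_4k import needlebench_zh_datasets

# Gather every imported variable whose name contains 'datasets'.
datasets = sum([v for k, v in locals().items() if ('datasets' in k)], [])

for m in internlm2_chat_7b:
    m['max_seq_len'] = 30768  # leave the model enough room for the full haystack
    m['max_out_len'] = 2000   # leave room for a complete multi-needle answer

models = internlm2_chat_7b
work_dir = './outputs/needlebench'
```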
@@ -142,10 +144,10 @@ work_dir = './outputs/needlebench'

Once the test `config` file is written, we can pass the corresponding config file path to `run.py` on the command line, such as:

```bash
-python run.py configs/eval_needlebench.py --slurm -p partition_name -q reserved --max-num-workers 128 --max-partition-size 8000
+python run.py configs/eval_needlebench.py --slurm -p partition_name -q reserved --max-num-workers 16
```

-Note that at this point we do not need to pass in the `--dataset, --models, --summarizer` parameters, as we have already defined these configurations in the config file. You can manually adjust the `--max-partition-size` setting to achieve the best task-slicing strategy and improve evaluation efficiency.
+Note that at this point we do not need to pass in the `--dataset, --models, --summarizer` parameters, as we have already defined these configurations in the config file. You can manually adjust the `--max-num-workers` setting to control the number of parallel workers.

### Visualization
@@ -155,6 +157,16 @@ If you use this method, please add a reference:

```bibtex
+@misc{li2024needlebenchllmsretrievalreasoning,
+      title={NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window?},
+      author={Mo Li and Songyang Zhang and Yunxin Liu and Kai Chen},
+      year={2024},
+      eprint={2407.11963},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2407.11963},
+}
+
@misc{2023opencompass,
    title={OpenCompass: A Universal Evaluation Platform for Foundation Models},
    author={OpenCompass Contributors},
@@ -6,7 +6,7 @@

## Task Overview

-Within the `NeedleBench` framework of `OpenCompass`, we have designed a series of progressively more difficult test schemes to comprehensively evaluate models' abilities in long-text information extraction and reasoning.
+Within the `NeedleBench` framework of `OpenCompass`, we have designed a series of progressively more difficult test schemes to comprehensively evaluate models' abilities in long-text information extraction and reasoning. For a complete introduction, see our [technical report](https://arxiv.org/abs/2407.11963).

- **Single-Needle Retrieval Task (S-RT)**: Evaluates an LLM's ability to extract a single key piece of information from a long text, testing its precise recall of specific details within a broad narrative. This corresponds to the **original Needle In A Haystack test** setup.
@@ -77,11 +77,11 @@ python run.py --dataset needlebench_4k --models lmdeploy_internlm2_chat_7b --su

##### Evaluation on a Slurm Cluster

-If using `Slurm`, you can add parameters such as `--slurm -p partition_name -q reserved --max-num-workers 32 --max-partition-size 8000`, for example:
+If using `Slurm`, you can add parameters such as `--slurm -p partition_name -q reserved --max-num-workers 16`, for example:

```bash
# Slurm evaluation
-python run.py --dataset needlebench_4k --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 32 --max-partition-size 8000
+python run.py --dataset needlebench_4k --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
```

##### Evaluating a Subdataset Only
@@ -89,13 +89,13 @@ python run.py --dataset needlebench_4k --models lmdeploy_internlm2_chat_7b --su

If you only want to test the original NeedleInAHaystack task setup, you can change the dataset parameter to `needlebench_single_4k`, which corresponds to the single-needle version of the NeedleInAHaystack test at 4k length:

```bash
-python run.py --dataset needlebench_single_4k --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 32 --max-partition-size 8000
+python run.py --dataset needlebench_single_4k --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
```

You can also further select a subdataset, for example changing the `--datasets` parameter to `needlebench_single_4k/needlebench_zh_datasets` to run only the Chinese version of the single-needle 4k-length NeedleInAHaystack test. The parameter after `/` denotes the subdataset; the available subdataset variables can be found in `configs/datasets/needlebench/needlebench_4k/needlebench_single_4k.py`, for example:

```bash
-python run.py --dataset needlebench_single_4k/needlebench_zh_datasets --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 32 --max-partition-size 8000
+python run.py --dataset needlebench_single_4k/needlebench_zh_datasets --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
```

Be sure to install the [LMDeploy](https://github.com/InternLM/lmdeploy) tool before running the evaluation:
@@ -112,8 +112,10 @@ pip install lmdeploy

```python
from mmengine.config import read_base
+# We use mmengine.config to import variables from other configuration files

with read_base():
-    from .models.hf_internlm.lmdeploy_internlm2_chat_7b import models as internlm2_chat_7b_200k
+    # from .models.hf_internlm.lmdeploy_internlm2_chat_7b import models as internlm2_chat_7b_200k
    from .models.hf_internlm.hf_internlm2_chat_7b import models as internlm2_chat_7b

    # Evaluate needlebench_4k, adjust the configuration to use 8k, 32k, 128k, 200k, or 1000k if necessary.
@@ -131,7 +133,7 @@ with read_base():

datasets = sum([v for k, v in locals().items() if ('datasets' in k)], [])

for m in internlm2_chat_7b:
-    m['max_seq_len'] = 32768  # Ensure the InternLM2-7B model can receive the complete long text; other models should be adjusted to their maximum supported sequence length.
+    m['max_seq_len'] = 30768  # Ensure the InternLM2-7B model can receive the complete long text; other models should be adjusted to their maximum supported sequence length.
    m['max_out_len'] = 2000  # Ensure the model's complete answer can be received in the multi-needle recall task

models = internlm2_chat_7b
@@ -142,10 +144,10 @@ work_dir = './outputs/needlebench'

Once the test `config` file is written, we can pass the corresponding config file path to `run.py` on the command line, for example:

```bash
-python run.py configs/eval_needlebench.py --slurm -p partition_name -q reserved --max-num-workers 128 --max-partition-size 8000
+python run.py configs/eval_needlebench.py --slurm -p partition_name -q reserved --max-num-workers 16
```

-Note that at this point we do not need to pass in the `--dataset, --models, --summarizer` parameters, since these configurations are already defined in the config file. You can manually tune the `--max-partition-size` setting to achieve the best task-slicing strategy and improve evaluation efficiency.
+Note that at this point we do not need to pass in the `--dataset, --models, --summarizer` parameters, since these configurations are already defined in the config file. You can manually tune the `--max-num-workers` setting to control the number of parallel workers.

### Visualization
@@ -155,6 +157,16 @@ python run.py configs/eval_needlebench.py --slurm -p partition_name -q reserved

```bibtex
+@misc{li2024needlebenchllmsretrievalreasoning,
+      title={NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window?},
+      author={Mo Li and Songyang Zhang and Yunxin Liu and Kai Chen},
+      year={2024},
+      eprint={2407.11963},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2407.11963},
+}
+
@misc{2023opencompass,
    title={OpenCompass: A Universal Evaluation Platform for Foundation Models},
    author={OpenCompass Contributors},
@@ -351,7 +351,7 @@ class APITemplateParser:

    def _prompt2api(self,
                    prompts: Union[List, str],
                    role_dict: Dict[str, Dict],
-                   for_gen: bool = False) -> Tuple[str, bool]:
+                   for_gen: bool = False) -> Tuple[List, bool]:
        """Convert the prompts to a API-style prompts, given an updated
        role_dict.
@@ -363,7 +363,7 @@ class APITemplateParser:

                role whose "generate" is set to True.

        Returns:
-            Tuple[str, bool]: The converted string, and whether the follow-up
+            Tuple[List, bool]: The converted string, and whether the follow-up
                conversion should be proceeded.
        """
        cont = True
@@ -376,7 +376,7 @@ class APITemplateParser:

        res = []
        for prompt in prompts:
            if isinstance(prompt, str):
-                raise TypeError('Mixing str without explictt role is not '
+                raise TypeError('Mixing str without explict role is not '
                                'allowed in API models!')
            else:
                api_role, cont = self._role2api_role(prompt, role_dict,
@@ -390,7 +390,7 @@ class APITemplateParser:

    def _role2api_role(self,
                       role_prompt: Dict,
                       role_dict: Dict[str, Dict],
-                      for_gen: bool = False) -> Tuple[str, bool]:
+                      for_gen: bool = False) -> Tuple[Dict, bool]:
        """Convert a role prompt to a string, given an updated role_dict.

        Args:
@@ -401,7 +401,7 @@ class APITemplateParser:

                role whose "generate" is set to True.

        Returns:
-            Tuple[str, bool]: The converted string, and whether the follow-up
+            Tuple[Dict, bool]: The converted string, and whether the follow-up
                conversion should be proceeded.
        """
        merged_prompt = role_dict.get(
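The return-type fixes above describe the shapes these helpers actually hand back: `_role2api_role` turns one role-tagged prompt item into a single message dict, and `_prompt2api` collects those dicts into a list. The standalone sketch below illustrates that shape only; it is not OpenCompass's implementation, and the role names and dict fields are assumptions.

```python
from typing import Dict, List, Tuple, Union

# Toy role table (values are assumptions, not OpenCompass defaults).
ROLE_DICT: Dict[str, Dict] = {
    'HUMAN': {'api_role': 'user'},
    'BOT': {'api_role': 'assistant'},
}

def role2api_role(role_prompt: Dict, role_dict: Dict[str, Dict]) -> Tuple[Dict, bool]:
    """Map one role-tagged prompt item to a single API message dict."""
    merged = role_dict.get(role_prompt['role'])
    if merged is None:
        return {}, False  # unknown role: signal that conversion should stop
    return {'role': merged['api_role'], 'prompt': role_prompt.get('prompt', '')}, True

def prompt2api(prompts: Union[List, str], role_dict: Dict[str, Dict]) -> Tuple[List, bool]:
    """Collect the per-item dicts into a list of API-style messages."""
    res: List[Dict] = []
    for prompt in prompts:
        if isinstance(prompt, str):
            # Same rule as in the diff: bare strings carry no role information.
            raise TypeError('Mixing str without explicit role is not allowed in API models!')
        api_role, cont = role2api_role(prompt, role_dict)
        if not cont:
            return res, False
        res.append(api_role)
    return res, True

# Two role-tagged turns become a list of two message dicts.
messages, ok = prompt2api(
    [{'role': 'HUMAN', 'prompt': 'Hi'}, {'role': 'BOT', 'prompt': 'Hello!'}],
    ROLE_DICT,
)
```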
@@ -65,6 +65,8 @@ model_name_mapping = {

    'qwen1.5-72b-chat-vllm': 'Qwen-1.5-72B-vLLM',
    'glm4_notools': 'GLM-4',
    'claude-3-opus': 'Claude-3-Opus',
+    'glm-4-9b-chat-1m-vllm': 'GLM4-9B-Chat-1M',
+    'internlm2_5-7b-chat-1m-turbomind': 'InternLM2.5-7B-Chat-1M',
    # Add more mappings as necessary
}
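For context, a mapping like the one above is typically consumed when summarizing results, turning a raw model abbreviation into a display name. A minimal hedged sketch follows; the helper is illustrative and not part of OpenCompass.

```python
# Illustrative subset of the mapping shown in the hunk above.
model_name_mapping = {
    'glm-4-9b-chat-1m-vllm': 'GLM4-9B-Chat-1M',
    'internlm2_5-7b-chat-1m-turbomind': 'InternLM2.5-7B-Chat-1M',
}

def display_name(abbr: str) -> str:
    # Fall back to the raw abbreviation when no pretty name is registered.
    return model_name_mapping.get(abbr, abbr)

assert display_name('internlm2_5-7b-chat-1m-turbomind') == 'InternLM2.5-7B-Chat-1M'
assert display_name('some-unmapped-model') == 'some-unmapped-model'
```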