Mirror of https://github.com/open-compass/opencompass.git (synced 2025-05-30 16:03:24 +08:00)

Commit 7a44a80bb9: Merge 03f16c8a83 into d572761cef
# Needle In A Haystack Evaluation

## Introduction to the Needle In A Haystack Test

The Needle In A Haystack test (inspired by [NeedleInAHaystack](https://github.com/gkamradt/LLMTest_NeedleInAHaystack/blob/main/LLMNeedleHaystackTester.py)) is an evaluation method where key information is randomly inserted into long texts to form the prompt for large language models (LLMs). The test assesses whether LLMs can extract critical information from long texts, thereby evaluating their fundamental ability to comprehend and process long-context documents.

## Task Overview

Within the `OpenCompass` framework, under `NeedleBench`, we designed a series of progressively challenging evaluation tasks to comprehensively assess LLMs' long-text information extraction and reasoning capabilities. For a complete description, please refer to our [technical report](https://arxiv.org/abs/2407.11963).

- **Single-Needle Retrieval Task (S-RT)**: Evaluates the LLM's ability to retrieve a single piece of key information from a long text, testing precise recall of specific details within extensive narratives. This corresponds to the **original Needle In A Haystack test** setup.

- **Multi-Needle Retrieval Task (M-RT)**: Explores the LLM's ability to retrieve multiple relevant pieces of information from long texts, simulating real-world scenarios of complex queries over comprehensive documents.

- **Multi-Needle Reasoning Task (M-RS)**: Assesses the LLM's ability to integrate multiple key pieces of information extracted from long texts for reasoning, requiring a comprehensive understanding of each key information fragment.

- **Ancestral Trace Challenge (ATC)**: Tests the LLM's capability to handle multi-layer logical challenges within realistic long-text contexts through "kinship-trace needles." In the ATC task, no irrelevant (haystack) text is added; every piece of text is critical, and the model must reason through all the details to answer accurately.

## Evaluation Steps

> **Note:** NeedleBench (v2) includes several optimizations and adjustments in dataset construction and task details. For a detailed comparison between the old and new versions, as well as a summary of updates, please refer to [opencompass/configs/datasets/needlebench_v2/readme.md](https://github.com/open-compass/opencompass/blob/main/opencompass/configs/datasets/needlebench_v2/readme.md).

1. Download the dataset from [here](https://github.com/open-compass/opencompass/files/14741330/needlebench.zip).

2. Place the downloaded files in the `opencompass/data/needlebench/` directory. The expected file structure in the `needlebench` directory is shown below:

```
opencompass/
├── configs
├── docs
├── data
│   └── needlebench
│       ├── multi_needle_reasoning_en.json
│       ├── multi_needle_reasoning_zh.json
│       ├── names.json
│       ├── needles.jsonl
│       ├── PaulGrahamEssays.jsonl
│       ├── zh_finance.jsonl
│       ├── zh_game.jsonl
│       ├── zh_government.jsonl
│       ├── zh_movie.jsonl
│       ├── zh_tech.jsonl
│       ├── zh_general.jsonl
├── LICENSE
├── opencompass
├── outputs
├── run.py
├── more...
```
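
If you take the manual route, steps 1-2 amount to something like the following shell sketch (assuming `wget` and `unzip` are available and you are in the repository root; depending on the archive layout, you may need to move the unpacked files up one level):

```bash
# Fetch and unpack the NeedleBench data into the expected directory
wget https://github.com/open-compass/opencompass/files/14741330/needlebench.zip
mkdir -p data/needlebench
unzip needlebench.zip -d data/needlebench/
```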
> Note: In the latest `OpenCompass` codebase, the NeedleBench dataset is automatically loaded from the [Huggingface interface](https://huggingface.co/datasets/opencompass/NeedleBench), so the manual download and placement steps above can be skipped entirely.

### `OpenCompass` Environment Setup

```bash
cd opencompass
pip install -e .
```

### Dataset Configuration

We have pre-configured various long-context settings (4k, 8k, 32k, 128k, 200k, 1000k) in `opencompass/configs/datasets/needlebench_v2`, and you can flexibly create datasets that meet your needs by adjusting the related parameters in those configuration files.
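
As an illustration, the main knobs a NeedleBench config exposes are the context lengths and the needle insertion depths; the excerpt below mirrors the values used by the v2 128k single-needle config included in this commit:

```python
# Illustrative excerpt of the tunable dataset parameters
context_lengths = [1000, 2000, 4000, 8000, 16000, 32000, 64000, 128000]
depths_list = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]  # insertion depth, in percent
```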
### Evaluation Example

#### Evaluating the `Qwen2-5-7B` Model Deployed with `VLLM`

To evaluate the `Qwen2-5-7B` model deployed with `VLLM` on all tasks under NeedleBench-128K, use the following command. It relies on pre-defined model and dataset configuration files, so no additional configuration file is needed.

##### Local Evaluation

If evaluating locally, the command will use all available GPUs. You can control GPU visibility by setting the `CUDA_VISIBLE_DEVICES` environment variable:

```bash
# Local evaluation
python run.py --dataset needlebench_v2_128k --models vllm_qwen2_5_7b_instruct_128k --summarizer needlebench/needlebench_v2_128k_summarizer
```
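
For instance, to expose only the first four GPUs to `OpenCompass`:

```bash
# Restrict OpenCompass to GPUs 0-3
CUDA_VISIBLE_DEVICES=0,1,2,3 python run.py --dataset needlebench_v2_128k --models vllm_qwen2_5_7b_instruct_128k --summarizer needlebench/needlebench_v2_128k_summarizer
```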
##### Evaluation on a Slurm Cluster

For Slurm environments, you can add options such as `--slurm -p partition_name -q reserved --max-num-workers 16`:

```bash
# Slurm evaluation
python run.py --dataset needlebench_v2_128k --models vllm_qwen2_5_7b_instruct_128k --summarizer needlebench/needlebench_v2_128k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
```

##### Evaluating Specific Subsets

If you only want to test the original Needle In A Haystack task (e.g., the single-needle version at 128k), change the dataset parameter to `needlebench_v2_single_128k`:

```bash
python run.py --dataset needlebench_v2_single_128k --models vllm_qwen2_5_7b_instruct_128k --summarizer needlebench/needlebench_v2_128k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
```

You can also evaluate a specific sub-dataset by appending it after `/`, e.g. `needlebench_v2_single_128k/needlebench_zh_datasets` to test only the Chinese version of the single-needle 128k task. The available sub-dataset variables are defined in `opencompass/configs/datasets/needlebench_v2/needlebench_v2_128k/needlebench_v2_single_128k.py`:

```bash
python run.py --dataset needlebench_v2_single_128k/needlebench_zh_datasets --models vllm_qwen2_5_7b_instruct_128k --summarizer needlebench/needlebench_v2_128k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
```

Ensure `VLLM` is installed beforehand:

```bash
# Install vLLM with CUDA 12.4.
# For other CUDA versions, see https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html
pip install vllm
```

The `-p partition_name` and `-q reserved` options specify the Slurm partition name and quota type, and `--max-num-workers 16` sets the maximum number of worker processes.

#### Evaluating Other `Huggingface` Models

For other models, we recommend writing a separate config file (such as `examples/eval_needlebench_v2.py`, shown in full later in this commit) to adjust the model's `max_seq_len` and `max_out_len` parameters, so the model can receive the complete long-text input.

Once the config file is written, pass its path to `run.py` on the command line:

```bash
python run.py examples/eval_needlebench_v2.py --slurm -p partition_name -q reserved --max-num-workers 16
```

There is no need to pass `--dataset`, `--models`, or `--summarizer` here, since these are already defined in the config file; you can still adjust `--max-num-workers` to control the number of parallel workers.

### Visualization

The latest version of NeedleBench has result visualization built into the `summarizer`. You can find the corresponding plots in the `plots` directory under the output folder; there is no need to manually visualize the scores across depths and lengths.

### Citation

If you use NeedleBench, please cite us:

```bibtex
@misc{li2025needlebenchllmsretrievalreasoning,
    title={NeedleBench: Can LLMs Do Retrieval and Reasoning in Information-Dense Context?},
    author={Mo Li and Songyang Zhang and Taolin Zhang and Haodong Duan and Yunxin Liu and Kai Chen},
    year={2025},
    eprint={2407.11963},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2407.11963},
}

@misc{2023opencompass,
    title={OpenCompass: A Universal Evaluation Platform for Foundation Models},
    author={OpenCompass Contributors},
    howpublished={\url{https://github.com/open-compass/opencompass}},
    year={2023}
}

@misc{LLMTest_NeedleInAHaystack,
    title={LLMTest Needle In A Haystack - Pressure Testing LLMs},
    author={Greg Kamradt},
    year={2023},
    howpublished={\url{https://github.com/gkamradt/LLMTest_NeedleInAHaystack}}
}

@misc{wei2023skywork,
    title={Skywork: A More Open Bilingual Foundation Model},
    author={Tianwen Wei and Liang Zhao and Lichang Zhang and Bo Zhu and Lijie Wang and Haihua Yang and Biye Li and Cheng Cheng and Weiwei L\"u and Rui Hu and Chenxia Li and Liu Yang and Xilin Luo and Xuejie Wu and Lunan Liu and Wenjun Cheng and Peng Cheng and Jianhao Zhang and Xiaoyu Zhang and Lei Lin and Xiaokun Wang and Yutuan Ma and Chuanhai Dong and Yanqi Sun and Yifu Chen and Yongyi Peng and Xiaojuan Liang and Shuicheng Yan and Han Fang and Yahui Zhou},
    year={2023},
    eprint={2310.19341},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```
@ -16,37 +16,11 @@

- **Ancestral Trace Challenge (ATC)**: Uses "kinship needles" to test the LLM's ability to handle multi-layer logical challenges in realistic long texts. The ATC task uses a series of logical reasoning questions to examine the model's memory of, and analysis over, every detail in the long text. In this task we remove the irrelevant-text (haystack) setting: all text is key information, and the LLM must draw on everything in the text, and reason over it, to answer correctly.

## Evaluation Steps

> **Note:** NeedleBench (v2) includes some optimizations and adjustments in dataset construction and task details. For the concrete differences between the old and new versions and a summary of updates, please refer to [opencompass/configs/datasets/needlebench_v2/readme.md](https://github.com/open-compass/opencompass/blob/main/opencompass/configs/datasets/needlebench_v2/readme.md).

1. Download the dataset from [here](https://github.com/open-compass/opencompass/files/14741330/needlebench.zip).

2. Place the downloaded files under the `opencompass/data/needlebench/` directory. The expected file structure in the `needlebench` directory is shown below:

```
opencompass/
├── configs
├── docs
├── data
│   └── needlebench
│       ├── multi_needle_reasoning_en.json
│       ├── multi_needle_reasoning_zh.json
│       ├── names.json
│       ├── needles.jsonl
│       ├── PaulGrahamEssays.jsonl
│       ├── zh_finance.jsonl
│       ├── zh_game.jsonl
│       ├── zh_government.jsonl
│       ├── zh_movie.jsonl
│       ├── zh_tech.jsonl
│       ├── zh_general.jsonl
├── LICENSE
├── opencompass
├── outputs
├── run.py
├── more...
```

> Note: In the latest OpenCompass code, the NeedleBench dataset is loaded automatically from the [Huggingface interface](https://huggingface.co/datasets/opencompass/NeedleBench); there is no need to download or configure the dataset manually, and you can run the evaluation commands directly.

### `OpenCompass` Environment Setup

```bash
cd opencompass
pip install -e .
```

### Dataset Configuration

We have pre-configured long-text test settings for common length ranges (4k, 8k, 32k, 128k, 200k, 1000k) in `opencompass/configs/datasets/needlebench_v2`; by defining the related parameters in the configuration files, you can flexibly create datasets that fit your needs.

### Evaluation Example

#### Evaluating the `Qwen2-5-7B` Model Deployed with `VLLM`

For example, to evaluate the `Qwen2-5-7B` model deployed with `VLLM` on all NeedleBench-128K tasks, you can use the following command directly; it calls the pre-defined model and dataset configuration files, with no extra configuration file to write.

##### Local Evaluation

```bash
# Local evaluation
python run.py --dataset needlebench_v2_128k --models vllm_qwen2_5_7b_instruct_128k --summarizer needlebench/needlebench_v2_128k_summarizer
```

##### Evaluation on a Slurm Cluster

```bash
# Slurm evaluation
python run.py --dataset needlebench_v2_128k --models vllm_qwen2_5_7b_instruct_128k --summarizer needlebench/needlebench_v2_128k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
```

##### Evaluating Specific Subsets

If you only want to test the original needle-in-a-haystack setup, change the dataset argument to `needlebench_v2_single_128k`, which corresponds to the single-needle version of the test at 128k length:

```bash
python run.py --dataset needlebench_v2_single_128k --models vllm_qwen2_5_7b_instruct_128k --summarizer needlebench/needlebench_v2_128k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
```

You can further select a sub-dataset, e.g. set `--datasets` to `needlebench_v2_single_128k/needlebench_zh_datasets` to run only the Chinese single-needle 128k test. The part after `/` names the sub-dataset; the available sub-dataset variables can be found in `opencompass/configs/datasets/needlebench_v2/needlebench_v2_128k/needlebench_v2_single_128k.py`:

```bash
python run.py --dataset needlebench_v2_single_128k/needlebench_zh_datasets --models vllm_qwen2_5_7b_instruct_128k --summarizer needlebench/needlebench_v2_128k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
```

Make sure the [VLLM](https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html) tool is installed before evaluating:

```bash
# Install vLLM with CUDA 12.4.
# For other CUDA versions, see https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html
pip install vllm
```

This command launches the evaluation; `-p partition_name` specifies the Slurm partition name, `-q auto` specifies the quota type (the resource queue type, e.g. auto or reserved), and `--max-num-workers` sets the maximum number of worker processes.

#### Evaluating Other `Huggingface` Models

For other models, we recommend writing a separate run config to modify the model's `max_seq_len` and `max_out_len` parameters, so that the model can receive the complete long-text content; see `examples/eval_needlebench_v2.py`.

Once the test `config` file is written, pass its path to `run.py` on the command line, for example:

```bash
python run.py examples/eval_needlebench_v2.py --slurm -p partition_name -q reserved --max-num-workers 16
```

Note that at this point we no longer pass `--dataset`, `--models`, or `--summarizer`, since these are already defined in the config file. You can manually adjust `--max-num-workers` to tune the number of parallel workers.

### Visualization

In the latest code, result visualization is built into the `summarizer`; you can find the corresponding plots in the `plots` directory of the output folder, with no need to manually visualize the scores at each depth and length.

### Citation

If you use this method, please cite:

```bibtex
@misc{li2025needlebenchllmsretrievalreasoning,
    title={NeedleBench: Can LLMs Do Retrieval and Reasoning in Information-Dense Context?},
    author={Mo Li and Songyang Zhang and Taolin Zhang and Haodong Duan and Yunxin Liu and Kai Chen},
    year={2025},
    eprint={2407.11963},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2407.11963},
}

@misc{2023opencompass,
    title={OpenCompass: A Universal Evaluation Platform for Foundation Models},
    author={OpenCompass Contributors},
    howpublished={\url{https://github.com/open-compass/opencompass}},
    year={2023}
}
```
@ -1,29 +0,0 @@

```python
from mmengine.config import read_base

with read_base():
    # Evaluate needlebench_4k, adjust the configuration to use 8k, 32k, 128k, 200k, or 1000k if necessary.
    # from opencompass.configs.datasets.needlebench.needlebench_4k.needlebench_4k import needlebench_datasets
    # from opencompass.configs.summarizers.needlebench import needlebench_4k_summarizer as summarizer

    # only eval original "needle in a haystack test" in needlebench_4k
    from opencompass.configs.datasets.needlebench.needlebench_4k.needlebench_single_4k import (
        needlebench_en_datasets, needlebench_zh_datasets)
    from opencompass.configs.models.hf_internlm.hf_internlm2_chat_7b import \
        models as internlm2_chat_7b
    from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_chat_7b import \
        models as internlm2_chat_7b_200k
    from opencompass.configs.summarizers.needlebench import \
        needlebench_4k_summarizer as summarizer

    # eval Ancestral Tracing Challenge(ATC)
    # from opencompass.configs.datasets.needlebench.atc.atc_choice_50 import needlebench_datasets
    # from opencompass.configs.summarizers.needlebench import atc_summarizer_50 as summarizer

datasets = sum([v for k, v in locals().items() if ('datasets' in k)], [])

for m in internlm2_chat_7b:
    m['max_seq_len'] = 32768  # Ensure the InternLM2-7B model can receive the full length of long texts; adjust for other models based on their supported maximum sequence length.
    m['max_out_len'] = 2000  # Ensure complete responses from the model in multi-needle retrieval tasks.

models = internlm2_chat_7b

work_dir = './outputs/needlebench'
```
examples/eval_needlebench_v2.py (new file, 27 lines)
@ -0,0 +1,27 @@

```python
from mmengine.config import read_base
# We use mmengine.config to import variables from other config files

with read_base():
    from opencompass.configs.models.hf_internlm.hf_internlm2_chat_7b import models as internlm2_chat_7b

    # Evaluate needlebench_32k; adjust the configuration to use 4k, 8k, 128k, 200k, or 1000k if necessary.
    # from opencompass.configs.datasets.needlebench_v2.needlebench_v2_32k.needlebench_v2_32k import needlebench_datasets
    # from opencompass.configs.summarizers.needlebench import needlebench_32k_summarizer as summarizer

    # only eval original "needle in a haystack test" in needlebench_32k
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_32k.needlebench_v2_single_32k import needlebench_zh_datasets, needlebench_en_datasets
    from opencompass.configs.summarizers.needlebench import needlebench_v2_32k_summarizer as summarizer

    # eval Ancestral Tracing Challenge(ATC)
    # from opencompass.configs.datasets.needlebench_v2.atc.atc_0shot_nocot_2_power_en import needlebench_datasets
    # ATC uses the default summarizer, so there is no need to import one

datasets = sum([v for k, v in locals().items() if ('datasets' in k)], [])

for m in internlm2_chat_7b:
    m['max_seq_len'] = 32768  # Ensure the InternLM2-7B model can receive the complete long text; other models should adjust this to their supported maximum sequence length.
    m['max_out_len'] = 4096

models = internlm2_chat_7b

work_dir = './outputs/needlebench'
```
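
Once this config exists, the evaluation reduces to a single command, matching the README section above:

```bash
python run.py examples/eval_needlebench_v2.py --slurm -p partition_name -q reserved --max-num-workers 16
```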
@ -0,0 +1,55 @@

```python
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets.needlebench_v2.atc import NeedleBenchATCDataset
from opencompass.datasets.needlebench_v2.atc import needlebench_atc_postprocess_v2
from opencompass.datasets.needlebench_v2.atc import NeedleBenchATCEvaluator

# ----------------------- Prompt Settings ----------------------- #
needle_num_list = [2, 4, 8, 16, 32, 64, 128, 256, 512]
path = 'opencompass/needlebench'
file_name = 'names.json'
repeats = 10

# ----------------------- Dataset Settings ----------------------- #

needlebench_datasets = []

needlebench_atc_reader_cfg = dict(input_columns=['prompt'], output_column='answer')

needlebench_atc_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template=dict(
            round=[
                dict(role='HUMAN', prompt='{prompt}'),
            ],
        ),
    ),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(
        type=GenInferencer,
    ),
)

needlebench_atc_eval_cfg = dict(
    evaluator=dict(type=NeedleBenchATCEvaluator),
    pred_postprocessor=dict(type=needlebench_atc_postprocess_v2),
)

for num_needles in needle_num_list:
    abbr = f'NeedleBenchATCDataset-{num_needles}Needle-EN'
    language = 'English'
    dataset_dict = {
        'abbr': abbr,
        'type': NeedleBenchATCDataset,
        'path': path,
        'file_name': file_name,
        'num_needles': num_needles,
        'language': language,
        'repeats': repeats,
        'reader_cfg': needlebench_atc_reader_cfg,
        'infer_cfg': needlebench_atc_infer_cfg,
        'eval_cfg': needlebench_atc_eval_cfg,
    }
    needlebench_datasets.append(dataset_dict)
```
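
Other configs in this commit import this file's eval config via the module path `opencompass.configs.datasets.needlebench_v2.atc.atc_0shot_nocot_2_power_en`. A minimal sketch of using the ATC datasets directly in a run config (the surrounding config is illustrative):

```python
from mmengine.config import read_base

with read_base():
    # Module path as referenced elsewhere in this commit
    from opencompass.configs.datasets.needlebench_v2.atc.atc_0shot_nocot_2_power_en import needlebench_datasets

datasets = needlebench_datasets
# ATC uses the default summarizer, so no summarizer import is needed
```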
@ -0,0 +1,18 @@

```python
from mmengine.config import read_base

with read_base():
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_1000k.needlebench_v2_multi_reasoning_1000k import needlebench_2needle_en_datasets as needlebench_multi_2needle_en_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_1000k.needlebench_v2_multi_reasoning_1000k import needlebench_3needle_en_datasets as needlebench_multi_3needle_en_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_1000k.needlebench_v2_multi_reasoning_1000k import needlebench_4needle_en_datasets as needlebench_multi_4needle_en_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_1000k.needlebench_v2_multi_reasoning_1000k import needlebench_5needle_en_datasets as needlebench_multi_5needle_en_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_1000k.needlebench_v2_multi_reasoning_1000k import needlebench_2needle_zh_datasets as needlebench_multi_2needle_zh_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_1000k.needlebench_v2_multi_reasoning_1000k import needlebench_3needle_zh_datasets as needlebench_multi_3needle_zh_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_1000k.needlebench_v2_multi_reasoning_1000k import needlebench_4needle_zh_datasets as needlebench_multi_4needle_zh_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_1000k.needlebench_v2_multi_reasoning_1000k import needlebench_5needle_zh_datasets as needlebench_multi_5needle_zh_datasets

    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_1000k.needlebench_v2_single_1000k import needlebench_en_datasets as needlebench_origin_en_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_1000k.needlebench_v2_single_1000k import needlebench_zh_datasets as needlebench_origin_zh_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_1000k.needlebench_v2_multi_retrieval_1000k import needlebench_en_datasets as needlebench_parallel_en_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_1000k.needlebench_v2_multi_retrieval_1000k import needlebench_zh_datasets as needlebench_parallel_zh_datasets

needlebench_datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])
```
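
A hedged sketch of consuming this aggregated list from a run config; the module path and summarizer name below are assumptions inferred from the naming pattern of the 32k/128k configs in this commit:

```python
from mmengine.config import read_base

with read_base():
    # Assumed module path, by analogy with needlebench_v2_32k.needlebench_v2_32k
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_1000k.needlebench_v2_1000k import needlebench_datasets
    # Assumed summarizer name, by analogy with needlebench_v2_128k_summarizer
    from opencompass.configs.summarizers.needlebench import needlebench_v2_1000k_summarizer as summarizer

datasets = needlebench_datasets
```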
@ -0,0 +1,93 @@

```python
from opencompass.datasets.needlebench_v2.multi import NeedleBenchMultiDataset
from mmengine.config import read_base

with read_base():
    from .needlebench_v2_single_1000k import depths_list, context_lengths
    from .needlebench_v2_single_1000k import needlebench_reader_cfg, needlebench_infer_cfg
    from opencompass.configs.datasets.needlebench_v2.atc.atc_0shot_nocot_2_power_en import needlebench_atc_eval_cfg as needlebench_eval_cfg


# ----------English Version----------
base_path = 'opencompass/needlebench'
file_list = ['PaulGrahamEssays.jsonl']
needle_file_name = 'names.json'
diff = 10
language = 'English'
length_buffer = 3000

# Initialize dataset lists
needlebench_2needle_en_datasets = []
needlebench_3needle_en_datasets = []
needlebench_4needle_en_datasets = []
needlebench_5needle_en_datasets = []

# Create datasets for different numbers of needles
for num_needles in range(2, 6):
    dataset_list_name = f'needlebench_{num_needles}needle_en_datasets'

    for original_context_length in context_lengths:
        for depth_percent in depths_list:
            dataset_dict = {
                'abbr': f'Length{original_context_length}'
                        f'Depth{int(depth_percent)}_{num_needles}needle_en_1000k',
                'type': NeedleBenchMultiDataset,
                'path': base_path,
                'length': original_context_length,
                'depth': int(depth_percent),
                'tokenizer_model': 'gpt-4',
                'file_list': file_list,
                'num_repeats_per_file': 10,
                'length_buffer': length_buffer,
                'language': language,
                'needle_file_name': needle_file_name,
                'num_needles': num_needles,
                'diff': diff,
                'reader_cfg': needlebench_reader_cfg,
                'infer_cfg': needlebench_infer_cfg,
                'eval_cfg': needlebench_eval_cfg,
            }

            # Add to the appropriate list using globals()
            globals()[dataset_list_name].append(dataset_dict)


# ----------Chinese Version----------
base_path = 'opencompass/needlebench'
file_list = ['zh_finance.jsonl']
needle_file_name = 'names.json'
diff = 10
language = 'Chinese'
length_buffer = 200

# Initialize dataset lists
needlebench_2needle_zh_datasets = []
needlebench_3needle_zh_datasets = []
needlebench_4needle_zh_datasets = []
needlebench_5needle_zh_datasets = []

# Create datasets for different numbers of needles
for num_needles in range(2, 6):
    dataset_list_name = f'needlebench_{num_needles}needle_zh_datasets'

    for original_context_length in context_lengths:
        for depth_percent in depths_list:
            dataset_dict = {
                'abbr': f'Length{original_context_length}'
                        f'Depth{int(depth_percent)}_{num_needles}needle_zh_1000k',
                'type': NeedleBenchMultiDataset,
                'path': base_path,
                'length': original_context_length,
                'depth': int(depth_percent),
                'tokenizer_model': 'gpt-4',
                'file_list': file_list,
                'num_repeats_per_file': 10,
                'length_buffer': length_buffer,
                'language': language,
                'needle_file_name': needle_file_name,
                'num_needles': num_needles,
                'diff': diff,
                'reader_cfg': needlebench_reader_cfg,
                'infer_cfg': needlebench_infer_cfg,
                'eval_cfg': needlebench_eval_cfg,
            }

            # Add to the appropriate list using globals()
            globals()[dataset_list_name].append(dataset_dict)
```
@ -0,0 +1,55 @@

```python
from opencompass.datasets.needlebench_v2.parallel import NeedleBenchParallelDataset
from opencompass.datasets.needlebench_v2.parallel import NeedleBenchParallelEvaluator
from mmengine.config import read_base

with read_base():
    from .needlebench_v2_single_1000k import depths_list as depths, context_lengths
    from .needlebench_v2_single_1000k import needlebench_reader_cfg, needlebench_infer_cfg, needlebench_eval_cfg

needlebench_eval_cfg['evaluator']['type'] = NeedleBenchParallelEvaluator

base_path = 'opencompass/needlebench'
needle_file_name = 'needles.jsonl'

# Define configurations for both English and Chinese datasets
language_configs = [
    {
        'file_list': ['PaulGrahamEssays.jsonl'],
        'dataset_var': 'needlebench_en_datasets',
        'language': 'English',
        'length_buffer': 3000,
        'suffix': 'en'
    },
    {
        'file_list': ['zh_finance.jsonl'],
        'dataset_var': 'needlebench_zh_datasets',
        'language': 'Chinese',
        'length_buffer': 200,
        'suffix': 'zh'
    }
]

# Initialize empty dataset lists
needlebench_en_datasets = []
needlebench_zh_datasets = []

# Single loop to handle both languages
for config in language_configs:
    for original_context_length in context_lengths:
        dataset_dict = {
            'abbr': f'Length{original_context_length}_parallel_{config["suffix"]}_1000k',
            'type': NeedleBenchParallelDataset,
            'path': base_path,
            'needle_file_name': needle_file_name,
            'length': original_context_length,
            'depths': depths,
            'tokenizer_model': 'gpt-4',
            'file_list': config['file_list'],
            'num_repeats_per_file': 25,
            'length_buffer': config['length_buffer'],
            'language': config['language'],
            'reader_cfg': needlebench_reader_cfg,
            'infer_cfg': needlebench_infer_cfg,
            'eval_cfg': needlebench_eval_cfg,
        }
        globals()[config['dataset_var']].append(dataset_dict)
```
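
One detail worth noting in the parallel (multi-retrieval) configs: `needlebench_eval_cfg` is imported from the single-needle config and mutated in place, swapping its evaluator for `NeedleBenchParallelEvaluator` before the loop builds the dataset dicts, so every parallel dataset is scored by the parallel evaluator while reusing the single-needle reader and inferencer settings.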
@ -0,0 +1,81 @@

```python
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets.needlebench_v2.origin import NeedleBenchOriginDataset
from opencompass.datasets.needlebench_v2.origin import NeedleBenchOriginEvaluator
from opencompass.datasets.needlebench_v2.origin import needlebench_postprocess
from opencompass.datasets.needlebench_v2.origin import needlebench_dataset_postprocess


needlebench_reader_cfg = dict(input_columns=['prompt'], output_column='answer')

needlebench_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template=dict(
            round=[
                dict(role='HUMAN', prompt='{prompt}'),
                dict(role='BOT', prompt='{answer}\n'),
            ]
        ),
    ),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=GenInferencer),
)

needlebench_eval_cfg = dict(
    evaluator=dict(type=NeedleBenchOriginEvaluator),
    pred_postprocessor=dict(type=needlebench_postprocess),
    dataset_postprocessor=dict(type=needlebench_dataset_postprocess),
    pred_role='BOT',
)

context_lengths = [1000, 125000, 250000, 375000, 500000, 625000, 750000, 875000, 1000000]
depths_list = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
base_path = 'opencompass/needlebench'
needle_file_name = 'needles.jsonl'

# Define configurations for both English and Chinese datasets
language_configs = [
    {
        'file_list': ['PaulGrahamEssays.jsonl'],
        'dataset_var': 'needlebench_en_datasets',
        'language': 'English',
        'length_buffer': 3000,
        'suffix': 'en'
    },
    {
        'file_list': ['zh_finance.jsonl'],
        'dataset_var': 'needlebench_zh_datasets',
        'language': 'Chinese',
        'length_buffer': 200,
        'suffix': 'zh'
    }
]

# Initialize empty dataset lists
needlebench_en_datasets = []
needlebench_zh_datasets = []

# Single loop to handle both languages
for config in language_configs:
    for original_context_length in context_lengths:
        for depth_percent in depths_list:
            dataset_dict = {
                'abbr': f'Length{original_context_length}'
                        f'Depth{int(depth_percent)}_origin_{config["suffix"]}_1000k',
                'type': NeedleBenchOriginDataset,
                'path': base_path,
                'length': original_context_length,
                'depth': int(depth_percent),
                'tokenizer_model': 'gpt-4',
                'file_list': config['file_list'],
                'num_repeats_per_file': 10,
                'length_buffer': config['length_buffer'],
                'language': config['language'],
                'needle_file_name': needle_file_name,
                'reader_cfg': needlebench_reader_cfg,
                'infer_cfg': needlebench_infer_cfg,
                'eval_cfg': needlebench_eval_cfg,
            }
            globals()[config['dataset_var']].append(dataset_dict)
```
@ -0,0 +1,32 @@

```python
from mmengine.config import read_base

with read_base():
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_128k.needlebench_v2_multi_reasoning_128k import needlebench_2needle_en_datasets as needlebench_multi_2needle_en_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_128k.needlebench_v2_multi_reasoning_128k import needlebench_3needle_en_datasets as needlebench_multi_3needle_en_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_128k.needlebench_v2_multi_reasoning_128k import needlebench_4needle_en_datasets as needlebench_multi_4needle_en_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_128k.needlebench_v2_multi_reasoning_128k import needlebench_5needle_en_datasets as needlebench_multi_5needle_en_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_128k.needlebench_v2_multi_reasoning_128k import needlebench_2needle_zh_datasets as needlebench_multi_2needle_zh_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_128k.needlebench_v2_multi_reasoning_128k import needlebench_3needle_zh_datasets as needlebench_multi_3needle_zh_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_128k.needlebench_v2_multi_reasoning_128k import needlebench_4needle_zh_datasets as needlebench_multi_4needle_zh_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_128k.needlebench_v2_multi_reasoning_128k import needlebench_5needle_zh_datasets as needlebench_multi_5needle_zh_datasets

    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_128k.needlebench_v2_single_128k import needlebench_en_datasets as needlebench_origin_en_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_128k.needlebench_v2_single_128k import needlebench_zh_datasets as needlebench_origin_zh_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_128k.needlebench_v2_multi_retrieval_128k import needlebench_en_datasets as needlebench_parallel_en_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_128k.needlebench_v2_multi_retrieval_128k import needlebench_zh_datasets as needlebench_parallel_zh_datasets

needlebench_datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])


if __name__ == '__main__':
    print(len(needlebench_datasets))
    # sum num_repeats_per_file of all datasets
    num_repeats_per_file = sum(dataset['num_repeats_per_file'] for dataset in needlebench_datasets) * 8
    print(num_repeats_per_file)
    # every repeat is 5 seconds
    print(num_repeats_per_file * 5 / 60, 'minutes')
    # print number of hours
    print(num_repeats_per_file * 5 / 3600, 'hours')

    # if every repeat is 2 minutes, how many days
    print(num_repeats_per_file * 2 / 60 / 24, 'days')
```
@ -0,0 +1,93 @@

```python
from opencompass.datasets.needlebench_v2.multi import NeedleBenchMultiDataset
from mmengine.config import read_base

with read_base():
    from .needlebench_v2_single_128k import depths_list, context_lengths
    from .needlebench_v2_single_128k import needlebench_reader_cfg, needlebench_infer_cfg
    from opencompass.configs.datasets.needlebench_v2.atc.atc_0shot_nocot_2_power_en import needlebench_atc_eval_cfg as needlebench_eval_cfg


# ----------English Version----------
base_path = 'opencompass/needlebench'
file_list = ['PaulGrahamEssays.jsonl']
needle_file_name = 'names.json'
diff = 10
language = 'English'
length_buffer = 3000

# Initialize dataset lists
needlebench_2needle_en_datasets = []
needlebench_3needle_en_datasets = []
needlebench_4needle_en_datasets = []
needlebench_5needle_en_datasets = []

# Create datasets for different numbers of needles
for num_needles in range(2, 6):
    dataset_list_name = f'needlebench_{num_needles}needle_en_datasets'

    for original_context_length in context_lengths:
        for depth_percent in depths_list:
            dataset_dict = {
                'abbr': f'Length{original_context_length}'
                        f'Depth{int(depth_percent)}_{num_needles}needle_en_128k',
                'type': NeedleBenchMultiDataset,
                'path': base_path,
                'length': original_context_length,
                'depth': int(depth_percent),
                'tokenizer_model': 'gpt-4',
                'file_list': file_list,
                'num_repeats_per_file': 10,
                'length_buffer': length_buffer,
                'language': language,
                'needle_file_name': needle_file_name,
                'num_needles': num_needles,
                'diff': diff,
                'reader_cfg': needlebench_reader_cfg,
                'infer_cfg': needlebench_infer_cfg,
                'eval_cfg': needlebench_eval_cfg,
            }

            # Add to the appropriate list using globals()
            globals()[dataset_list_name].append(dataset_dict)


# ----------Chinese Version----------
base_path = 'opencompass/needlebench'
file_list = ['zh_finance.jsonl']
needle_file_name = 'names.json'
diff = 10
language = 'Chinese'
length_buffer = 200

# Initialize dataset lists
needlebench_2needle_zh_datasets = []
needlebench_3needle_zh_datasets = []
needlebench_4needle_zh_datasets = []
needlebench_5needle_zh_datasets = []

# Create datasets for different numbers of needles
for num_needles in range(2, 6):
    dataset_list_name = f'needlebench_{num_needles}needle_zh_datasets'

    for original_context_length in context_lengths:
        for depth_percent in depths_list:
            dataset_dict = {
                'abbr': f'Length{original_context_length}'
                        f'Depth{int(depth_percent)}_{num_needles}needle_zh_128k',
                'type': NeedleBenchMultiDataset,
                'path': base_path,
                'length': original_context_length,
                'depth': int(depth_percent),
                'tokenizer_model': 'gpt-4',
                'file_list': file_list,
                'num_repeats_per_file': 10,
                'length_buffer': length_buffer,
                'language': language,
                'needle_file_name': needle_file_name,
                'num_needles': num_needles,
                'diff': diff,
                'reader_cfg': needlebench_reader_cfg,
                'infer_cfg': needlebench_infer_cfg,
                'eval_cfg': needlebench_eval_cfg,
            }

            # Add to the appropriate list using globals()
            globals()[dataset_list_name].append(dataset_dict)
```
@ -0,0 +1,55 @@

```python
from opencompass.datasets.needlebench_v2.parallel import NeedleBenchParallelDataset
from opencompass.datasets.needlebench_v2.parallel import NeedleBenchParallelEvaluator
from mmengine.config import read_base

with read_base():
    from .needlebench_v2_single_128k import depths_list as depths, context_lengths
    from .needlebench_v2_single_128k import needlebench_reader_cfg, needlebench_infer_cfg, needlebench_eval_cfg

needlebench_eval_cfg['evaluator']['type'] = NeedleBenchParallelEvaluator

base_path = 'opencompass/needlebench'
needle_file_name = 'needles.jsonl'

# Define configurations for both English and Chinese datasets
language_configs = [
    {
        'file_list': ['PaulGrahamEssays.jsonl'],
        'dataset_var': 'needlebench_en_datasets',
        'language': 'English',
        'length_buffer': 3000,
        'suffix': 'en'
    },
    {
        'file_list': ['zh_finance.jsonl'],
        'dataset_var': 'needlebench_zh_datasets',
        'language': 'Chinese',
        'length_buffer': 200,
        'suffix': 'zh'
    }
]

# Initialize empty dataset lists
needlebench_en_datasets = []
needlebench_zh_datasets = []

# Single loop to handle both languages
for config in language_configs:
    for original_context_length in context_lengths:
        dataset_dict = {
            'abbr': f'Length{original_context_length}_parallel_{config["suffix"]}_128k',
            'type': NeedleBenchParallelDataset,
            'path': base_path,
            'needle_file_name': needle_file_name,
            'length': original_context_length,
            'depths': depths,
            'tokenizer_model': 'gpt-4',
            'file_list': config['file_list'],
            'num_repeats_per_file': 25,
            'length_buffer': config['length_buffer'],
            'language': config['language'],
            'reader_cfg': needlebench_reader_cfg,
            'infer_cfg': needlebench_infer_cfg,
            'eval_cfg': needlebench_eval_cfg,
        }
        globals()[config['dataset_var']].append(dataset_dict)
```
@ -0,0 +1,82 @@

```python
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets.needlebench_v2.origin import NeedleBenchOriginDataset
from opencompass.datasets.needlebench_v2.origin import NeedleBenchOriginEvaluator
from opencompass.datasets.needlebench_v2.origin import needlebench_postprocess
from opencompass.datasets.needlebench_v2.origin import needlebench_dataset_postprocess


needlebench_reader_cfg = dict(input_columns=['prompt'], output_column='answer')

needlebench_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template=dict(
            round=[
                dict(role='HUMAN', prompt='{prompt}'),
                dict(role='BOT', prompt='{answer}\n'),
            ]
        ),
    ),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=GenInferencer),
)

needlebench_eval_cfg = dict(
    evaluator=dict(type=NeedleBenchOriginEvaluator),
    pred_postprocessor=dict(type=needlebench_postprocess),
    dataset_postprocessor=dict(type=needlebench_dataset_postprocess),
    pred_role='BOT',
)

context_lengths = [1000, 2000, 4000, 8000, 16000, 32000, 64000, 128000]
# context_lengths = [128000]
depths_list = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
base_path = 'opencompass/needlebench'
needle_file_name = 'needles.jsonl'

# Define configurations for both English and Chinese datasets
language_configs = [
    {
        'file_list': ['PaulGrahamEssays.jsonl'],
        'dataset_var': 'needlebench_en_datasets',
        'language': 'English',
        'length_buffer': 3000,
        'suffix': 'en'
    },
    {
        'file_list': ['zh_finance.jsonl'],
        'dataset_var': 'needlebench_zh_datasets',
        'language': 'Chinese',
        'length_buffer': 200,
        'suffix': 'zh'
    }
]

# Initialize empty dataset lists
needlebench_en_datasets = []
needlebench_zh_datasets = []

# Single loop to handle both languages
for config in language_configs:
    for original_context_length in context_lengths:
        for depth_percent in depths_list:
            dataset_dict = {
                'abbr': f'Length{original_context_length}'
                        f'Depth{int(depth_percent)}_origin_{config["suffix"]}_128k',
                'type': NeedleBenchOriginDataset,
                'path': base_path,
                'length': original_context_length,
                'depth': int(depth_percent),
                'tokenizer_model': 'gpt-4',
                'file_list': config['file_list'],
                'num_repeats_per_file': 10,
                'length_buffer': config['length_buffer'],
                'language': config['language'],
                'needle_file_name': needle_file_name,
                'reader_cfg': needlebench_reader_cfg,
                'infer_cfg': needlebench_infer_cfg,
                'eval_cfg': needlebench_eval_cfg,
            }
            globals()[config['dataset_var']].append(dataset_dict)
```
@@ -0,0 +1,18 @@
from mmengine.config import read_base

with read_base():
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_200k.needlebench_v2_multi_reasoning_200k import needlebench_2needle_en_datasets as needlebench_multi_2needle_en_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_200k.needlebench_v2_multi_reasoning_200k import needlebench_3needle_en_datasets as needlebench_multi_3needle_en_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_200k.needlebench_v2_multi_reasoning_200k import needlebench_4needle_en_datasets as needlebench_multi_4needle_en_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_200k.needlebench_v2_multi_reasoning_200k import needlebench_5needle_en_datasets as needlebench_multi_5needle_en_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_200k.needlebench_v2_multi_reasoning_200k import needlebench_2needle_zh_datasets as needlebench_multi_2needle_zh_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_200k.needlebench_v2_multi_reasoning_200k import needlebench_3needle_zh_datasets as needlebench_multi_3needle_zh_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_200k.needlebench_v2_multi_reasoning_200k import needlebench_4needle_zh_datasets as needlebench_multi_4needle_zh_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_200k.needlebench_v2_multi_reasoning_200k import needlebench_5needle_zh_datasets as needlebench_multi_5needle_zh_datasets

    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_200k.needlebench_v2_single_200k import needlebench_en_datasets as needlebench_origin_en_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_200k.needlebench_v2_single_200k import needlebench_zh_datasets as needlebench_origin_zh_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_200k.needlebench_v2_multi_retrieval_200k import needlebench_en_datasets as needlebench_parallel_en_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_200k.needlebench_v2_multi_retrieval_200k import needlebench_zh_datasets as needlebench_parallel_zh_datasets

needlebench_datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])
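The summary file above folds every module-level list whose name ends in `_datasets` into one `needlebench_datasets` list via `sum(..., [])`. As a hedged sketch of how it might be consumed from a top-level evaluation config (the dataset path is assumed from this PR's directory layout; the model import is only illustrative and should be swapped for any model config present in your installation):

```python
from mmengine.config import read_base

with read_base():
    # Aggregated NeedleBench V2 datasets (path assumed from this PR's layout)
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_200k.needlebench_v2_200k import needlebench_datasets
    # Illustrative model config; replace with one available in your setup
    from opencompass.configs.models.hf_internlm.hf_internlm2_5_7b_chat import models

datasets = needlebench_datasets
```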
@@ -0,0 +1,93 @@
from opencompass.datasets.needlebench_v2.multi import NeedleBenchMultiDataset
from mmengine.config import read_base

with read_base():
    from .needlebench_v2_single_200k import depths_list, context_lengths
    from .needlebench_v2_single_200k import needlebench_reader_cfg, needlebench_infer_cfg
    from opencompass.configs.datasets.needlebench_v2.atc.atc_0shot_nocot_2_power_en import needlebench_atc_eval_cfg as needlebench_eval_cfg

# ----------English Version----------
base_path = 'opencompass/needlebench'
file_list = ['PaulGrahamEssays.jsonl']
needle_file_name = 'names.json'
diff = 10
language = 'English'
length_buffer = 3000

# Initialize dataset lists
needlebench_2needle_en_datasets = []
needlebench_3needle_en_datasets = []
needlebench_4needle_en_datasets = []
needlebench_5needle_en_datasets = []

# Create datasets for different numbers of needles
for num_needles in range(2, 6):
    dataset_list_name = f'needlebench_{num_needles}needle_en_datasets'

    for original_context_length in context_lengths:
        for depth_percent in depths_list:
            dataset_dict = {
                'abbr': f'Length{original_context_length}'
                f'Depth{int(depth_percent)}_{num_needles}needle_en_200k',
                'type': NeedleBenchMultiDataset,
                'path': base_path,
                'length': original_context_length,
                'depth': int(depth_percent),
                'tokenizer_model': 'gpt-4',
                'file_list': file_list,
                'num_repeats_per_file': 10,
                'length_buffer': length_buffer,
                'language': language,
                'needle_file_name': needle_file_name,
                'num_needles': num_needles,
                'diff': diff,
                'reader_cfg': needlebench_reader_cfg,
                'infer_cfg': needlebench_infer_cfg,
                'eval_cfg': needlebench_eval_cfg,
            }

            # Add to the appropriate list using globals()
            globals()[dataset_list_name].append(dataset_dict)

# ----------Chinese Version----------
base_path = 'opencompass/needlebench'
file_list = ['zh_finance.jsonl']
needle_file_name = 'names.json'
diff = 10
language = 'Chinese'
length_buffer = 200

# Initialize dataset lists
needlebench_2needle_zh_datasets = []
needlebench_3needle_zh_datasets = []
needlebench_4needle_zh_datasets = []
needlebench_5needle_zh_datasets = []

# Create datasets for different numbers of needles
for num_needles in range(2, 6):
    dataset_list_name = f'needlebench_{num_needles}needle_zh_datasets'

    for original_context_length in context_lengths:
        for depth_percent in depths_list:
            dataset_dict = {
                'abbr': f'Length{original_context_length}'
                f'Depth{int(depth_percent)}_{num_needles}needle_zh_200k',
                'type': NeedleBenchMultiDataset,
                'path': base_path,
                'length': original_context_length,
                'depth': int(depth_percent),
                'tokenizer_model': 'gpt-4',
                'file_list': file_list,
                'num_repeats_per_file': 10,
                'length_buffer': length_buffer,
                'language': language,
                'needle_file_name': needle_file_name,
                'num_needles': num_needles,
                'diff': diff,
                'reader_cfg': needlebench_reader_cfg,
                'infer_cfg': needlebench_infer_cfg,
                'eval_cfg': needlebench_eval_cfg,
            }

            # Add to the appropriate list using globals()
            globals()[dataset_list_name].append(dataset_dict)
@@ -0,0 +1,55 @@
from opencompass.datasets.needlebench_v2.parallel import NeedleBenchParallelDataset
from opencompass.datasets.needlebench_v2.parallel import NeedleBenchParallelEvaluator
from mmengine.config import read_base

with read_base():
    from .needlebench_v2_single_200k import depths_list as depths, context_lengths
    from .needlebench_v2_single_200k import needlebench_reader_cfg, needlebench_infer_cfg, needlebench_eval_cfg

needlebench_eval_cfg['evaluator']['type'] = NeedleBenchParallelEvaluator

base_path = 'opencompass/needlebench'
needle_file_name = 'needles.jsonl'

# Define configurations for both English and Chinese datasets
language_configs = [
    {
        'file_list': ['PaulGrahamEssays.jsonl'],
        'dataset_var': 'needlebench_en_datasets',
        'language': 'English',
        'length_buffer': 3000,
        'suffix': 'en'
    },
    {
        'file_list': ['zh_finance.jsonl'],
        'dataset_var': 'needlebench_zh_datasets',
        'language': 'Chinese',
        'length_buffer': 200,
        'suffix': 'zh'
    }
]

# Initialize empty dataset lists
needlebench_en_datasets = []
needlebench_zh_datasets = []

# Single loop to handle both languages
for config in language_configs:
    for original_context_length in context_lengths:
        dataset_dict = {
            'abbr': f'Length{original_context_length}_parallel_{config["suffix"]}_200k',
            'type': NeedleBenchParallelDataset,
            'path': base_path,
            'needle_file_name': needle_file_name,
            'length': original_context_length,
            'depths': depths,
            'tokenizer_model': 'gpt-4',
            'file_list': config['file_list'],
            'num_repeats_per_file': 25,
            'length_buffer': config['length_buffer'],
            'language': config['language'],
            'reader_cfg': needlebench_reader_cfg,
            'infer_cfg': needlebench_infer_cfg,
            'eval_cfg': needlebench_eval_cfg,
        }
        globals()[config['dataset_var']].append(dataset_dict)
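A design note on the override near the top of this file: `needlebench_eval_cfg` is imported from the single-needle config and then mutated in place to swap in the parallel evaluator. That is fine while this module is the only consumer of the imported dict; if the object were ever shared between configs, a defensive variant (a sketch, not part of this PR) would copy before overriding:

```python
from copy import deepcopy

# Copy the imported eval config before changing the evaluator type, so the
# single-needle config object is left untouched for other importers.
needlebench_eval_cfg = deepcopy(needlebench_eval_cfg)
needlebench_eval_cfg['evaluator']['type'] = NeedleBenchParallelEvaluator
```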
@@ -0,0 +1,81 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets.needlebench_v2.origin import NeedleBenchOriginDataset
from opencompass.datasets.needlebench_v2.origin import NeedleBenchOriginEvaluator
from opencompass.datasets.needlebench_v2.origin import needlebench_postprocess
from opencompass.datasets.needlebench_v2.origin import needlebench_dataset_postprocess

needlebench_reader_cfg = dict(input_columns=['prompt'], output_column='answer')

needlebench_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template=dict(
            round=[
                dict(role='HUMAN', prompt='{prompt}'),
                dict(role='BOT', prompt='{answer}\n'),
            ]
        ),
    ),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=GenInferencer),
)

needlebench_eval_cfg = dict(
    evaluator=dict(type=NeedleBenchOriginEvaluator),
    pred_postprocessor=dict(type=needlebench_postprocess),
    dataset_postprocessor=dict(type=needlebench_dataset_postprocess),
    pred_role='BOT',
)

context_lengths = [1000, 25000, 50000, 75000, 100000, 125000, 150000, 175000, 200000]
depths_list = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
base_path = 'opencompass/needlebench'
needle_file_name = 'needles.jsonl'

# Define configurations for both English and Chinese datasets
language_configs = [
    {
        'file_list': ['PaulGrahamEssays.jsonl'],
        'dataset_var': 'needlebench_en_datasets',
        'language': 'English',
        'length_buffer': 3000,
        'suffix': 'en'
    },
    {
        'file_list': ['zh_finance.jsonl'],
        'dataset_var': 'needlebench_zh_datasets',
        'language': 'Chinese',
        'length_buffer': 200,
        'suffix': 'zh'
    }
]

# Initialize empty dataset lists
needlebench_en_datasets = []
needlebench_zh_datasets = []

# Single loop to handle both languages
for config in language_configs:
    for original_context_length in context_lengths:
        for depth_percent in depths_list:
            dataset_dict = {
                'abbr': f'Length{original_context_length}'
                f'Depth{int(depth_percent)}_origin_{config["suffix"]}_200k',
                'type': NeedleBenchOriginDataset,
                'path': base_path,
                'length': original_context_length,
                'depth': int(depth_percent),
                'tokenizer_model': 'gpt-4',
                'file_list': config['file_list'],
                'num_repeats_per_file': 10,
                'length_buffer': config['length_buffer'],
                'language': config['language'],
                'needle_file_name': needle_file_name,
                'reader_cfg': needlebench_reader_cfg,
                'infer_cfg': needlebench_infer_cfg,
                'eval_cfg': needlebench_eval_cfg,
            }
            globals()[config['dataset_var']].append(dataset_dict)
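For a sense of scale, the nested loops above generate one dataset config per (length, depth) pair and per language; a quick check of the grid size, computed directly from the lists in this file:

```python
# The nested loops above yield one dataset entry per (length, depth) pair:
context_lengths = [1000, 25000, 50000, 75000, 100000, 125000, 150000, 175000, 200000]
depths_list = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
print(len(context_lengths) * len(depths_list))  # 99 dataset entries per language
```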
@@ -0,0 +1,18 @@
from mmengine.config import read_base

with read_base():
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_256k.needlebench_v2_multi_reasoning_256k import needlebench_2needle_en_datasets as needlebench_multi_2needle_en_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_256k.needlebench_v2_multi_reasoning_256k import needlebench_3needle_en_datasets as needlebench_multi_3needle_en_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_256k.needlebench_v2_multi_reasoning_256k import needlebench_4needle_en_datasets as needlebench_multi_4needle_en_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_256k.needlebench_v2_multi_reasoning_256k import needlebench_5needle_en_datasets as needlebench_multi_5needle_en_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_256k.needlebench_v2_multi_reasoning_256k import needlebench_2needle_zh_datasets as needlebench_multi_2needle_zh_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_256k.needlebench_v2_multi_reasoning_256k import needlebench_3needle_zh_datasets as needlebench_multi_3needle_zh_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_256k.needlebench_v2_multi_reasoning_256k import needlebench_4needle_zh_datasets as needlebench_multi_4needle_zh_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_256k.needlebench_v2_multi_reasoning_256k import needlebench_5needle_zh_datasets as needlebench_multi_5needle_zh_datasets

    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_256k.needlebench_v2_single_256k import needlebench_en_datasets as needlebench_origin_en_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_256k.needlebench_v2_single_256k import needlebench_zh_datasets as needlebench_origin_zh_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_256k.needlebench_v2_multi_retrieval_256k import needlebench_en_datasets as needlebench_parallel_en_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_256k.needlebench_v2_multi_retrieval_256k import needlebench_zh_datasets as needlebench_parallel_zh_datasets

needlebench_datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])
@@ -0,0 +1,93 @@
from opencompass.datasets.needlebench_v2.multi import NeedleBenchMultiDataset
from mmengine.config import read_base

with read_base():
    from .needlebench_v2_single_256k import depths_list, context_lengths
    from .needlebench_v2_single_256k import needlebench_reader_cfg, needlebench_infer_cfg
    from opencompass.configs.datasets.needlebench_v2.atc.atc_0shot_nocot_2_power_en import needlebench_atc_eval_cfg as needlebench_eval_cfg

# ----------English Version----------
base_path = 'opencompass/needlebench'
file_list = ['PaulGrahamEssays.jsonl']
needle_file_name = 'names.json'
diff = 10
language = 'English'
length_buffer = 3000

# Initialize dataset lists
needlebench_2needle_en_datasets = []
needlebench_3needle_en_datasets = []
needlebench_4needle_en_datasets = []
needlebench_5needle_en_datasets = []

# Create datasets for different numbers of needles
for num_needles in range(2, 6):
    dataset_list_name = f'needlebench_{num_needles}needle_en_datasets'

    for original_context_length in context_lengths:
        for depth_percent in depths_list:
            dataset_dict = {
                'abbr': f'Length{original_context_length}'
                f'Depth{int(depth_percent)}_{num_needles}needle_en_256k',
                'type': NeedleBenchMultiDataset,
                'path': base_path,
                'length': original_context_length,
                'depth': int(depth_percent),
                'tokenizer_model': 'gpt-4',
                'file_list': file_list,
                'num_repeats_per_file': 10,
                'length_buffer': length_buffer,
                'language': language,
                'needle_file_name': needle_file_name,
                'num_needles': num_needles,
                'diff': diff,
                'reader_cfg': needlebench_reader_cfg,
                'infer_cfg': needlebench_infer_cfg,
                'eval_cfg': needlebench_eval_cfg,
            }

            # Add to the appropriate list using globals()
            globals()[dataset_list_name].append(dataset_dict)

# ----------Chinese Version----------
base_path = 'opencompass/needlebench'
file_list = ['zh_finance.jsonl']
needle_file_name = 'names.json'
diff = 10
language = 'Chinese'
length_buffer = 200

# Initialize dataset lists
needlebench_2needle_zh_datasets = []
needlebench_3needle_zh_datasets = []
needlebench_4needle_zh_datasets = []
needlebench_5needle_zh_datasets = []

# Create datasets for different numbers of needles
for num_needles in range(2, 6):
    dataset_list_name = f'needlebench_{num_needles}needle_zh_datasets'

    for original_context_length in context_lengths:
        for depth_percent in depths_list:
            dataset_dict = {
                'abbr': f'Length{original_context_length}'
                f'Depth{int(depth_percent)}_{num_needles}needle_zh_256k',
                'type': NeedleBenchMultiDataset,
                'path': base_path,
                'length': original_context_length,
                'depth': int(depth_percent),
                'tokenizer_model': 'gpt-4',
                'file_list': file_list,
                'num_repeats_per_file': 10,
                'length_buffer': length_buffer,
                'language': language,
                'needle_file_name': needle_file_name,
                'num_needles': num_needles,
                'diff': diff,
                'reader_cfg': needlebench_reader_cfg,
                'infer_cfg': needlebench_infer_cfg,
                'eval_cfg': needlebench_eval_cfg,
            }

            # Add to the appropriate list using globals()
            globals()[dataset_list_name].append(dataset_dict)
@@ -0,0 +1,55 @@
from opencompass.datasets.needlebench_v2.parallel import NeedleBenchParallelDataset
from opencompass.datasets.needlebench_v2.parallel import NeedleBenchParallelEvaluator
from mmengine.config import read_base

with read_base():
    from .needlebench_v2_single_256k import depths_list as depths, context_lengths
    from .needlebench_v2_single_256k import needlebench_reader_cfg, needlebench_infer_cfg, needlebench_eval_cfg

needlebench_eval_cfg['evaluator']['type'] = NeedleBenchParallelEvaluator

base_path = 'opencompass/needlebench'
needle_file_name = 'needles.jsonl'

# Define configurations for both English and Chinese datasets
language_configs = [
    {
        'file_list': ['PaulGrahamEssays.jsonl'],
        'dataset_var': 'needlebench_en_datasets',
        'language': 'English',
        'length_buffer': 3000,
        'suffix': 'en'
    },
    {
        'file_list': ['zh_finance.jsonl'],
        'dataset_var': 'needlebench_zh_datasets',
        'language': 'Chinese',
        'length_buffer': 200,
        'suffix': 'zh'
    }
]

# Initialize empty dataset lists
needlebench_en_datasets = []
needlebench_zh_datasets = []

# Single loop to handle both languages
for config in language_configs:
    for original_context_length in context_lengths:
        dataset_dict = {
            'abbr': f'Length{original_context_length}_parallel_{config["suffix"]}_256k',
            'type': NeedleBenchParallelDataset,
            'path': base_path,
            'needle_file_name': needle_file_name,
            'length': original_context_length,
            'depths': depths,
            'tokenizer_model': 'gpt-4',
            'file_list': config['file_list'],
            'num_repeats_per_file': 25,
            'length_buffer': config['length_buffer'],
            'language': config['language'],
            'reader_cfg': needlebench_reader_cfg,
            'infer_cfg': needlebench_infer_cfg,
            'eval_cfg': needlebench_eval_cfg,
        }
        globals()[config['dataset_var']].append(dataset_dict)
@@ -0,0 +1,81 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets.needlebench_v2.origin import NeedleBenchOriginDataset
from opencompass.datasets.needlebench_v2.origin import NeedleBenchOriginEvaluator
from opencompass.datasets.needlebench_v2.origin import needlebench_postprocess
from opencompass.datasets.needlebench_v2.origin import needlebench_dataset_postprocess

needlebench_reader_cfg = dict(input_columns=['prompt'], output_column='answer')

needlebench_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template=dict(
            round=[
                dict(role='HUMAN', prompt='{prompt}'),
                dict(role='BOT', prompt='{answer}\n'),
            ]
        ),
    ),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=GenInferencer),
)

needlebench_eval_cfg = dict(
    evaluator=dict(type=NeedleBenchOriginEvaluator),
    pred_postprocessor=dict(type=needlebench_postprocess),
    dataset_postprocessor=dict(type=needlebench_dataset_postprocess),
    pred_role='BOT',
)

context_lengths = [32000, 128000, 256000]
depths_list = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
base_path = 'opencompass/needlebench'
needle_file_name = 'needles.jsonl'

# Define configurations for both English and Chinese datasets
language_configs = [
    {
        'file_list': ['PaulGrahamEssays.jsonl'],
        'dataset_var': 'needlebench_en_datasets',
        'language': 'English',
        'length_buffer': 3000,
        'suffix': 'en'
    },
    {
        'file_list': ['zh_finance.jsonl'],
        'dataset_var': 'needlebench_zh_datasets',
        'language': 'Chinese',
        'length_buffer': 200,
        'suffix': 'zh'
    }
]

# Initialize empty dataset lists
needlebench_en_datasets = []
needlebench_zh_datasets = []

# Single loop to handle both languages
for config in language_configs:
    for original_context_length in context_lengths:
        for depth_percent in depths_list:
            dataset_dict = {
                'abbr': f'Length{original_context_length}'
                f'Depth{int(depth_percent)}_origin_{config["suffix"]}_256k',
                'type': NeedleBenchOriginDataset,
                'path': base_path,
                'length': original_context_length,
                'depth': int(depth_percent),
                'tokenizer_model': 'gpt-4',
                'file_list': config['file_list'],
                'num_repeats_per_file': 10,
                'length_buffer': config['length_buffer'],
                'language': config['language'],
                'needle_file_name': needle_file_name,
                'reader_cfg': needlebench_reader_cfg,
                'infer_cfg': needlebench_infer_cfg,
                'eval_cfg': needlebench_eval_cfg,
            }
            globals()[config['dataset_var']].append(dataset_dict)
@@ -0,0 +1,18 @@
from mmengine.config import read_base

with read_base():
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_32k.needlebench_v2_multi_reasoning_32k import needlebench_2needle_en_datasets as needlebench_multi_2needle_en_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_32k.needlebench_v2_multi_reasoning_32k import needlebench_3needle_en_datasets as needlebench_multi_3needle_en_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_32k.needlebench_v2_multi_reasoning_32k import needlebench_4needle_en_datasets as needlebench_multi_4needle_en_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_32k.needlebench_v2_multi_reasoning_32k import needlebench_5needle_en_datasets as needlebench_multi_5needle_en_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_32k.needlebench_v2_multi_reasoning_32k import needlebench_2needle_zh_datasets as needlebench_multi_2needle_zh_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_32k.needlebench_v2_multi_reasoning_32k import needlebench_3needle_zh_datasets as needlebench_multi_3needle_zh_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_32k.needlebench_v2_multi_reasoning_32k import needlebench_4needle_zh_datasets as needlebench_multi_4needle_zh_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_32k.needlebench_v2_multi_reasoning_32k import needlebench_5needle_zh_datasets as needlebench_multi_5needle_zh_datasets

    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_32k.needlebench_v2_single_32k import needlebench_en_datasets as needlebench_origin_en_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_32k.needlebench_v2_single_32k import needlebench_zh_datasets as needlebench_origin_zh_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_32k.needlebench_v2_multi_retrieval_32k import needlebench_en_datasets as needlebench_parallel_en_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_32k.needlebench_v2_multi_retrieval_32k import needlebench_zh_datasets as needlebench_parallel_zh_datasets

needlebench_datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])
@@ -0,0 +1,93 @@
from opencompass.datasets.needlebench_v2.multi import NeedleBenchMultiDataset
from mmengine.config import read_base

with read_base():
    from .needlebench_v2_single_32k import depths_list, context_lengths
    from .needlebench_v2_single_32k import needlebench_reader_cfg, needlebench_infer_cfg
    from opencompass.configs.datasets.needlebench_v2.atc.atc_0shot_nocot_2_power_en import needlebench_atc_eval_cfg as needlebench_eval_cfg

# ----------English Version----------
base_path = 'opencompass/needlebench'
file_list = ['PaulGrahamEssays.jsonl']
needle_file_name = 'names.json'
diff = 10
language = 'English'
length_buffer = 3000

# Initialize dataset lists
needlebench_2needle_en_datasets = []
needlebench_3needle_en_datasets = []
needlebench_4needle_en_datasets = []
needlebench_5needle_en_datasets = []

# Create datasets for different numbers of needles
for num_needles in range(2, 6):
    dataset_list_name = f'needlebench_{num_needles}needle_en_datasets'

    for original_context_length in context_lengths:
        for depth_percent in depths_list:
            dataset_dict = {
                'abbr': f'Length{original_context_length}'
                f'Depth{int(depth_percent)}_{num_needles}needle_en_32k',
                'type': NeedleBenchMultiDataset,
                'path': base_path,
                'length': original_context_length,
                'depth': int(depth_percent),
                'tokenizer_model': 'gpt-4',
                'file_list': file_list,
                'num_repeats_per_file': 10,
                'length_buffer': length_buffer,
                'language': language,
                'needle_file_name': needle_file_name,
                'num_needles': num_needles,
                'diff': diff,
                'reader_cfg': needlebench_reader_cfg,
                'infer_cfg': needlebench_infer_cfg,
                'eval_cfg': needlebench_eval_cfg,
            }

            # Add to the appropriate list using globals()
            globals()[dataset_list_name].append(dataset_dict)

# ----------Chinese Version----------
base_path = 'opencompass/needlebench'
file_list = ['zh_finance.jsonl']
needle_file_name = 'names.json'
diff = 10
language = 'Chinese'
length_buffer = 200

# Initialize dataset lists
needlebench_2needle_zh_datasets = []
needlebench_3needle_zh_datasets = []
needlebench_4needle_zh_datasets = []
needlebench_5needle_zh_datasets = []

# Create datasets for different numbers of needles
for num_needles in range(2, 6):
    dataset_list_name = f'needlebench_{num_needles}needle_zh_datasets'

    for original_context_length in context_lengths:
        for depth_percent in depths_list:
            dataset_dict = {
                'abbr': f'Length{original_context_length}'
                f'Depth{int(depth_percent)}_{num_needles}needle_zh_32k',
                'type': NeedleBenchMultiDataset,
                'path': base_path,
                'length': original_context_length,
                'depth': int(depth_percent),
                'tokenizer_model': 'gpt-4',
                'file_list': file_list,
                'num_repeats_per_file': 10,
                'length_buffer': length_buffer,
                'language': language,
                'needle_file_name': needle_file_name,
                'num_needles': num_needles,
                'diff': diff,
                'reader_cfg': needlebench_reader_cfg,
                'infer_cfg': needlebench_infer_cfg,
                'eval_cfg': needlebench_eval_cfg,
            }

            # Add to the appropriate list using globals()
            globals()[dataset_list_name].append(dataset_dict)
@@ -0,0 +1,55 @@
from opencompass.datasets.needlebench_v2.parallel import NeedleBenchParallelDataset
from opencompass.datasets.needlebench_v2.parallel import NeedleBenchParallelEvaluator
from mmengine.config import read_base

with read_base():
    from .needlebench_v2_single_32k import depths_list as depths, context_lengths
    from .needlebench_v2_single_32k import needlebench_reader_cfg, needlebench_infer_cfg, needlebench_eval_cfg

needlebench_eval_cfg['evaluator']['type'] = NeedleBenchParallelEvaluator

base_path = 'opencompass/needlebench'
needle_file_name = 'needles.jsonl'

# Define configurations for both English and Chinese datasets
language_configs = [
    {
        'file_list': ['PaulGrahamEssays.jsonl'],
        'dataset_var': 'needlebench_en_datasets',
        'language': 'English',
        'length_buffer': 3000,
        'suffix': 'en'
    },
    {
        'file_list': ['zh_finance.jsonl'],
        'dataset_var': 'needlebench_zh_datasets',
        'language': 'Chinese',
        'length_buffer': 200,
        'suffix': 'zh'
    }
]

# Initialize empty dataset lists
needlebench_en_datasets = []
needlebench_zh_datasets = []

# Single loop to handle both languages
for config in language_configs:
    for original_context_length in context_lengths:
        dataset_dict = {
            'abbr': f'Length{original_context_length}_parallel_{config["suffix"]}_32k',
            'type': NeedleBenchParallelDataset,
            'path': base_path,
            'needle_file_name': needle_file_name,
            'length': original_context_length,
            'depths': depths,
            'tokenizer_model': 'gpt-4',
            'file_list': config['file_list'],
            'num_repeats_per_file': 25,
            'length_buffer': config['length_buffer'],
            'language': config['language'],
            'reader_cfg': needlebench_reader_cfg,
            'infer_cfg': needlebench_infer_cfg,
            'eval_cfg': needlebench_eval_cfg,
        }
        globals()[config['dataset_var']].append(dataset_dict)
@@ -0,0 +1,81 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets.needlebench_v2.origin import NeedleBenchOriginDataset
from opencompass.datasets.needlebench_v2.origin import NeedleBenchOriginEvaluator
from opencompass.datasets.needlebench_v2.origin import needlebench_postprocess
from opencompass.datasets.needlebench_v2.origin import needlebench_dataset_postprocess

needlebench_reader_cfg = dict(input_columns=['prompt'], output_column='answer')

needlebench_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template=dict(
            round=[
                dict(role='HUMAN', prompt='{prompt}'),
                dict(role='BOT', prompt='{answer}\n'),
            ]
        ),
    ),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=GenInferencer),
)

needlebench_eval_cfg = dict(
    evaluator=dict(type=NeedleBenchOriginEvaluator),
    pred_postprocessor=dict(type=needlebench_postprocess),
    dataset_postprocessor=dict(type=needlebench_dataset_postprocess),
    pred_role='BOT',
)

context_lengths = [1000, 4000, 8000, 12000, 16000, 20000, 24000, 28000, 32000]
depths_list = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
base_path = 'opencompass/needlebench'
needle_file_name = 'needles.jsonl'

# Define configurations for both English and Chinese datasets
language_configs = [
    {
        'file_list': ['PaulGrahamEssays.jsonl'],
        'dataset_var': 'needlebench_en_datasets',
        'language': 'English',
        'length_buffer': 3000,
        'suffix': 'en'
    },
    {
        'file_list': ['zh_finance.jsonl'],
        'dataset_var': 'needlebench_zh_datasets',
        'language': 'Chinese',
        'length_buffer': 200,
        'suffix': 'zh'
    }
]

# Initialize empty dataset lists
needlebench_en_datasets = []
needlebench_zh_datasets = []

# Single loop to handle both languages
for config in language_configs:
    for original_context_length in context_lengths:
        for depth_percent in depths_list:
            dataset_dict = {
                'abbr': f'Length{original_context_length}'
                f'Depth{int(depth_percent)}_origin_{config["suffix"]}_32k',
                'type': NeedleBenchOriginDataset,
                'path': base_path,
                'length': original_context_length,
                'depth': int(depth_percent),
                'tokenizer_model': 'gpt-4',
                'file_list': config['file_list'],
                'num_repeats_per_file': 10,
                'length_buffer': config['length_buffer'],
                'language': config['language'],
                'needle_file_name': needle_file_name,
                'reader_cfg': needlebench_reader_cfg,
                'infer_cfg': needlebench_infer_cfg,
                'eval_cfg': needlebench_eval_cfg,
            }
            globals()[config['dataset_var']].append(dataset_dict)
@@ -0,0 +1,18 @@
from mmengine.config import read_base

with read_base():
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_4k.needlebench_v2_multi_reasoning_4k import needlebench_2needle_en_datasets as needlebench_multi_2needle_en_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_4k.needlebench_v2_multi_reasoning_4k import needlebench_3needle_en_datasets as needlebench_multi_3needle_en_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_4k.needlebench_v2_multi_reasoning_4k import needlebench_4needle_en_datasets as needlebench_multi_4needle_en_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_4k.needlebench_v2_multi_reasoning_4k import needlebench_5needle_en_datasets as needlebench_multi_5needle_en_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_4k.needlebench_v2_multi_reasoning_4k import needlebench_2needle_zh_datasets as needlebench_multi_2needle_zh_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_4k.needlebench_v2_multi_reasoning_4k import needlebench_3needle_zh_datasets as needlebench_multi_3needle_zh_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_4k.needlebench_v2_multi_reasoning_4k import needlebench_4needle_zh_datasets as needlebench_multi_4needle_zh_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_4k.needlebench_v2_multi_reasoning_4k import needlebench_5needle_zh_datasets as needlebench_multi_5needle_zh_datasets

    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_4k.needlebench_v2_single_4k import needlebench_en_datasets as needlebench_origin_en_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_4k.needlebench_v2_single_4k import needlebench_zh_datasets as needlebench_origin_zh_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_4k.needlebench_v2_multi_retrieval_4k import needlebench_en_datasets as needlebench_parallel_en_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_4k.needlebench_v2_multi_retrieval_4k import needlebench_zh_datasets as needlebench_parallel_zh_datasets

needlebench_datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])
@@ -0,0 +1,93 @@
from opencompass.datasets.needlebench_v2.multi import NeedleBenchMultiDataset
from mmengine.config import read_base

with read_base():
    from .needlebench_v2_single_4k import depths_list, context_lengths
    from .needlebench_v2_single_4k import needlebench_reader_cfg, needlebench_infer_cfg
    from opencompass.configs.datasets.needlebench_v2.atc.atc_0shot_nocot_2_power_en import needlebench_atc_eval_cfg as needlebench_eval_cfg

# ----------English Version----------
base_path = 'opencompass/needlebench'
file_list = ['PaulGrahamEssays.jsonl']
needle_file_name = 'names.json'
diff = 10
language = 'English'
length_buffer = 500

# Initialize dataset lists
needlebench_2needle_en_datasets = []
needlebench_3needle_en_datasets = []
needlebench_4needle_en_datasets = []
needlebench_5needle_en_datasets = []

# Create datasets for different numbers of needles
for num_needles in range(2, 6):
    dataset_list_name = f'needlebench_{num_needles}needle_en_datasets'

    for original_context_length in context_lengths:
        for depth_percent in depths_list:
            dataset_dict = {
                'abbr': f'Length{original_context_length}'
                f'Depth{int(depth_percent)}_{num_needles}needle_en_4k',
                'type': NeedleBenchMultiDataset,
                'path': base_path,
                'length': original_context_length,
                'depth': int(depth_percent),
                'tokenizer_model': 'gpt-4',
                'file_list': file_list,
                'num_repeats_per_file': 10,
                'length_buffer': length_buffer,
                'language': language,
                'needle_file_name': needle_file_name,
                'num_needles': num_needles,
                'diff': diff,
                'reader_cfg': needlebench_reader_cfg,
                'infer_cfg': needlebench_infer_cfg,
                'eval_cfg': needlebench_eval_cfg,
            }

            # Add to the appropriate list using globals()
            globals()[dataset_list_name].append(dataset_dict)

# ----------Chinese Version----------
base_path = 'opencompass/needlebench'
file_list = ['zh_finance.jsonl']
needle_file_name = 'names.json'
diff = 10
language = 'Chinese'
length_buffer = 200

# Initialize dataset lists
needlebench_2needle_zh_datasets = []
needlebench_3needle_zh_datasets = []
needlebench_4needle_zh_datasets = []
needlebench_5needle_zh_datasets = []

# Create datasets for different numbers of needles
for num_needles in range(2, 6):
    dataset_list_name = f'needlebench_{num_needles}needle_zh_datasets'

    for original_context_length in context_lengths:
        for depth_percent in depths_list:
            dataset_dict = {
                'abbr': f'Length{original_context_length}'
                f'Depth{int(depth_percent)}_{num_needles}needle_zh_4k',
                'type': NeedleBenchMultiDataset,
                'path': base_path,
                'length': original_context_length,
                'depth': int(depth_percent),
                'tokenizer_model': 'gpt-4',
                'file_list': file_list,
                'num_repeats_per_file': 10,
                'length_buffer': length_buffer,
                'language': language,
                'needle_file_name': needle_file_name,
                'num_needles': num_needles,
                'diff': diff,
                'reader_cfg': needlebench_reader_cfg,
                'infer_cfg': needlebench_infer_cfg,
                'eval_cfg': needlebench_eval_cfg,
            }

            # Add to the appropriate list using globals()
            globals()[dataset_list_name].append(dataset_dict)
@@ -0,0 +1,55 @@
from opencompass.datasets.needlebench_v2.parallel import NeedleBenchParallelDataset
from opencompass.datasets.needlebench_v2.parallel import NeedleBenchParallelEvaluator
from mmengine.config import read_base

with read_base():
    from .needlebench_v2_single_4k import depths_list as depths, context_lengths
    from .needlebench_v2_single_4k import needlebench_reader_cfg, needlebench_infer_cfg, needlebench_eval_cfg

needlebench_eval_cfg['evaluator']['type'] = NeedleBenchParallelEvaluator

base_path = 'opencompass/needlebench'
needle_file_name = 'needles.jsonl'

# Define configurations for both English and Chinese datasets
language_configs = [
    {
        'file_list': ['PaulGrahamEssays.jsonl'],
        'dataset_var': 'needlebench_en_datasets',
        'language': 'English',
        'length_buffer': 500,
        'suffix': 'en'
    },
    {
        'file_list': ['zh_finance.jsonl'],
        'dataset_var': 'needlebench_zh_datasets',
        'language': 'Chinese',
        'length_buffer': 200,
        'suffix': 'zh'
    }
]

# Initialize empty dataset lists
needlebench_en_datasets = []
needlebench_zh_datasets = []

# Single loop to handle both languages
for config in language_configs:
    for original_context_length in context_lengths:
        dataset_dict = {
            'abbr': f'Length{original_context_length}_parallel_{config["suffix"]}_4k',
            'type': NeedleBenchParallelDataset,
            'path': base_path,
            'needle_file_name': needle_file_name,
            'length': original_context_length,
            'depths': depths,
            'tokenizer_model': 'gpt-4',
            'file_list': config['file_list'],
            'num_repeats_per_file': 25,
            'length_buffer': config['length_buffer'],
            'language': config['language'],
            'reader_cfg': needlebench_reader_cfg,
            'infer_cfg': needlebench_infer_cfg,
            'eval_cfg': needlebench_eval_cfg,
        }
        globals()[config['dataset_var']].append(dataset_dict)
@@ -0,0 +1,81 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets.needlebench_v2.origin import NeedleBenchOriginDataset
from opencompass.datasets.needlebench_v2.origin import NeedleBenchOriginEvaluator
from opencompass.datasets.needlebench_v2.origin import needlebench_postprocess
from opencompass.datasets.needlebench_v2.origin import needlebench_dataset_postprocess

needlebench_reader_cfg = dict(input_columns=['prompt'], output_column='answer')

needlebench_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template=dict(
            round=[
                dict(role='HUMAN', prompt='{prompt}'),
                dict(role='BOT', prompt='{answer}\n'),
            ]
        ),
    ),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=GenInferencer),
)

needlebench_eval_cfg = dict(
    evaluator=dict(type=NeedleBenchOriginEvaluator),
    pred_postprocessor=dict(type=needlebench_postprocess),
    dataset_postprocessor=dict(type=needlebench_dataset_postprocess),
    pred_role='BOT',
)

context_lengths = [1000, 2000, 3000, 4000]
depths_list = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
base_path = 'opencompass/needlebench'
needle_file_name = 'needles.jsonl'

# Define configurations for both English and Chinese datasets
language_configs = [
    {
        'file_list': ['PaulGrahamEssays.jsonl'],
        'dataset_var': 'needlebench_en_datasets',
        'language': 'English',
        'length_buffer': 500,
        'suffix': 'en'
    },
    {
        'file_list': ['zh_finance.jsonl'],
        'dataset_var': 'needlebench_zh_datasets',
        'language': 'Chinese',
        'length_buffer': 200,
        'suffix': 'zh'
    }
]

# Initialize empty dataset lists
needlebench_en_datasets = []
needlebench_zh_datasets = []

# Single loop to handle both languages
for config in language_configs:
    for original_context_length in context_lengths:
        for depth_percent in depths_list:
            dataset_dict = {
                'abbr': f'Length{original_context_length}'
                f'Depth{int(depth_percent)}_origin_{config["suffix"]}_4k',
                'type': NeedleBenchOriginDataset,
                'path': base_path,
                'length': original_context_length,
                'depth': int(depth_percent),
                'tokenizer_model': 'gpt-4',
                'file_list': config['file_list'],
                'num_repeats_per_file': 10,
                'length_buffer': config['length_buffer'],
                'language': config['language'],
                'needle_file_name': needle_file_name,
                'reader_cfg': needlebench_reader_cfg,
                'infer_cfg': needlebench_infer_cfg,
                'eval_cfg': needlebench_eval_cfg,
            }
            globals()[config['dataset_var']].append(dataset_dict)
@@ -0,0 +1,18 @@
from mmengine.config import read_base

with read_base():
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_8k.needlebench_v2_multi_reasoning_8k import needlebench_2needle_en_datasets as needlebench_multi_2needle_en_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_8k.needlebench_v2_multi_reasoning_8k import needlebench_3needle_en_datasets as needlebench_multi_3needle_en_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_8k.needlebench_v2_multi_reasoning_8k import needlebench_4needle_en_datasets as needlebench_multi_4needle_en_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_8k.needlebench_v2_multi_reasoning_8k import needlebench_5needle_en_datasets as needlebench_multi_5needle_en_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_8k.needlebench_v2_multi_reasoning_8k import needlebench_2needle_zh_datasets as needlebench_multi_2needle_zh_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_8k.needlebench_v2_multi_reasoning_8k import needlebench_3needle_zh_datasets as needlebench_multi_3needle_zh_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_8k.needlebench_v2_multi_reasoning_8k import needlebench_4needle_zh_datasets as needlebench_multi_4needle_zh_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_8k.needlebench_v2_multi_reasoning_8k import needlebench_5needle_zh_datasets as needlebench_multi_5needle_zh_datasets

    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_8k.needlebench_v2_single_8k import needlebench_en_datasets as needlebench_origin_en_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_8k.needlebench_v2_single_8k import needlebench_zh_datasets as needlebench_origin_zh_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_8k.needlebench_v2_multi_retrieval_8k import needlebench_en_datasets as needlebench_parallel_en_datasets
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_8k.needlebench_v2_multi_retrieval_8k import needlebench_zh_datasets as needlebench_parallel_zh_datasets

needlebench_datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])
@@ -0,0 +1,93 @@
from opencompass.datasets.needlebench_v2.multi import NeedleBenchMultiDataset
from mmengine.config import read_base

with read_base():
    from .needlebench_v2_single_8k import depths_list, context_lengths
    from .needlebench_v2_single_8k import needlebench_reader_cfg, needlebench_infer_cfg
    from opencompass.configs.datasets.needlebench_v2.atc.atc_0shot_nocot_2_power_en import needlebench_atc_eval_cfg as needlebench_eval_cfg

# ----------English Version----------
base_path = 'opencompass/needlebench'
file_list = ['PaulGrahamEssays.jsonl']
needle_file_name = 'names.json'
diff = 10
language = 'English'
length_buffer = 500

# Initialize dataset lists
needlebench_2needle_en_datasets = []
needlebench_3needle_en_datasets = []
needlebench_4needle_en_datasets = []
needlebench_5needle_en_datasets = []

# Create datasets for different numbers of needles
for num_needles in range(2, 6):
    dataset_list_name = f'needlebench_{num_needles}needle_en_datasets'

    for original_context_length in context_lengths:
        for depth_percent in depths_list:
            dataset_dict = {
                'abbr': f'Length{original_context_length}'
                f'Depth{int(depth_percent)}_{num_needles}needle_en_8k',
                'type': NeedleBenchMultiDataset,
                'path': base_path,
                'length': original_context_length,
                'depth': int(depth_percent),
                'tokenizer_model': 'gpt-4',
                'file_list': file_list,
                'num_repeats_per_file': 10,
                'length_buffer': length_buffer,
                'language': language,
                'needle_file_name': needle_file_name,
                'num_needles': num_needles,
                'diff': diff,
                'reader_cfg': needlebench_reader_cfg,
                'infer_cfg': needlebench_infer_cfg,
                'eval_cfg': needlebench_eval_cfg,
            }

            # Add to the appropriate list using globals()
            globals()[dataset_list_name].append(dataset_dict)

# ----------Chinese Version----------
base_path = 'opencompass/needlebench'
file_list = ['zh_finance.jsonl']
needle_file_name = 'names.json'
diff = 10
language = 'Chinese'
length_buffer = 200

# Initialize dataset lists
needlebench_2needle_zh_datasets = []
needlebench_3needle_zh_datasets = []
needlebench_4needle_zh_datasets = []
needlebench_5needle_zh_datasets = []

# Create datasets for different numbers of needles
for num_needles in range(2, 6):
    dataset_list_name = f'needlebench_{num_needles}needle_zh_datasets'

    for original_context_length in context_lengths:
        for depth_percent in depths_list:
            dataset_dict = {
                'abbr': f'Length{original_context_length}'
                f'Depth{int(depth_percent)}_{num_needles}needle_zh_8k',
                'type': NeedleBenchMultiDataset,
                'path': base_path,
                'length': original_context_length,
                'depth': int(depth_percent),
                'tokenizer_model': 'gpt-4',
                'file_list': file_list,
                'num_repeats_per_file': 10,
                'length_buffer': length_buffer,
                'language': language,
                'needle_file_name': needle_file_name,
                'num_needles': num_needles,
                'diff': diff,
                'reader_cfg': needlebench_reader_cfg,
                'infer_cfg': needlebench_infer_cfg,
                'eval_cfg': needlebench_eval_cfg,
            }

            # Add to the appropriate list using globals()
            globals()[dataset_list_name].append(dataset_dict)
@@ -0,0 +1,55 @@
from opencompass.datasets.needlebench_v2.parallel import NeedleBenchParallelDataset
from opencompass.datasets.needlebench_v2.parallel import NeedleBenchParallelEvaluator
from mmengine.config import read_base

with read_base():
    from .needlebench_v2_single_8k import depths_list as depths, context_lengths
    from .needlebench_v2_single_8k import needlebench_reader_cfg, needlebench_infer_cfg, needlebench_eval_cfg

needlebench_eval_cfg['evaluator']['type'] = NeedleBenchParallelEvaluator

base_path = 'opencompass/needlebench'
needle_file_name = 'needles.jsonl'

# Define configurations for both English and Chinese datasets
language_configs = [
    {
        'file_list': ['PaulGrahamEssays.jsonl'],
        'dataset_var': 'needlebench_en_datasets',
        'language': 'English',
        'length_buffer': 500,
        'suffix': 'en'
    },
    {
        'file_list': ['zh_finance.jsonl'],
        'dataset_var': 'needlebench_zh_datasets',
        'language': 'Chinese',
        'length_buffer': 200,
        'suffix': 'zh'
    }
]

# Initialize empty dataset lists
needlebench_en_datasets = []
needlebench_zh_datasets = []

# Single loop to handle both languages
for config in language_configs:
    for original_context_length in context_lengths:
        dataset_dict = {
            'abbr': f'Length{original_context_length}_parallel_{config["suffix"]}_8k',
            'type': NeedleBenchParallelDataset,
            'path': base_path,
            'needle_file_name': needle_file_name,
            'length': original_context_length,
            'depths': depths,
            'tokenizer_model': 'gpt-4',
            'file_list': config['file_list'],
            'num_repeats_per_file': 25,
            'length_buffer': config['length_buffer'],
            'language': config['language'],
            'reader_cfg': needlebench_reader_cfg,
            'infer_cfg': needlebench_infer_cfg,
            'eval_cfg': needlebench_eval_cfg,
        }
        globals()[config['dataset_var']].append(dataset_dict)
@@ -0,0 +1,122 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets.needlebench_v2.parallel import NeedleBenchParallelDataset
from opencompass.datasets.needlebench_v2.parallel import NeedleBenchParallelEvaluator
from opencompass.datasets.needlebench_v2.origin import needlebench_postprocess
from opencompass.datasets.needlebench_v2.origin import needlebench_dataset_postprocess
import math


def logistic(x, L=100, x0=50, k=0.1):
    return round(L / (1 + math.exp(-k * (x - x0))), 3)


def generate_linear_space(start, end, num):
    if num == 1:
        return [start]
    elif num < 1:
        raise ValueError('num must be at least 1.')
    step = (end - start) / (num - 1)
    return [start + step * i for i in range(num)]


def generate_depth_percents(intervals, interval_type):
    if interval_type == 'linear':
        return generate_linear_space(0, 100, intervals)
    elif interval_type == 'sigmoid':
        linear_space = generate_linear_space(0, 100, intervals)
        return [logistic(x) for x in linear_space]
    else:
        raise ValueError('Unsupported interval type')


needlebench_reader_cfg = dict(input_columns=['prompt'], output_column='answer')

needlebench_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template=dict(
            round=[
                dict(role='HUMAN', prompt='{prompt}'),
                dict(role='BOT', prompt='{answer}\n'),
            ]
        ),
    ),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=GenInferencer),
)

needlebench_eval_cfg = dict(
    evaluator=dict(type=NeedleBenchParallelEvaluator),
    pred_postprocessor=dict(type=needlebench_postprocess),
    dataset_postprocessor=dict(type=needlebench_dataset_postprocess),
    pred_role='BOT',
)

context_lengths = list(range(5000, 9000, 1000))
document_depth_percent_intervals_list = [1, 5, 10, 15, 20]
document_depth_percent_interval_type = 'linear'

base_path = 'opencompass/needlebench'
file_list = ['PaulGrahamEssays.jsonl']
needlebench_en_datasets = []
needle_file_name = 'needles.jsonl'

for document_depth_percent_intervals in document_depth_percent_intervals_list:
    depths_float = generate_depth_percents(
        document_depth_percent_intervals, document_depth_percent_interval_type
    )
    depths = [int(depth) for depth in depths_float]

    for original_context_length in context_lengths:
        dataset_dict = {
            'abbr': f'Length{original_context_length}'
            f'_parallel_en_8k_batch{document_depth_percent_intervals}',
            'type': NeedleBenchParallelDataset,
            'path': base_path,
            'needle_file_name': needle_file_name,
            'length': original_context_length,
            'depths': depths,
            'tokenizer_model': 'gpt-4',
            'file_list': file_list,
            'num_repeats_per_file': 50,
            'length_buffer': 1300,
            'guide': True,
            'language': 'English',
            'reader_cfg': needlebench_reader_cfg,
            'infer_cfg': needlebench_infer_cfg,
            'eval_cfg': needlebench_eval_cfg,
        }
        needlebench_en_datasets.append(dataset_dict)

file_list = ['zh_finance.jsonl']
needlebench_zh_datasets = []
needle_file_name = 'needles.jsonl'

for document_depth_percent_intervals in document_depth_percent_intervals_list:
    depths_float = generate_depth_percents(
        document_depth_percent_intervals, document_depth_percent_interval_type
    )
    depths = [int(depth) for depth in depths_float]

    for original_context_length in context_lengths:
        dataset_dict = {
            'abbr': f'Length{original_context_length}'
            f'_parallel_zh_8k_batch{document_depth_percent_intervals}',
            'type': NeedleBenchParallelDataset,
            'path': base_path,
            'needle_file_name': needle_file_name,
            'length': original_context_length,
            'depths': depths,
            'tokenizer_model': 'gpt-4',
            'file_list': file_list,
            'num_repeats_per_file': 50,
            'length_buffer': 200,
            'guide': True,
            'language': 'Chinese',
            'reader_cfg': needlebench_reader_cfg,
            'infer_cfg': needlebench_infer_cfg,
            'eval_cfg': needlebench_eval_cfg,
        }
        needlebench_zh_datasets.append(dataset_dict)
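To make the two depth-spacing schemes above concrete: with `intervals=5`, linear spacing samples evenly spaced depths, while sigmoid spacing (the logistic map with its defaults `L=100, x0=50, k=0.1`) pushes the sampled depths toward the two ends of the document. The values below follow directly from the functions in this file:

```python
# Worked example of the two spacing schemes defined above (5 intervals):
generate_depth_percents(5, 'linear')   # -> [0.0, 25.0, 50.0, 75.0, 100.0]
generate_depth_percents(5, 'sigmoid')  # -> [0.669, 7.586, 50.0, 92.414, 99.331]
```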
@ -0,0 +1,81 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets.needlebench_v2.origin import NeedleBenchOriginDataset
from opencompass.datasets.needlebench_v2.origin import NeedleBenchOriginEvaluator
from opencompass.datasets.needlebench_v2.origin import needlebench_postprocess
from opencompass.datasets.needlebench_v2.origin import needlebench_dataset_postprocess


needlebench_reader_cfg = dict(input_columns=['prompt'], output_column='answer')

needlebench_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template=dict(
            round=[
                dict(role='HUMAN', prompt='{prompt}'),
                dict(role='BOT', prompt='{answer}\n'),
            ]
        ),
    ),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=GenInferencer),
)

needlebench_eval_cfg = dict(
    evaluator=dict(type=NeedleBenchOriginEvaluator),
    pred_postprocessor=dict(type=needlebench_postprocess),
    dataset_postprocessor=dict(type=needlebench_dataset_postprocess),
    pred_role='BOT',
)

context_lengths = [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000]
depths_list = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
base_path = 'opencompass/needlebench'
needle_file_name = 'needles.jsonl'

# Define configurations for both English and Chinese datasets
language_configs = [
    {
        'file_list': ['PaulGrahamEssays.jsonl'],
        'dataset_var': 'needlebench_en_datasets',
        'language': 'English',
        'length_buffer': 500,
        'suffix': 'en'
    },
    {
        'file_list': ['zh_finance.jsonl'],
        'dataset_var': 'needlebench_zh_datasets',
        'language': 'Chinese',
        'length_buffer': 200,
        'suffix': 'zh'
    }
]

# Initialize empty dataset lists
needlebench_en_datasets = []
needlebench_zh_datasets = []

# Single loop to handle both languages
for config in language_configs:
    for original_context_length in context_lengths:
        for depth_percent in depths_list:
            dataset_dict = {
                'abbr': f'Length{original_context_length}'
                f'Depth{int(depth_percent)}_origin_{config["suffix"]}_8k',
                'type': NeedleBenchOriginDataset,
                'path': base_path,
                'length': original_context_length,
                'depth': int(depth_percent),
                'tokenizer_model': 'gpt-4',
                'file_list': config['file_list'],
                'num_repeats_per_file': 10,
                'length_buffer': config['length_buffer'],
                'language': config['language'],
                'needle_file_name': needle_file_name,
                'reader_cfg': needlebench_reader_cfg,
                'infer_cfg': needlebench_infer_cfg,
                'eval_cfg': needlebench_eval_cfg,
            }
            globals()[config['dataset_var']].append(dataset_dict)
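As a quick sanity check, the abbreviation scheme the loop above emits for one illustrative combination of values:

```python
# e.g. original_context_length=1000, depth_percent=50, suffix='en'
abbr = f'Length{1000}Depth{int(50)}_origin_en_8k'
assert abbr == 'Length1000Depth50_origin_en_8k'
```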
69 opencompass/configs/datasets/needlebench_v2/readme.md (new file)
@@ -0,0 +1,69 @@
# NeedleBench V2: An Enhanced Benchmark for Needle-In-A-Haystack Evaluations

English | [简体中文](readme_zh-CN.md)

## Overview

NeedleBench V2 is an improved benchmark that rigorously assesses the information retrieval and reasoning capabilities of large language models (LLMs) in long-context scenarios. Building on the original NeedleBench, this version introduces significant enhancements to provide more accurate and unbiased evaluations of LLMs' abilities to locate and reason over critical information in extensive texts.

### Directory Structure

```
configs/datasets/needlebench_v2/
├── atc
├── needlebench_v2_4k
├── needlebench_v2_8k
├── needlebench_v2_32k
├── needlebench_v2_128k
├── needlebench_v2_200k
├── needlebench_v2_256k
├── needlebench_v2_1000k
├── readme.md
└── readme_zh-CN.md
```

Within each configuration directory (e.g., `needlebench_v2_4k`), there are configuration files tailored for testing within that specific length setting.

## Task Descriptions and Length Configurations

NeedleBench V2 offers tasks in various length configurations (4k, 8k, 32k, 128k, 200k, 256k, 1000k) to accommodate different scales of language model evaluation needs. Each length configuration provides specialized test scripts for the following tasks:

### Single-Needle Retrieval

The Single-Needle Retrieval task evaluates LLMs' ability to recall a single piece of crucial information from a haystack text of a specific length. This task assesses the model's precision in identifying and recalling specific information from extended texts.

### Multi-Needle Retrieval

The Multi-Needle Retrieval task challenges LLMs' ability to identify and extract multiple key information points from extensive texts. It simulates real-world scenarios where multiple data points, facts, or figures need to be retrieved from documents or reports, evaluating the model's efficiency in navigating and extracting relevant information from dense texts.

### Multi-Needle Reasoning

In NeedleBench V2, the Multi-Needle Reasoning task has been significantly improved. The original needles based on the R4C/MultiHop dataset have been replaced with fictional information similar to that used in the Ancestral Trace Challenge. This change addresses potential bias from innate knowledge, as the original dataset may have been included in some models' training data. The task continues to evaluate LLMs' capacity for complex reasoning with retrieved information, requiring models not only to recall multiple pieces of information but also to reason over them logically.

### Ancestral Trace Challenge (ATC)

The Ancestral Trace Challenge has been refined in NeedleBench V2. The needle distribution has changed from a dense form (1, 2, 3, 4, 5 needles) to a sparse form based on powers of 2 (2¹, 2², 2³, etc.), as sketched below. This task remains NeedleBench's most complex, requiring models to recall and analyze every detail in long texts to solve problems that demand an understanding of complex relationships, such as genealogical inquiries or detailed case analysis.
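For concreteness, a minimal sketch of the two distribution patterns (the upper exponent below is an illustrative assumption, not a fixed bound of the benchmark):

```python
# Dense needle counts used by the original ATC task.
dense_needle_counts = [1, 2, 3, 4, 5]

# Sparse, powers-of-2 needle counts used in NeedleBench V2; the cap of
# 2**6 here is only an assumed example.
sparse_needle_counts = [2 ** k for k in range(1, 7)]  # [2, 4, 8, 16, 32, 64]
```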
## Scoring Methodology

NeedleBench V2 introduces a more balanced scoring system. The overall score is now calculated as a simple average of the three main tasks (Single-Needle Retrieval, Multi-Needle Retrieval, and Multi-Needle Reasoning), with each task receiving equal weight. This change from the previous weighted-average approach provides a more straightforward and equitable assessment of model capabilities across the different retrieval and reasoning tasks.
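To illustrate the change, a minimal sketch contrasting the two aggregation schemes (the task scores are hypothetical; the 0.4/0.3/0.3 weights are the ones used by the original summarizer, as shown in the summarizer diff below):

```python
scores = {'S-RT': 90.0, 'M-RS': 70.0, 'M-RT': 80.0}  # hypothetical task scores

# Original NeedleBench: weighted average (0.4 / 0.3 / 0.3).
overall_v1 = 0.4 * scores['S-RT'] + 0.3 * scores['M-RS'] + 0.3 * scores['M-RT']

# NeedleBench V2: simple average, each task weighted equally (1/3 each).
overall_v2 = sum(scores.values()) / len(scores)
```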
## Prompt Enhancements

All prompts in NeedleBench V2 have been refined for greater clarity and effectiveness, with particular attention to the ATC experiment prompts. The configuration structure has also been streamlined for easier use and interpretation.

## Citation

If you use NeedleBench V2 in your research, please cite:

```bibtex
@misc{li2025needlebenchllmsretrievalreasoning,
      title={NeedleBench: Can LLMs Do Retrieval and Reasoning in Information-Dense Context?},
      author={Mo Li and Songyang Zhang and Taolin Zhang and Haodong Duan and Yunxin Liu and Kai Chen},
      year={2025},
      eprint={2407.11963},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2407.11963},
}
```
69 opencompass/configs/datasets/needlebench_v2/readme_zh-CN.md (new file)
@@ -0,0 +1,69 @@
# NeedleBench V2:改进版大海捞针测试评估基准

[English](readme.md) | 简体中文

## 概览

NeedleBench V2是一个改进版基准测试,旨在严格评估大型语言模型(LLMs)在长文本场景中的信息检索和推理能力。在原有NeedleBench的基础上,这个版本引入了重要的增强功能,为LLMs在海量文本中定位和推理关键信息的能力提供更准确、更公正的评估。

### 目录结构

```
configs/datasets/needlebench_v2/
├── atc
├── needlebench_v2_4k
├── needlebench_v2_8k
├── needlebench_v2_32k
├── needlebench_v2_128k
├── needlebench_v2_200k
├── needlebench_v2_256k
├── needlebench_v2_1000k
├── readme.md
└── readme_zh-CN.md
```

在每个长度配置目录下(如 `needlebench_v2_4k`),包含了专门针对该长度设置的测试任务配置文件。

## 任务描述与长度配置

NeedleBench V2提供了不同长度配置的任务(4k、8k、32k、128k、200k、256k、1000k),以适应不同规模的语言模型评估需求。每种长度配置针对以下任务提供了专门的测试脚本:

### 单针信息检索

单针信息检索任务评估LLMs从特定长度的无关信息文本中回忆单个重要信息的能力。这个任务评估模型在长文本中识别和回忆特定信息的精确性。

### 多针信息检索

多针信息检索任务挑战LLMs识别和提取广泛文本中的多个关键信息点的能力。它模拟了现实世界中需要从文档或报告中检索多个数据点、事实或数字的场景,以评估模型在浏览密集文本并从中提取相关信息时的效率。

### 多针信息推理

在NeedleBench V2中,多针信息推理任务得到了显著改进。原来基于R4C/MultiHop数据集的"针"已被替换为类似于祖源追溯挑战中的虚构信息。这一改变解决了潜在的内生知识偏差问题,因为原始数据集可能已被包含在一些模型的训练数据中。这个任务继续评估LLMs使用检索到的信息进行复杂推理的能力,要求模型不仅能回忆多个信息点,还能进行逻辑推理。

### 祖源追溯挑战 (ATC)

祖源追溯挑战在NeedleBench V2中进行了优化。针的分布模式从密集形式(1、2、3、4、5针)变为基于2的幂次的稀疏形式(2¹、2²、2³等)。这个任务仍然是NeedleBench中最复杂的任务,要求模型回忆和分析长文本中的每个细节,以解决需要理解复杂关系的问题,如家谱查询或详细案例分析。

## 评分方法

NeedleBench V2引入了更平衡的评分系统。总体评分现在通过三个主要任务(单针信息检索、多针信息检索和多针信息推理)的简单平均值计算得出,每个任务获得相等的权重。相比先前的加权平均方法,这一改变为评估模型在不同检索和推理任务中的能力提供了更直接、更公平的方式。

## 提示增强

NeedleBench V2中的所有提示都经过了改进,以提高清晰度和有效性,特别关注了ATC实验的提示。配置结构也进行了精简,使其更易于使用和理解。

## 引用

如果您在研究中使用NeedleBench V2,请引用:

```bibtex
@misc{li2025needlebenchllmsretrievalreasoning,
      title={NeedleBench: Can LLMs Do Retrieval and Reasoning in Information-Dense Context?},
      author={Mo Li and Songyang Zhang and Taolin Zhang and Haodong Duan and Yunxin Liu and Kai Chen},
      year={2025},
      eprint={2407.11963},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2407.11963},
}
```
@@ -1,4 +1,5 @@
-from opencompass.summarizers.needlebench import NeedleBenchSummarizer
+from opencompass.summarizers.needlebench import NeedleBenchSummarizer, NeedleBenchSummarizerV2


 def create_m_rs_names_list(context_lengths, depths, needle_counts,
@@ -30,7 +31,7 @@ def create_m_rs_names_list(context_lengths, depths, needle_counts,
     return names_dict


 def create_summarizer(context_lengths, depths, dataset_size,
-                      sparse_depths=None):
+                      sparse_depths=None, mean=False):
     needle_counts = ['2', '3', '4', '5']
     languages = ['en', 'zh']
     if sparse_depths:
@@ -81,17 +82,26 @@ def create_summarizer(context_lengths, depths, dataset_size,
     summary_groups = [
         {'name': key, 'subsets': value} for key, value in names_dict.items()
     ]

-    summary_groups.append({
-        'name': f'NeedleBench-Overall-Score-{dataset_size.upper()}',
-        'subsets': [[f'Single-Needle-Retrieval(S-RT)-{dataset_size.upper()}', 'naive_average'],
-                    [f'Multi-Needle-Reasoning(M-RS)-{dataset_size.upper()}', 'naive_average'],
-                    [f'Multi-Needle-Retrieval(M-RT)-{dataset_size.upper()}', 'average_score']],
-        'weights': {f'Single-Needle-Retrieval(S-RT)-{dataset_size.upper()}': 0.4,
-                    f'Multi-Needle-Reasoning(M-RS)-{dataset_size.upper()}': 0.3,
-                    f'Multi-Needle-Retrieval(M-RT)-{dataset_size.upper()}': 0.3}})
+    if mean:
+        summary_groups.append({
+            'name': f'NeedleBench-Overall-Score-{dataset_size.upper()}',
+            'subsets': [[f'Single-Needle-Retrieval(S-RT)-{dataset_size.upper()}', 'naive_average'],
+                        [f'Multi-Needle-Reasoning(M-RS)-{dataset_size.upper()}', 'naive_average'],
+                        [f'Multi-Needle-Retrieval(M-RT)-{dataset_size.upper()}', 'average_score']],
+            'weights': {f'Single-Needle-Retrieval(S-RT)-{dataset_size.upper()}': 1/3,
+                        f'Multi-Needle-Reasoning(M-RS)-{dataset_size.upper()}': 1/3,
+                        f'Multi-Needle-Retrieval(M-RT)-{dataset_size.upper()}': 1/3}})
+    else:
+        summary_groups.append({
+            'name': f'NeedleBench-Overall-Score-{dataset_size.upper()}',
+            'subsets': [[f'Single-Needle-Retrieval(S-RT)-{dataset_size.upper()}', 'naive_average'],
+                        [f'Multi-Needle-Reasoning(M-RS)-{dataset_size.upper()}', 'naive_average'],
+                        [f'Multi-Needle-Retrieval(M-RT)-{dataset_size.upper()}', 'average_score']],
+            'weights': {f'Single-Needle-Retrieval(S-RT)-{dataset_size.upper()}': 0.4,
+                        f'Multi-Needle-Reasoning(M-RS)-{dataset_size.upper()}': 0.3,
+                        f'Multi-Needle-Retrieval(M-RT)-{dataset_size.upper()}': 0.3}})
     summarizer_config = {
-        'type': NeedleBenchSummarizer,
+        'type': NeedleBenchSummarizerV2 if mean else NeedleBenchSummarizer,
         'summary_groups': summary_groups,
         'dataset_abbrs': [
             f'NeedleBench-Overall-Score-{dataset_size.upper()}',
@@ -143,177 +153,20 @@ needlebench_internal_32k_summarizer = create_summarizer([32000], depths_list_internal, '32000')
 needlebench_internal_100k_summarizer = create_summarizer([100000], depths_list_internal, '100000')
 needlebench_internal_200k_summarizer = create_summarizer([200000], depths_list_internal, '200000')

-_needlebench_8k_parallel_en_batch1 = []
-_needlebench_8k_parallel_en_batch5 = []
-_needlebench_8k_parallel_en_batch10 = []
-_needlebench_8k_parallel_en_batch15 = []
-_needlebench_8k_parallel_en_batch20 = []
-_needlebench_8k_parallel_zh_batch1 = []
-_needlebench_8k_parallel_zh_batch5 = []
-_needlebench_8k_parallel_zh_batch10 = []
-_needlebench_8k_parallel_zh_batch15 = []
-_needlebench_8k_parallel_zh_batch20 = []
-for original_context_length in context_lengths_8k:
-    _needlebench_8k_parallel_en_batch1.append(f'Length{original_context_length}_parallel_en_8k_batch1')
-    _needlebench_8k_parallel_en_batch5.append(f'Length{original_context_length}_parallel_en_8k_batch5')
-    _needlebench_8k_parallel_en_batch10.append(f'Length{original_context_length}_parallel_en_8k_batch10')
-    _needlebench_8k_parallel_en_batch15.append(f'Length{original_context_length}_parallel_en_8k_batch15')
-    _needlebench_8k_parallel_en_batch20.append(f'Length{original_context_length}_parallel_en_8k_batch20')
-    _needlebench_8k_parallel_zh_batch1.append(f'Length{original_context_length}_parallel_zh_8k_batch1')
-    _needlebench_8k_parallel_zh_batch5.append(f'Length{original_context_length}_parallel_zh_8k_batch5')
-    _needlebench_8k_parallel_zh_batch10.append(f'Length{original_context_length}_parallel_zh_8k_batch10')
-    _needlebench_8k_parallel_zh_batch15.append(f'Length{original_context_length}_parallel_zh_8k_batch15')
-    _needlebench_8k_parallel_zh_batch20.append(f'Length{original_context_length}_parallel_zh_8k_batch20')
 depths_list_20 = [i for i in range(0, 101, 5)]  # [0, 5, 10, ..., 100]
 depths_list_10 = [i for i in range(0, 101, 10)]  # [0, 10, 20, ..., 100]


-_needlebench_8k_parallel_batch1 = _needlebench_8k_parallel_en_batch1 + _needlebench_8k_parallel_zh_batch1
-_needlebench_8k_parallel_batch5 = _needlebench_8k_parallel_en_batch5 + _needlebench_8k_parallel_zh_batch5
-_needlebench_8k_parallel_batch10 = _needlebench_8k_parallel_en_batch10 + _needlebench_8k_parallel_zh_batch10
-_needlebench_8k_parallel_batch15 = _needlebench_8k_parallel_en_batch15 + _needlebench_8k_parallel_zh_batch15
-_needlebench_8k_parallel_batch20 = _needlebench_8k_parallel_en_batch20 + _needlebench_8k_parallel_zh_batch20
-
-needlebench_summary_groups = [
-    {'name': 'parallel_version_batch1', 'subsets': [[_dataset, 'average_score'] for _dataset in _needlebench_8k_parallel_batch1]},
-    {'name': 'parallel_version_zh_batch1', 'subsets': [[_dataset, 'average_score'] for _dataset in _needlebench_8k_parallel_zh_batch1]},
-    {'name': 'parallel_version_en_batch1', 'subsets': [[_dataset, 'average_score'] for _dataset in _needlebench_8k_parallel_en_batch1]},
-    {'name': 'parallel_version_batch5', 'subsets': [[_dataset, 'average_score'] for _dataset in _needlebench_8k_parallel_batch5]},
-    {'name': 'parallel_version_zh_batch5', 'subsets': [[_dataset, 'average_score'] for _dataset in _needlebench_8k_parallel_zh_batch5]},
-    {'name': 'parallel_version_en_batch5', 'subsets': [[_dataset, 'average_score'] for _dataset in _needlebench_8k_parallel_en_batch5]},
-    {'name': 'parallel_version_batch10', 'subsets': [[_dataset, 'average_score'] for _dataset in _needlebench_8k_parallel_batch10]},
-    {'name': 'parallel_version_zh_batch10', 'subsets': [[_dataset, 'average_score'] for _dataset in _needlebench_8k_parallel_zh_batch10]},
-    {'name': 'parallel_version_en_batch10', 'subsets': [[_dataset, 'average_score'] for _dataset in _needlebench_8k_parallel_en_batch10]},
-    {'name': 'parallel_version_batch15', 'subsets': [[_dataset, 'average_score'] for _dataset in _needlebench_8k_parallel_batch15]},
-    {'name': 'parallel_version_zh_batch15', 'subsets': [[_dataset, 'average_score'] for _dataset in _needlebench_8k_parallel_zh_batch15]},
-    {'name': 'parallel_version_en_batch15', 'subsets': [[_dataset, 'average_score'] for _dataset in _needlebench_8k_parallel_en_batch15]},
-    {'name': 'parallel_version_batch20', 'subsets': [[_dataset, 'average_score'] for _dataset in _needlebench_8k_parallel_batch20]},
-    {'name': 'parallel_version_zh_batch20', 'subsets': [[_dataset, 'average_score'] for _dataset in _needlebench_8k_parallel_zh_batch20]},
-    {'name': 'parallel_version_en_batch20', 'subsets': [[_dataset, 'average_score'] for _dataset in _needlebench_8k_parallel_en_batch20]},
-]
-
-needlebench_8k_batch_overall_summarizer = dict(
-    dataset_abbrs=[
-        '--------- NeedleBench-8k Parallel-Needles ---------',  # category
-        'parallel_version_batch1',
-        'parallel_version_batch5',
-        'parallel_version_batch10',
-        'parallel_version_batch15',
-        'parallel_version_batch20',
-        'parallel_version_zh_batch1',
-        'parallel_version_en_batch1',
-        'parallel_version_zh_batch5',
-        'parallel_version_en_batch5',
-        'parallel_version_zh_batch10',
-        'parallel_version_en_batch10',
-        'parallel_version_zh_batch15',
-        'parallel_version_en_batch15',
-        'parallel_version_zh_batch20',
-        'parallel_version_en_batch20',
-    ],
-    summary_groups=needlebench_summary_groups,
-)
-
-needlebench_summary_groups = [
-    {'name': 'parallel_version_batch1', 'subsets': [[_dataset, 'Depth0'] for _dataset in _needlebench_8k_parallel_batch1]},
-    {'name': 'parallel_version_zh_batch1', 'subsets': [[_dataset, 'Depth0'] for _dataset in _needlebench_8k_parallel_zh_batch1]},
-    {'name': 'parallel_version_en_batch1', 'subsets': [[_dataset, 'Depth0'] for _dataset in _needlebench_8k_parallel_en_batch1]},
-    {'name': 'parallel_version_batch5', 'subsets': [[_dataset, 'Depth0'] for _dataset in _needlebench_8k_parallel_batch5]},
-    {'name': 'parallel_version_zh_batch5', 'subsets': [[_dataset, 'Depth0'] for _dataset in _needlebench_8k_parallel_zh_batch5]},
-    {'name': 'parallel_version_en_batch5', 'subsets': [[_dataset, 'Depth0'] for _dataset in _needlebench_8k_parallel_en_batch5]},
-    {'name': 'parallel_version_batch10', 'subsets': [[_dataset, 'Depth0'] for _dataset in _needlebench_8k_parallel_batch10]},
-    {'name': 'parallel_version_zh_batch10', 'subsets': [[_dataset, 'Depth0'] for _dataset in _needlebench_8k_parallel_zh_batch10]},
-    {'name': 'parallel_version_en_batch10', 'subsets': [[_dataset, 'Depth0'] for _dataset in _needlebench_8k_parallel_en_batch10]},
-    {'name': 'parallel_version_batch15', 'subsets': [[_dataset, 'Depth0'] for _dataset in _needlebench_8k_parallel_batch15]},
-    {'name': 'parallel_version_zh_batch15', 'subsets': [[_dataset, 'Depth0'] for _dataset in _needlebench_8k_parallel_zh_batch15]},
-    {'name': 'parallel_version_en_batch15', 'subsets': [[_dataset, 'Depth0'] for _dataset in _needlebench_8k_parallel_en_batch15]},
-    {'name': 'parallel_version_batch20', 'subsets': [[_dataset, 'Depth0'] for _dataset in _needlebench_8k_parallel_batch20]},
-    {'name': 'parallel_version_zh_batch20', 'subsets': [[_dataset, 'Depth0'] for _dataset in _needlebench_8k_parallel_zh_batch20]},
-    {'name': 'parallel_version_en_batch20', 'subsets': [[_dataset, 'Depth0'] for _dataset in _needlebench_8k_parallel_en_batch20]},
-]
-
-needlebench_8k_batch_depth0_summarizer = dict(
-    dataset_abbrs=[
-        '--------- NeedleBench-8k Parallel-Needles ---------',  # category
-        'parallel_version_batch1',
-        'parallel_version_batch5',
-        'parallel_version_batch10',
-        'parallel_version_batch15',
-        'parallel_version_batch20',
-        'parallel_version_zh_batch1',
-        'parallel_version_en_batch1',
-        'parallel_version_zh_batch5',
-        'parallel_version_en_batch5',
-        'parallel_version_zh_batch10',
-        'parallel_version_en_batch10',
-        'parallel_version_zh_batch15',
-        'parallel_version_en_batch15',
-        'parallel_version_zh_batch20',
-        'parallel_version_en_batch20',
-    ],
-    summary_groups=needlebench_summary_groups,
-)
-
-def gen_atc_summarizer(needle_num_list):
-    categories = [
-        'ZH-Direct-CE', 'EN-Direct-CE',
-        'ZH-Reasoning-CE', 'EN-Reasoning-CE'
-    ]
-    needlebench_atc_summary_groups = []
-
-    # Generate summary groups per category
-    for category in categories:
-        # CircularEval (CE) scores use the perf_4 metric; plain accuracy uses acc_1
-        metric = 'perf_4' if 'CE' in category else 'acc_1'
-        # Dataset names in subsets do not carry the CircularEval suffix
-        cleaned_category = category.replace('-CE', '').replace('-Direct', '')
-        needlebench_atc_summary_groups.append({
-            'name': category,
-            'subsets': [
-                [f'NeedleBenchATCDataset-{num_needles}Needle-{cleaned_category}', metric]
-                for num_needles in needle_num_list
-            ],
-            'weights': {f'NeedleBenchATCDataset-{num_needles}Needle-{cleaned_category}': num_needles for num_needles in needle_num_list},
-        })
-
-    needlebench_atc_summary_groups.append({
-        'name': 'ATC-CE-Overall',
-        'subsets': [
-            [f'{category}', 'weighted_average'] for category in categories
-        ],
-    })
-    atc_dataset_abbrs = []
-    atc_dataset_abbrs.append(['ATC-CE-Overall', 'naive_average'])
-
-    for category in categories:
-        weighted_average_score_entry = [f'{category}', 'weighted_average']
-        atc_dataset_abbrs.append(weighted_average_score_entry)
-
-    needlebench_atc_summarizer = dict(
-        dataset_abbrs=[
-            *atc_dataset_abbrs,
-            '######## Needlebench-ATC Accuracy ########',  # category
-            *[[f'NeedleBenchATCDataset-{num_needles}Needle-ZH', 'acc_1'] for num_needles in needle_num_list],
-            '------------------------------------------',
-            *[[f'NeedleBenchATCDataset-{num_needles}Needle-EN', 'acc_1'] for num_needles in needle_num_list],
-            '------------------------------------------',
-            *[[f'NeedleBenchATCDataset-{num_needles}Needle-ZH-Reasoning', 'acc_1'] for num_needles in needle_num_list],
-            '------------------------------------------',
-            *[[f'NeedleBenchATCDataset-{num_needles}Needle-EN-Reasoning', 'acc_1'] for num_needles in needle_num_list],
-            '------------------------------------------',
-            '######## Needlebench-ATC CircularEval ########',  # category
-            *[[f'NeedleBenchATCDataset-{num_needles}Needle-ZH', 'perf_4'] for num_needles in needle_num_list],
-            '------------------------------------------',
-            *[[f'NeedleBenchATCDataset-{num_needles}Needle-EN', 'perf_4'] for num_needles in needle_num_list],
-            '------------------------------------------',
-            *[[f'NeedleBenchATCDataset-{num_needles}Needle-ZH-Reasoning', 'perf_4'] for num_needles in needle_num_list],
-            '------------------------------------------',
-            *[[f'NeedleBenchATCDataset-{num_needles}Needle-EN-Reasoning', 'perf_4'] for num_needles in needle_num_list],
-            '------------------------------------------',
-        ],
-        summary_groups=needlebench_atc_summary_groups
-    )
-    return needlebench_atc_summarizer
-
-
-atc_summarizer_20 = gen_atc_summarizer(list(range(2, 20, 1)))
-atc_summarizer_50 = gen_atc_summarizer(list(range(2, 50, 1)))
-atc_summarizer_80 = gen_atc_summarizer(list(range(2, 80, 1)))
+context_lengths_4k = [1000, 2000, 3000, 4000]
+needlebench_v2_4k_summarizer = create_summarizer(context_lengths_4k, depths_list_10, '4k', mean=True)
+context_lengths_8k = [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000]
+needlebench_v2_8k_summarizer = create_summarizer(context_lengths_8k, depths_list_10, '8k', mean=True)
+context_lengths_32k = [1000, 4000, 8000, 12000, 16000, 20000, 24000, 28000, 32000]
+needlebench_v2_32k_summarizer = create_summarizer(context_lengths_32k, depths_list_10, '32k', mean=True)
+context_lengths_128k = [1000, 2000, 4000, 8000, 16000, 32000, 64000, 128000]
+needlebench_v2_128k_summarizer = create_summarizer(context_lengths_128k, depths_list_10, '128k', mean=True)
+context_lengths_200k = [16000, 48000, 80000, 112000, 128000, 144000, 176000, 200000]
+needlebench_v2_200k_summarizer = create_summarizer(context_lengths_200k, depths_list_10, '200k', mean=True)
+context_lengths_256k = [32000, 128000, 256000]
+needlebench_v2_256k_summarizer = create_summarizer(context_lengths_256k, depths_list_10, '256k', mean=True)
+context_lengths_1000k = [20000, 160000, 300000, 440000, 580000, 720000, 860000, 1000000]
+needlebench_v2_1000k_summarizer = create_summarizer(context_lengths_1000k, depths_list_10, '1000k', mean=True)
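To make the `weights` mapping in `gen_atc_summarizer` above concrete: each ATC subset is weighted by its needle count, so harder multi-needle settings count proportionally more toward the category score. A minimal sketch with hypothetical per-subset scores:

```python
# Hypothetical per-needle-count scores for one category, e.g. 'EN-Reasoning-CE'.
scores = {2: 90.0, 3: 80.0, 4: 70.0, 5: 60.0}

# gen_atc_summarizer sets weights[needle_count] = needle_count.
weights = {k: k for k in scores}
weighted_average = (sum(scores[k] * weights[k] for k in scores)
                    / sum(weights.values()))  # 1000 / 14 ≈ 71.4 here
```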
0 opencompass/datasets/needlebench_v2/__init__.py (new file)
440 opencompass/datasets/needlebench_v2/atc.py (new file)
@@ -0,0 +1,440 @@
# flake8: noqa
import json
import os
import random
import re
from enum import Enum

from datasets import Dataset

from opencompass.datasets.base import BaseDataset
from opencompass.datasets.needlebench_v2.atc_elder_only import (
    NeedleBenchATCEvaluator, clean_atc_answer, needlebench_atc_postprocess_v2)
from opencompass.registry import (ICL_EVALUATORS, LOAD_DATASET,
                                  TEXT_POSTPROCESSORS)
from opencompass.utils import get_data_path


# Enum of the supported question types
class QuestionType(Enum):
    ELDEST_ANCESTOR = 0  # eldest ancestor
    NTH_ANCESTOR = 1  # N-th generation ancestor
    NTH_DESCENDANT = 2  # N-th generation descendant
    RELATIONSHIP_DISTANCE = 3  # relationship distance


# Generation gap implied by each relationship term (one generation or two)
relationship_generation_map_zh = {
    '父亲': 1,
    '母亲': 1,
    '爸爸': 1,
    '妈妈': 1,
    '爷爷': 2,
    '奶奶': 2,
    '姥姥': 2,
    '姥爷': 2,
    '外公': 2,
    '外婆': 2,
}

relationship_generation_map_en = {
    'father': 1,
    'mother': 1,
    'dad': 1,
    'mom': 1,
    'grandfather': 2,
    'grandmother': 2,
    'maternal grandmother': 2,
    'maternal grandfather': 2,
    'paternal grandfather': 2,
    'paternal grandmother': 2,
}

relationship_templates_zh_CN = [
    '{A}是{B}的{relationship}。',
    '{B}的{relationship}是{A}。',
    '{A}作为{B}的{relationship},对{B}的成长有重要影响。',
    '{A}不仅是{B}的{relationship},还是{B}的榜样。',
    '{A}在{B}的成长过程中,不仅仅是{B}的{relationship},还是{B}的监护人。',
    '{A}对{B}来说,不只是一个{relationship},还是一个朋友。',
]

relationship_terms_zh_CN = [
    '父亲',
    '母亲',
    '爸爸',
    '妈妈',
    '爷爷',
    '奶奶',
    '姥姥',
    '姥爷',
    '外公',
    '外婆',
]

relationship_terms_en = [
    'father',
    'mother',
    'dad',
    'mom',
    'grandfather',
    'grandmother',
    'maternal grandmother',
    'maternal grandfather',
    'paternal grandfather',
    'paternal grandmother',
]

relationship_templates_en = [
    "{A} is {B}'s {relationship}.",
    "{B}'s {relationship} is {A}.",
    ("{A}, as {B}'s {relationship}, "
     "has a significant impact on {B}'s upbringing."),
    ("{A} is not only {B}'s {relationship} "
     "but also {B}'s role model."),
    ("During {B}'s upbringing, {A} was not only {B}'s {relationship}, "
     "but also {B}'s guardian."),
    ('For {B}, {A} is not just a {relationship}, '
     'but also a friend.'),
    'For {B}, {A} is more than just a {relationship}; {A} is a lifelong mentor of {B}.',
]

# Eldest ancestor problem template
shuffled_story_with_prompt_zh_CN = """下面是对你的多步推理能力的测试,这个测试叫做祖先追溯测试,我们会模拟不同人的家庭亲属关系,你的任务是在其中不断推理,直到找到最年长的祖先。

例如:
例子1.如果张强的父亲是马克,除此以外提供的文本中没有更多关于亲属关系的信息,那么在提供的文本中张强能够向上追溯到的最年长的亲人就是马克。
例子2.如果李明的姥姥是张红,而张红的父亲是张强,除此以外提供的文本中没有更多关于亲属关系的信息,那么在提供的文本中李明能够向上追溯到的最年长的亲人就是张强。
例子3.如果小明是张红的曾孙女,张红的祖母是王华,王华的父亲是王刚,除此以外提供的文本中没有更多关于亲属关系的信息,那么小明能够向上追溯到的最年长的亲人就是王刚。

注意:
1. 你不必纠结这个测试中的人名的性别关系,例如,一个通常被视为女性化的名字仍然可以是其他人的父亲,我们的重点是谁更年长。
2. 忽略这个测试中的姓氏遗传问题,例如,李明仍然可能是王鹏的亲生父亲,我们只关注谁更年长,不必纠结孩子是否应该继承父亲或母亲的性别。
3. 在回答的最后,将你的答案放在\\boxed{{}}中,例如:"所以{last_person}能向上追溯到的最年长的亲人就是\\boxed{{某人(你的答案)}}"

现在,打乱的家族关系文本如下:
{shuffled_story}

在上面提供的打乱的家族关系文本中,'{last_person}'的能够向上追溯到的最年长的亲人是谁?
"""

shuffled_story_with_prompt_en = """Here is a test for multi-step reasoning ability called the Ancestral Trace Challenge. In this test, we will simulate different people's familial relationships, and your task is to continuously reason through them until you identify the eldest ancestor.

For example:
Example 1: If James Hill's father is Jasmine Lane, and no further information about familial relationships is provided in the text, then the oldest relative James Hill can trace back to in the provided text is \\boxed{{Jasmine Lane}}.
Example 2: If Andrew Williams's grandmother is Dan Newton, and Dan Newton's father is James Hill, and no further information about familial relationships is provided in the text, then the oldest relative Andrew Williams can trace back to in the provided text is \\boxed{{James Hill}}.
Example 3: If Jeff White's father is Kevin Le, Dan Newton's grandmother is Jeff White, and Jeff White's father is Kevin Le, and Shelley Mills is Dan Newton's great-granddaughter, and no further information about familial relationships is provided in the text, then the oldest relative Shelley Mills can trace back to in the provided text is \\boxed{{Kevin Le}}.

Notes:
1. You do not need to worry about the gender consistency of names in this test. For example, a name that is typically considered feminine can still be the father of another person. Our primary focus is on who is older.
2. Ignore surname inheritance issues. For instance, Andrew Williams could still be the biological father of Christopher Baker. We only care about who is older and do not need to consider whether a child should inherit the father's or mother's surname.
3. At the end of your response, remember to put your final answer within \\boxed{{}}. For example: "So the oldest relative '{last_person}' can trace back to in the provided text is \\boxed{{somebody (your answer here)}}."

Now, the scrambled family relationships are provided below:
{shuffled_story}

Given the scrambled family relationships described above, who is the eldest relative that '{last_person}' can trace back to in the context?
"""

# Nth ancestor problem template
nth_ancestor_prompt_zh_CN = """下面是对你的多步推理能力的测试,这个测试叫做祖先追溯测试,我们会模拟不同人的家庭亲属关系,你的任务是在其中不断推理,找到指定人物的特定代祖先。

例如:
例子1.如果张强的父亲是马克,我们说马克是张强的1代祖先。
例子2.如果李明的姥姥是张红(姥姥算两代关系),而张红的父亲是张强,那么张红是李明的2代祖先,张强是李明的3代祖先。
例子3.如果小明的奶奶是王华(奶奶算两代关系),王华的妈妈是刘芳,那么王华是小明的2代祖先,刘芳是小明的3代祖先。

注意:
1. 你不必纠结这个测试中的人名的性别关系,我们只关注辈分关系。
2. 忽略这个测试中的姓氏遗传问题,我们只关注亲属关系。
3. 父亲/母亲/爸爸/妈妈算1代关系,爷爷/奶奶/姥姥/姥爷/外公/外婆算2代关系。
4. 在回答的最后,将你的答案放在\\boxed{{}}中,例如:"所以{person}的{n}代祖先就是\\boxed{{某人(你的答案)}}"

现在,打乱的家族关系文本如下:
{shuffled_story}

在上面提供的打乱的家族关系文本中,'{person}'的{n}代祖先是谁?
"""

nth_ancestor_prompt_en = """Here is a test for multi-step reasoning ability called the Ancestral Trace Challenge. In this test, we will simulate different people's familial relationships, and your task is to identify a specific ancestor of a given person.

For example:
Example 1: If James Hill's father is Jasmine Lane, then Jasmine Lane is James Hill's 1st generation ancestor.
Example 2: If Andrew Williams's grandmother is Dan Newton (grandmother counts as 2 generations), and Dan Newton's father is James Hill, then Dan Newton is Andrew Williams's 2nd generation ancestor, and James Hill is Andrew Williams's 3rd generation ancestor.
Example 3: If Shelley Mills's grandfather is Jeff White (grandfather counts as 2 generations), and Jeff White's mother is Mary Johnson, then Jeff White is Shelley Mills's 2nd generation ancestor, and Mary Johnson is Shelley Mills's 3rd generation ancestor.

Notes:
1. You do not need to worry about the gender consistency of names in this test. We only care about generational relationships.
2. Ignore surname inheritance issues. We only care about familial relationships.
3. Father/mother/dad/mom count as 1 generation, while grandfather/grandmother/maternal grandmother/maternal grandfather/paternal grandfather/paternal grandmother count as 2 generations.
4. At the end of your response, remember to put your final answer within \\boxed{{}}. For example: "So the {n}th generation ancestor of '{person}' is \\boxed{{somebody (your answer here)}}."

Now, the scrambled family relationships are provided below:
{shuffled_story}

Given the scrambled family relationships described above, who is the {n}th generation ancestor of '{person}'?
"""

# Nth descendant problem template
nth_descendant_prompt_zh_CN = """下面是对你的多步推理能力的测试,这个测试叫做家族关系追溯测试,我们会模拟不同人的家庭亲属关系,你的任务是在其中不断推理,找到指定人物的特定代子孙。

例如:
例子1.如果马克是张强的父亲,我们说张强是马克的1代子孙。
例子2.如果张红是李明的姥姥(姥姥算两代关系),而张强是张红的父亲,那么李明是张红的2代子孙,李明是张强的3代子孙。
例子3.如果王华是小明的爷爷(爷爷算两代关系),刘芳是王华的妈妈,那么小明是王华的2代子孙,小明是刘芳的3代子孙。

注意:
1. 你不必纠结这个测试中的人名的性别关系,我们只关注辈分关系。
2. 忽略这个测试中的姓氏遗传问题,我们只关注亲属关系。
3. 父亲/母亲/爸爸/妈妈算1代关系,爷爷/奶奶/姥姥/姥爷/外公/外婆算2代关系。
4. 在回答的最后,将你的答案放在\\boxed{{}}中,例如:"所以{person}的{n}代子孙就是\\boxed{{某人(你的答案)}}"

现在,打乱的家族关系文本如下:
{shuffled_story}

在上面提供的打乱的家族关系文本中,'{person}'的{n}代子孙是谁?
"""

nth_descendant_prompt_en = """Here is a test for multi-step reasoning ability called the Ancestral Trace Challenge. In this test, we will simulate different people's familial relationships, and your task is to identify a specific descendant of a given person.

For example:
Example 1: If Jasmine Lane is James Hill's father, then James Hill is Jasmine Lane's 1st generation descendant.
Example 2: If Dan Newton is Andrew Williams's grandmother (grandmother counts as 2 generations), and James Hill is Dan Newton's father, then Andrew Williams is Dan Newton's 2nd generation descendant, and Andrew Williams is James Hill's 3rd generation descendant.
Example 3: If Jeff White is Shelley Mills's grandfather (grandfather counts as 2 generations), and Mary Johnson is Jeff White's mother, then Shelley Mills is Jeff White's 2nd generation descendant, and Shelley Mills is Mary Johnson's 3rd generation descendant.

Notes:
1. You do not need to worry about the gender consistency of names in this test. We only care about generational relationships.
2. Ignore surname inheritance issues. We only care about familial relationships.
3. Father/mother/dad/mom count as 1 generation, while grandfather/grandmother/maternal grandmother/maternal grandfather/paternal grandfather/paternal grandmother count as 2 generations.
4. At the end of your response, remember to put your final answer within \\boxed{{}}. For example: "So the {n}th generation descendant of '{person}' is \\boxed{{somebody (your answer here)}}."

Now, the scrambled family relationships are provided below:
{shuffled_story}

Given the scrambled family relationships described above, who is the {n}th generation descendant of '{person}'?
"""

# Relationship distance problem template
relationship_distance_prompt_zh_CN = """下面是对你的多步推理能力的测试,这个测试叫做家族关系追溯测试,我们会模拟不同人的家庭亲属关系,你的任务是在其中不断推理,计算两个人之间的关系距离。

关系距离定义为:家族图中从一个人到另一个人所需的最少代数差距。注意不同关系有不同的代数差距,例如:
例子1.如果马克是张强的父亲(父亲算1代关系),那么张强和马克之间的关系距离是1。
例子2.如果张红是李明的姥姥(姥姥算2代关系),而张强是张红的父亲(父亲算1代关系),那么李明和张红之间的关系距离是2,李明和张强之间的关系距离是3。
例子3.如果小明的爷爷是王华(爷爷算2代关系),王华的妈妈是刘芳(妈妈算1代关系),那么小明和王华之间的关系距离是2,小明和刘芳之间的关系距离是3。

注意:
1. 你不必纠结这个测试中的人名的性别关系,我们只关注辈分关系。
2. 忽略这个测试中的姓氏遗传问题,我们只关注亲属关系。
3. 父亲/母亲/爸爸/妈妈算1代关系,爷爷/奶奶/姥姥/姥爷/外公/外婆算2代关系。
4. 在回答的最后,将你的答案放在\\boxed{{}}中,例如:"所以{person_a}和{person_b}之间的关系距离是\\boxed{{5}}"

现在,打乱的家族关系文本如下:
{shuffled_story}

在上面提供的打乱的家族关系文本中,'{person_a}'和'{person_b}'之间的关系距离是多少?
"""

relationship_distance_prompt_en = """Here is a test for multi-step reasoning ability called the Ancestral Trace Challenge. In this test, we will simulate different people's familial relationships, and your task is to calculate the relationship distance between two individuals.

The relationship distance is defined as the minimum number of generational gaps needed to go from one person to another in the family graph. Note that different relationships have different generational gaps. For example:
Example 1: If Jasmine Lane is James Hill's father (father counts as 1 generation), then the relationship distance between James Hill and Jasmine Lane is 1.
Example 2: If Dan Newton is Andrew Williams's grandmother (grandmother counts as 2 generations), and James Hill is Dan Newton's father (father counts as 1 generation), then the relationship distance between Andrew Williams and Dan Newton is 2, and the relationship distance between Andrew Williams and James Hill is 3.
Example 3: If Jeff White is Shelley Mills's grandfather (grandfather counts as 2 generations), and Mary Johnson is Jeff White's mother (mother counts as 1 generation), then the relationship distance between Shelley Mills and Jeff White is 2, and the relationship distance between Shelley Mills and Mary Johnson is 3.

Notes:
1. You do not need to worry about the gender consistency of names in this test. We only care about relationship connections.
2. Ignore surname inheritance issues. We only care about familial relationships.
3. Father/mother/dad/mom count as 1 generation, while grandfather/grandmother/maternal grandmother/maternal grandfather/paternal grandfather/paternal grandmother count as 2 generations.
4. At the end of your response, remember to put your final answer within \\boxed{{}}. For example: "So the relationship distance between '{person_a}' and '{person_b}' is \\boxed{{5}}."

Now, the scrambled family relationships are provided below:
{shuffled_story}

Given the scrambled family relationships described above, what is the relationship distance between '{person_a}' and '{person_b}'?
"""


@LOAD_DATASET.register_module()
class NeedleBenchATCDataset(BaseDataset):

    @staticmethod
    def load(
        path,
        file_name: str,
        num_needles: int,
        language: str,
        repeats: int,
        # Note: this parameter cannot be passed through mmengine because it
        # is blocked as lazy
        question_types: list[QuestionType] = [
            QuestionType.ELDEST_ANCESTOR,
            QuestionType.NTH_ANCESTOR,
            QuestionType.NTH_DESCENDANT,
            QuestionType.RELATIONSHIP_DISTANCE,
        ],  # Support specifying a list of question types
    ):
        data = {'prompt': [], 'answer': [], 'question_type': []}
        path = get_data_path(path)
        if os.environ.get('DATASET_SOURCE') == 'HF':
            from huggingface_hub import snapshot_download

            path = snapshot_download(repo_id=path, repo_type='dataset')
        file_path = os.path.join(path, file_name)

        with open(file_path, 'r', encoding='utf-8') as file:
            names_data = json.load(file)

        all_names = names_data[language].split(',')
        # Ensure question_types is not empty
        if not question_types:
            raise ValueError('question_types cannot be empty')

        for question_type in question_types:
            # Generate the specified number of examples for each question type
            for i in range(repeats):
                # Set a different seed for each question type and repeat.
                # Use the enum value of the question type multiplied by 10000
                # as the base to ensure non-overlapping seed ranges.
                seed = (i + 1) + (10000 * question_type.value)
                random.seed(seed)

                # Randomly select the specified number of names from all names
                # The number of names is num_needles + 1
                names = random.sample(all_names, num_needles + 1)

                # Select the relationship terms and templates for the language
                if language == 'Chinese':
                    relationship_terms = relationship_terms_zh_CN
                    relationship_templates = relationship_templates_zh_CN
                    relationship_map = relationship_generation_map_zh
                elif language == 'English':
                    relationship_terms = relationship_terms_en
                    relationship_templates = relationship_templates_en
                    relationship_map = relationship_generation_map_en
                else:
                    raise ValueError(
                        'Unsupported language specified. '
                        'Please choose either "Chinese" or "English".')

                def generate_chain_family_story(names, templates,
                                                relationship_terms,
                                                relationship_map):
                    story = ''
                    relationships = []
                    total_generations = 0  # Track the total generational difference

                    for i in range(len(names) - 1):
                        template = random.choice(templates)
                        relation_term = random.choice(relationship_terms)
                        relation = template.format(A=names[i],
                                                   B=names[i + 1],
                                                   relationship=relation_term)
                        story += f'{relation}*'

                        # Get the generation difference for this relationship
                        gen_diff = relationship_map.get(
                            relation_term, 1)  # Default to 1 generation
                        total_generations += gen_diff

                        # Record relationship information for later use
                        relationships.append(
                            (names[i], names[i + 1], relation_term, gen_diff))

                    return story, relationships, total_generations

                chain_story, relationships, total_generations = generate_chain_family_story(
                    names, relationship_templates, relationship_terms,
                    relationship_map)

                # Split the chain_story into a list of fragments
                family_story_fragments = chain_story.split('*')
                family_story_fragments = [
                    f for f in family_story_fragments if f
                ]

                # Shuffle the list of fragments
                random.shuffle(family_story_fragments)

                # Join the shuffled fragments back into a string
                shuffled_story = ''.join(family_story_fragments)

                if question_type == QuestionType.ELDEST_ANCESTOR:
                    # Eldest ancestor question
                    last_person = names[-1]
                    if language == 'Chinese':
                        prompt = shuffled_story_with_prompt_zh_CN.format(
                            shuffled_story=shuffled_story,
                            last_person=last_person)
                    else:
                        prompt = shuffled_story_with_prompt_en.format(
                            shuffled_story=shuffled_story,
                            last_person=last_person)
                    answer = names[0]  # The first person is the eldest ancestor

                elif question_type == QuestionType.NTH_ANCESTOR:
                    # Nth ancestor question - trace from the youngest person to the oldest
                    person = names[-1]  # The youngest person (end of the chain)
                    n = total_generations  # Use the calculated total generational difference
                    if language == 'Chinese':
                        prompt = nth_ancestor_prompt_zh_CN.format(
                            shuffled_story=shuffled_story, person=person, n=n)
                    else:
                        prompt = nth_ancestor_prompt_en.format(
                            shuffled_story=shuffled_story, person=person, n=n)
                    answer = names[0]  # The oldest person (start of the chain) is the nth ancestor

                elif question_type == QuestionType.NTH_DESCENDANT:
                    # Nth descendant question - trace from the oldest person to the youngest
                    person = names[0]  # The oldest person (start of the chain)
                    n = total_generations  # Use the calculated total generational difference
                    if language == 'Chinese':
                        prompt = nth_descendant_prompt_zh_CN.format(
                            shuffled_story=shuffled_story, person=person, n=n)
                    else:
                        prompt = nth_descendant_prompt_en.format(
                            shuffled_story=shuffled_story, person=person, n=n)
                    answer = names[-1]  # The youngest person (end of the chain) is the nth descendant

                elif question_type == QuestionType.RELATIONSHIP_DISTANCE:
                    # Relationship distance question - distance between the
                    # two ends of the chain
                    person_a = names[0]  # The oldest person
                    person_b = names[-1]  # The youngest person
                    if language == 'Chinese':
                        prompt = relationship_distance_prompt_zh_CN.format(
                            shuffled_story=shuffled_story,
                            person_a=person_a,
                            person_b=person_b)
                    else:
                        prompt = relationship_distance_prompt_en.format(
                            shuffled_story=shuffled_story,
                            person_a=person_a,
                            person_b=person_b)
                    # Use the calculated total generations as the relationship distance
                    answer = str(total_generations)

                else:
                    # Default fallback to the eldest ancestor question
                    last_person = names[-1]
                    if language == 'Chinese':
                        prompt = shuffled_story_with_prompt_zh_CN.format(
                            shuffled_story=shuffled_story,
                            last_person=last_person)
                    else:
                        prompt = shuffled_story_with_prompt_en.format(
                            shuffled_story=shuffled_story,
                            last_person=last_person)
                    answer = names[0]  # The first person is the eldest ancestor

                data['prompt'].append(prompt)
                data['answer'].append(answer)
                data['question_type'].append(question_type.name)

        dataset = Dataset.from_dict({
            'prompt': data['prompt'],
            'answer': data['answer'],
            'question_type': data['question_type'],
        })
        return dataset
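A minimal loading sketch for this dataset (the `file_name` and the argument values below are illustrative assumptions, not fixed benchmark settings; per the code above, the names file is a JSON object with comma-separated name lists under 'English'/'Chinese' keys):

```python
# Hypothetical example: build an English ATC split with 4-needle chains.
dataset = NeedleBenchATCDataset.load(
    path='opencompass/needlebench',  # resolved via get_data_path
    file_name='names.json',          # hypothetical names file
    num_needles=4,
    language='English',
    repeats=10,
    question_types=[QuestionType.ELDEST_ANCESTOR],
)
print(dataset[0]['prompt'][:200], dataset[0]['answer'])
```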
253 opencompass/datasets/needlebench_v2/atc_elder_only.py (new file)
@@ -0,0 +1,253 @@
# flake8: noqa
import json
import os
import random
import re

from datasets import Dataset

from opencompass.datasets.base import BaseDataset
from opencompass.datasets.math import extract_boxed_answer
from opencompass.openicl.icl_evaluator import BaseEvaluator
from opencompass.registry import (ICL_EVALUATORS, LOAD_DATASET,
                                  TEXT_POSTPROCESSORS)
from opencompass.utils import get_data_path

relationship_templates_zh_CN = [
    '{A}是{B}的{relationship}。',
    '{B}的{relationship}是{A}。',
    '{A}作为{B}的{relationship},对{B}的成长有重要影响。',
    '{A}不仅是{B}的{relationship},还是{B}的榜样。',
    '{B}是{A}所生的孩子。',
    '{A}对{B}来说,不只是一个{relationship},还是一个朋友。',
]

relationship_terms_zh_CN = [
    '父亲',
    '母亲',
    '爸爸',
    '妈妈',
    '爷爷',
    '奶奶',
    '姥姥',
    '姥爷',
    '外公',
    '外婆',
]

relationship_terms_en = [
    'father',
    'mother',
    'dad',
    'mom',
    'grandfather',
    'grandmother',
    'maternal grandmother',
    'maternal grandfather',
    'paternal grandfather',
    'paternal grandmother',
]

relationship_templates_en = [
    "{A} is {B}'s {relationship}.",
    "{B}'s {relationship} is {A}.",
    ("{A}, as {B}'s {relationship}, "
     "has a significant impact on {B}'s upbringing."),
    ("{A} is not only {B}'s {relationship} "
     "but also {B}'s role model."),
    '{B} is the child of {A}.',
    ('For {B}, {A} is not just a {relationship}, '
     'but also a friend.'),
    'For {B}, {A} is more than just a {relationship}; {A} is a lifelong mentor of {B}.',
]

shuffled_story_with_prompt_zh_CN = """下面是对你的多步推理能力的测试,这个测试叫做祖先追溯测试,我们会模拟不同人的家庭亲属关系,你的任务是在其中不断推理,直到找到最年长的祖先。

例如:
例子1.如果张强的父亲是马克,除此以外提供的文本中没有更多关于亲属关系的信息,那么在提供的文本中张强能够向上追溯到的最年长的亲人就是马克。
例子2.如果李明的姥姥是张红,而张红的父亲是张强,除此以外提供的文本中没有更多关于亲属关系的信息,那么在提供的文本中李明能够向上追溯到的最年长的亲人就是张强。
例子3.如果小明是张红的曾孙女,张红的祖母是王华,王华的父亲是王刚,除此以外提供的文本中没有更多关于亲属关系的信息,那么小明能够向上追溯到的最年长的亲人就是王刚。

注意:
1. 你不必纠结这个测试中的人名的性别关系,例如,一个通常被视为女性化的名字仍然可以是其他人的父亲,我们的重点是谁更年长。
2. 忽略这个测试中的姓氏遗传问题,例如,李明仍然可能是王鹏的亲生父亲,我们只关注谁更年长,不必纠结孩子是否应该继承父亲或母亲的性别。
3. 在回答的最后,将你的答案放在\\boxed{{}}中,例如:"所以{last_person}能向上追溯到的最年长的亲人就是\\boxed{{某人(你的答案)}}"

现在,打乱的家族关系文本如下:
{shuffled_story}

在上面提供的打乱的家族关系文本中,'{last_person}'的能够向上追溯到的最年长的亲人是谁?
"""

shuffled_story_with_prompt_en = """Here is a test for multi-step reasoning ability called the Ancestral Trace Challenge. In this test, we will simulate different people's familial relationships, and your task is to continuously reason through them until you identify the eldest ancestor.

For example:
Example 1: If James Hill's father is Jasmine Lane, and no further information about familial relationships is provided in the text, then the oldest relative James Hill can trace back to in the provided text is \\boxed{{Jasmine Lane}}.
Example 2: If Andrew Williams's grandmother is Dan Newton, and Dan Newton's father is James Hill, and no further information about familial relationships is provided in the text, then the oldest relative Andrew Williams can trace back to in the provided text is \\boxed{{James Hill}}.
Example 3: If Jeff White's father is Kevin Le, Dan Newton's grandmother is Jeff White, and Jeff White's father is Kevin Le, and Shelley Mills is Dan Newton's great-granddaughter, and no further information about familial relationships is provided in the text, then the oldest relative Shelley Mills can trace back to in the provided text is \\boxed{{Kevin Le}}.

Notes:
1. You do not need to worry about the gender consistency of names in this test. For example, a name that is typically considered feminine can still be the father of another person. Our primary focus is on who is older.
2. Ignore surname inheritance issues. For instance, Andrew Williams could still be the biological father of Christopher Baker. We only care about who is older and do not need to consider whether a child should inherit the father's or mother's surname.
3. At the end of your response, remember to put your final answer within \\boxed{{}}. For example: "So the oldest relative '{last_person}' can trace back to in the provided text is \\boxed{{somebody (your answer here)}}."

Now, the scrambled family relationships are provided below:
{shuffled_story}

Given the scrambled family relationships described above, who is the eldest relative that '{last_person}' can trace back to in the context?
"""


@LOAD_DATASET.register_module()
class NeedleBenchATCDataset(BaseDataset):

    @staticmethod
    def load(
        path,
        file_name: str,
        num_needles: int,
        language: str,
        repeats: int,
    ):
        data = {'prompt': [], 'answer': []}
        path = get_data_path(path)
        if os.environ.get('DATASET_SOURCE') == 'HF':
            from huggingface_hub import snapshot_download

            path = snapshot_download(repo_id=path, repo_type='dataset')
        file_path = os.path.join(path, file_name)

        with open(file_path, 'r', encoding='utf-8') as file:
            names_data = json.load(file)

        all_names = names_data[language].split(',')

        for i in range(repeats):
            # Use a fixed seed per repeat so samples stay stable across runs
            seed = i
            random.seed(seed)

            names = random.sample(all_names, num_needles)
            if language == 'Chinese':
                relationship_terms = relationship_terms_zh_CN
                relationship_templates = relationship_templates_zh_CN
            elif language == 'English':
                relationship_terms = relationship_terms_en
                relationship_templates = relationship_templates_en
            else:
                raise ValueError('Unsupported language specified. '
                                 "Please choose either 'Chinese' or 'English'.")

            def generate_chain_family_story(names, templates,
                                            relationship_terms):
                story = ''
                for i in range(len(names) - 1):
                    template = random.choice(templates)
                    relation_term = random.choice(relationship_terms)
                    relation = template.format(A=names[i],
                                               B=names[i + 1],
                                               relationship=relation_term)
                    story += f'{relation}*'
                return story

            chain_story = generate_chain_family_story(names,
                                                      relationship_templates,
                                                      relationship_terms)

            # Splitting the chain_story into a list of fragments
            family_story_fragments = chain_story.split('*')

            # Shuffling the list of fragments
            random.shuffle(family_story_fragments)

            # Joining the shuffled fragments back into a string
            shuffled_story = ''.join(family_story_fragments)

            last_person = names[-1]

            # Generating the prompt based on the language (already validated)
            if language == 'Chinese':
                shuffled_story_with_prompt = shuffled_story_with_prompt_zh_CN.format(
                    shuffled_story=shuffled_story, last_person=last_person)
            else:
                shuffled_story_with_prompt = shuffled_story_with_prompt_en.format(
                    shuffled_story=shuffled_story, last_person=last_person)

            data['prompt'].append(shuffled_story_with_prompt)
            data['answer'].append(names[0])

        dataset = Dataset.from_dict({
            'prompt': data['prompt'],
            'answer': data['answer'],
        })
        return dataset
def clean_atc_answer(text: str) -> str:
    """Clean answer format specifically for QwQ-32B-Preview model.

    Args:
        text: Raw prediction text

    Returns:
        Standardized name format after cleaning
    """
    if not text or text == 'None':
        return 'None'

    # Remove LaTeX commands but keep content
    text = re.sub(r'\\text\{([^}]+)\}', r'\1', text)
    text = re.sub(r'\\boxed\{([^}]+)\}', r'\1', text)
    text = re.sub(r'\\[\[\]]', '', text)

    # Remove extra backslashes
    text = text.replace('\\\\', '').replace('\\', '')

    # Handle extra spaces
    text = re.sub(r'\s+', ' ', text).strip()

    # Remove quotes
    text = text.replace('"', '').replace("'", '')
    # Remove tildes
    text = text.replace('~', ' ')

    return text


@TEXT_POSTPROCESSORS.register_module('needlebench_atc_postprocess_v2')
def needlebench_atc_postprocess_v2(text: str) -> str:

    cand_ans = extract_boxed_answer(text, strip_double_curly_brace=True)

    if cand_ans:
        return clean_atc_answer(cand_ans)
    return 'None'
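For illustration, a minimal sketch of how this postprocessor reduces a raw completion to a comparable answer string (the sample completion is invented, and it assumes `extract_boxed_answer` from `opencompass.datasets.math` pulls the content of the `\boxed{...}` span):

```python
raw = r'Tracing the chain step by step, the eldest relative is \boxed{James Hill}.'
print(needlebench_atc_postprocess_v2(raw))  # expected: 'James Hill'
```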
@ICL_EVALUATORS.register_module('needlebench_atc_evaluator')
|
||||
class NeedleBenchATCEvaluator(BaseEvaluator):
|
||||
|
||||
def score(self, predictions, gold):
|
||||
if len(predictions) != len(gold):
|
||||
return {'error': 'predictions and gold have different lengths'}
|
||||
|
||||
correct_count = 0
|
||||
details = []
|
||||
|
||||
for prediction, reference in zip(predictions, gold):
|
||||
reference_name = reference
|
||||
if prediction.strip() == reference_name.strip():
|
||||
correct_count += 1
|
||||
|
||||
detail = {
|
||||
'pred': prediction,
|
||||
'answer': reference_name,
|
||||
'correct': prediction.strip() == reference_name.strip()
|
||||
}
|
||||
details.append(detail)
|
||||
|
||||
accuracy = (correct_count /
|
||||
len(predictions)) * 100 if predictions else 0
|
||||
result = {'score': accuracy, 'details': details}
|
||||
return result
|
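To make the postprocessing concrete, here is a minimal, self-contained sketch of what the cleanup above does to a raw model reply. The `demo_clean` helper and the sample string are invented for illustration and are not part of the patch:

```python
import re

def demo_clean(text: str) -> str:
    # Mirrors clean_atc_answer: unwrap \text{...} and \boxed{...}, drop stray
    # backslashes, collapse whitespace, strip quotes, map tildes to spaces.
    text = re.sub(r'\\text\{([^}]+)\}', r'\1', text)
    text = re.sub(r'\\boxed\{([^}]+)\}', r'\1', text)
    text = text.replace('\\\\', '').replace('\\', '')
    text = re.sub(r'\s+', ' ', text).strip()
    return text.replace('"', '').replace("'", '').replace('~', ' ')

print(demo_clean(r'\boxed{\text{James~Hill}}'))  # -> James Hill
```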
300 opencompass/datasets/needlebench_v2/multi.py Normal file
@@ -0,0 +1,300 @@

# flake8: noqa: E501
import json
import os
import random

import tiktoken
from datasets import Dataset
from huggingface_hub import hf_hub_download

from opencompass.datasets.base import BaseDataset
from opencompass.datasets.needlebench_v2.atc import (
    relationship_templates_en, relationship_templates_zh_CN,
    relationship_terms_en, relationship_terms_zh_CN)
from opencompass.registry import LOAD_DATASET


def get_random_needles(counter, file_path, num_needles, language):
    with open(file_path, 'r', encoding='utf-8') as file:
        names_data = json.load(file)

    all_names = names_data[language].split(',')

    random.seed(counter)
    names = random.sample(all_names, num_needles)

    if language == 'Chinese':
        relationship_terms = relationship_terms_zh_CN
        relationship_templates = relationship_templates_zh_CN
    elif language == 'English':
        relationship_terms = relationship_terms_en
        relationship_templates = relationship_templates_en
    else:
        raise ValueError(f"Unsupported language '{language}' specified.")

    def generate_chain_family_story(names, templates, relationship_terms):
        story = ''
        for i in range(len(names) - 1):
            template = random.choice(templates)
            relation_term = random.choice(relationship_terms)
            relation = template.format(A=names[i],
                                       B=names[i + 1],
                                       relationship=relation_term)
            story += f'{relation}*'
        return story

    chain_story = generate_chain_family_story(names, relationship_templates,
                                              relationship_terms)

    # Splitting the chain_story into a list of fragments
    family_story_fragments = chain_story.split('*')

    # Removing the empty string from the list
    family_story_fragments = [
        fragment for fragment in family_story_fragments if fragment
    ]

    # Shuffling the list of fragments
    random.shuffle(family_story_fragments)

    last_person = names[-1]

    # Generating the retrieval question based on the language
    if language == 'Chinese':
retrieval_question = f"在上面提供的文本中,'{last_person}'的能够向上追溯到的最年长的亲人是谁?"
|
||||
    elif language == 'English':
        retrieval_question = f"Given the context described above, who is the eldest relative that '{last_person}' can trace back to in the context?"

    # Returning the story, answer, and retrieval question
    return {
        'needles': family_story_fragments,
        'answer': names[0],
        'retrieval_question': retrieval_question,
        'last_person': last_person
    }
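# Illustrative note (not part of the original file): for num_needles=3 and
# language='English', the dict returned above looks roughly like
#     {
#         'needles': ['<fragment linking A and B>', '<fragment linking B and C>'],
#         'answer': 'A',                # names[0], the designated eldest relative
#         'retrieval_question': "... who is the eldest relative that 'C' ...",
#         'last_person': 'C',           # names[-1], where the trace starts
#     }
# with one shuffled fragment per consecutive pair of sampled names.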


@LOAD_DATASET.register_module()
class NeedleBenchMultiDataset(BaseDataset):

    @staticmethod
    def load(
        path: str,  # deprecated
        length: int,
        depth: int,
        tokenizer_model: str,
        file_list: 'list[str]',
        num_repeats_per_file: int,
        length_buffer: int,
        language: str,
        needle_file_name: str,
        num_needles: int,
        diff: int,
        quesiton_position: str = 'End',
    ):
        data = {'prompt': [], 'answer': []}
        tokenizer = tiktoken.encoding_for_model(tokenizer_model)

        def _generate_context(tokens_context, depth_percent, needles):
            tokens_needle = [
                _get_tokens_from_context(needle) for needle in needles
            ]
            insertion_points = []
            total_length = len(tokens_context)

            # The first needle lands at depth_percent; each later needle is
            # offset from the previous one by `diff` percent of the context.
            for i, needle_tokens in enumerate(tokens_needle):
                if i == 0:
                    insertion_point = int(total_length * (depth_percent / 100))
                else:
                    insertion_point = int(insertion_points[i - 1] +
                                          len(tokens_needle[i - 1]) +
                                          total_length * (diff / 100))
                insertion_point = min(
                    insertion_point,
                    total_length + sum(len(tn) for tn in tokens_needle[:i]))
                insertion_points.append(insertion_point)

            for i, needle_tokens in enumerate(tokens_needle):
                tokens_context = tokens_context[:insertion_points[i]] \
                    + needle_tokens + tokens_context[insertion_points[i]:]
                for j in range(i + 1, len(insertion_points)):
                    insertion_points[j] += len(needle_tokens)

            new_context = _decode_tokens(tokens_context)
            return new_context

        def _get_tokens_from_context(context):
            if isinstance(context, list):
                return [tokenizer.encode(item) for item in context]
            else:
                return tokenizer.encode(context)

        def _decode_tokens(tokens):
            return tokenizer.decode(tokens)

        def _generate_prompt(context, retrieval_question, last_person):
            if language == 'Chinese':
                if quesiton_position == 'End':
                    prompt = f'''这是一个长文本能力的测试,你需要首先阅读下面的长文档,然后根据文档中的信息回答最后的问题。
长文档的内容如下

<文档>
{context}
</文档>

根据文档中的信息,现在请问:{retrieval_question}

例如:
例子1.如果张强的父亲是马克,除此以外提供的文本中没有更多关于亲属关系的信息,那么在提供的文本中张强能够向上追溯到的最年长的亲人就是马克。
例子2.如果李明的姥姥是张红,而张红的父亲是张强,除此以外提供的文本中没有更多关于亲属关系的信息,那么在提供的文本中李明能够向上追溯到的最年长的亲人就是张强。
例子3.如果小明是张红的曾孙女,张红的祖母是王华,王华的父亲是王刚,除此以外提供的文本中没有更多关于亲属关系的信息,那么小明能够向上追溯到的最年长的亲人就是王刚。

注意:
1. 你不必纠结这个测试中的人名的性别关系,例如,一个通常被视为女性化的名字仍然可以是其他人的父亲,我们的重点是谁更年长。
2. 忽略这个测试中的姓氏遗传问题,例如,李明仍然可能是王鹏的亲生父亲,我们只关注谁更年长,不必纠结孩子是否应该继承父亲或母亲的性别。
3. 在回答的最后,将你的答案放在\\boxed{{}}中,例如:“所以{last_person}能向上追溯到的最年长的亲人就是\\boxed{{(你的答案)}}”

'''
                elif quesiton_position == 'Start':
                    prompt = f'''这是一个长文本能力的测试,你需要首先阅读下面的问题,然后根据最后长文档中的信息回答下面的问题。
现在请问:{retrieval_question}

例如:
例子1.如果张强的父亲是马克,除此以外提供的文本中没有更多关于亲属关系的信息,那么在提供的文本中张强能够向上追溯到的最年长的亲人就是马克。
例子2.如果李明的姥姥是张红,而张红的父亲是张强,除此以外提供的文本中没有更多关于亲属关系的信息,那么在提供的文本中李明能够向上追溯到的最年长的亲人就是张强。
例子3.如果小明是张红的曾孙女,张红的祖母是王华,王华的父亲是王刚,除此以外提供的文本中没有更多关于亲属关系的信息,那么小明能够向上追溯到的最年长的亲人就是王刚。

注意:
1. 你不必纠结这个测试中的人名的性别关系,例如,一个通常被视为女性化的名字仍然可以是其他人的父亲,我们的重点是谁更年长。
2. 忽略这个测试中的姓氏遗传问题,例如,李明仍然可能是王鹏的亲生父亲,我们只关注谁更年长,不必纠结孩子是否应该继承父亲或母亲的性别。
3. 在回答的最后,将你的答案放在\\boxed{{}}中,例如:“所以{last_person}能向上追溯到的最年长的亲人就是\\boxed{{(你的答案)}}”

长文档的内容如下

<文档>
{context}
</文档>

'''
                else:
                    raise ValueError('Unsupported quesiton_position. '
                                     'Position must be "End" or "Start".')
            elif language == 'English':
                if quesiton_position == 'End':
                    prompt = f'''This is a test of long-text capability. You need to first read the long document below, and then answer the final question based on the information in the document.
The content of the long document is as follows

<Document>
{context}
</Document>

Based on the information in the document, now please answer: {retrieval_question}

For example:
Example 1: If James Hill's father is Jasmine Lane, and no further information about familial relationships is provided in the text, then the oldest relative James Hill can trace back to in the provided text is \\boxed{{Jasmine Lane}}.
Example 2: If Andrew Williams's grandmother is Dan Newton, and Dan Newton's father is James Hill, and no further information about familial relationships is provided in the text, then the oldest relative Andrew Williams can trace back to in the provided text is \\boxed{{James Hill}}.
Example 3: If Jeff White's father is Kevin Le, and Dan Newton's grandmother is Jeff White, and Shelley Mills is Dan Newton's great-granddaughter, and no further information about familial relationships is provided in the text, then the oldest relative Shelley Mills can trace back to in the provided text is \\boxed{{Kevin Le}}.

Notes:
1. You do not need to worry about the gender consistency of names in this test. For example, a name that is typically considered feminine can still be the father of another person. Our primary focus is on who is older.
2. Ignore surname inheritance issues. For instance, Andrew Williams could still be the biological father of Christopher Baker. We only care about who is older and do not need to consider whether a child should inherit the father's or mother's surname.
3. At the end of your response, remember to put your final answer within \\boxed{{}}. For example: "So the oldest relative '{last_person}' can trace back to in the provided text is \\boxed{{(your answer here)}}."

'''
                elif quesiton_position == 'Start':
                    prompt = f'''This is a test of long-text capability. You need to first read the question below, and then answer it based on the information in the long document that follows.
Now please answer: {retrieval_question}

For example:
Example 1: If James Hill's father is Jasmine Lane, and no further information about familial relationships is provided in the text, then the oldest relative James Hill can trace back to in the provided text is \\boxed{{Jasmine Lane}}.
Example 2: If Andrew Williams's grandmother is Dan Newton, and Dan Newton's father is James Hill, and no further information about familial relationships is provided in the text, then the oldest relative Andrew Williams can trace back to in the provided text is \\boxed{{James Hill}}.
Example 3: If Jeff White's father is Kevin Le, and Dan Newton's grandmother is Jeff White, and Shelley Mills is Dan Newton's great-granddaughter, and no further information about familial relationships is provided in the text, then the oldest relative Shelley Mills can trace back to in the provided text is \\boxed{{Kevin Le}}.

Notes:
1. You do not need to worry about the gender consistency of names in this test. For example, a name that is typically considered feminine can still be the father of another person. Our primary focus is on who is older.
2. Ignore surname inheritance issues. For instance, Andrew Williams could still be the biological father of Christopher Baker. We only care about who is older and do not need to consider whether a child should inherit the father's or mother's surname.
3. At the end of your response, remember to put your final answer within \\boxed{{}}. For example: "So the oldest relative '{last_person}' can trace back to in the provided text is \\boxed{{(your answer here)}}."

The content of the long document is as follows

<Document>
{context}
</Document>

'''
                else:
                    raise ValueError(
                        f'Unsupported quesiton_position {quesiton_position}. '
                        'Position must be "End" or "Start".')
            else:
                raise ValueError(f"Language '{language}' is not supported.")

            return prompt

        repo_id = 'opencompass/NeedleBench'
        file_names = [
            'PaulGrahamEssays.jsonl', 'names.json', 'zh_finance.jsonl',
            'zh_game.jsonl', 'zh_general.jsonl', 'zh_government.jsonl',
            'zh_movie.jsonl', 'zh_tech.jsonl'
        ]
        downloaded_files = []
        base_file_path = ''
        for file_name in file_names:
            file_path = hf_hub_download(repo_id=repo_id,
                                        filename=file_name,
                                        repo_type='dataset')
            downloaded_files.append(file_path)
            base_file_path = '/'.join(file_path.split('/')[:-1])

        needle_file_path = os.path.join(base_file_path, needle_file_name)
        for file_path in downloaded_files:
            if file_path.split('/')[-1] not in file_list:
                continue

            with open(file_path, 'r', encoding='utf-8') as f:
                lines_bak = [json.loads(line.strip()) for line in f]
            lines = lines_bak.copy()
            for counter in range(num_repeats_per_file):
                random.seed(counter)
                random.shuffle(lines)
                random_needle_data = get_random_needles(
                    counter, needle_file_path, num_needles + 1, language)
                last_person = random_needle_data['last_person']
                needles = [
                    '\n' + needle + '\n'
                    for needle in random_needle_data['needles']
                ]
                answer = random_needle_data['answer']
                keyword = answer
                retrieval_question = random_needle_data['retrieval_question']
                context_length = length - length_buffer
                target_length_per_record = context_length - \
                    sum(len(tokens) for tokens
                        in _get_tokens_from_context(needles))
                target_length_per_record = max(target_length_per_record, 0)
                accumulated_tokens = []
                for line in lines:
                    tokens_current_line = _get_tokens_from_context(
                        line['text'])
                    accumulated_tokens.extend(tokens_current_line)

                    if len(accumulated_tokens) >= target_length_per_record:
                        break

                processed_text = _generate_context(
                    accumulated_tokens[:target_length_per_record], depth,
                    needles)

                processed_prompt = _generate_prompt(processed_text,
                                                    retrieval_question,
                                                    last_person)

                data['prompt'].append(processed_prompt)
                data['answer'].append(keyword)

        dataset = Dataset.from_dict({
            'prompt': data['prompt'],
            'answer': data['answer'],
        })
        return dataset
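For orientation, a config would invoke this loader roughly as follows. All argument values here are illustrative only; the parameter names come from the signature above, and the loader fetches its corpora from the `opencompass/NeedleBench` dataset on the Hugging Face Hub:

```python
from opencompass.datasets.needlebench_v2.multi import NeedleBenchMultiDataset

# Hypothetical settings: two English needles, the first at 20% depth and the
# second offset a further 10% into a ~32k-token haystack of Paul Graham essays.
dataset = NeedleBenchMultiDataset.load(
    path='opencompass/NeedleBench',  # unused by this loader (deprecated)
    length=32000,
    depth=20,
    tokenizer_model='gpt-4',
    file_list=['PaulGrahamEssays.jsonl'],
    num_repeats_per_file=10,
    length_buffer=600,
    language='English',
    needle_file_name='names.json',
    num_needles=2,
    diff=10,
    quesiton_position='End',
)
print(dataset[0]['answer'])  # the eldest relative in the sampled chain
```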
222 opencompass/datasets/needlebench_v2/origin.py Normal file
@@ -0,0 +1,222 @@

# flake8: noqa: E501
import json
import os
import random
import re

import tiktoken
from datasets import Dataset

from opencompass.datasets.base import BaseDataset
from opencompass.openicl import BaseEvaluator
from opencompass.registry import LOAD_DATASET, TEXT_POSTPROCESSORS
from opencompass.utils import get_data_path


def get_random_line_by_language(counter, file_path, language):
    with open(file_path, 'r', encoding='utf-8') as file:
        lines = [
            json.loads(line.strip()) for line in file
            if json.loads(line.strip())['language'] == language
        ]

    if lines:
        random.seed(counter)
        random_line = random.choice(lines)
        return {
            'needle': random_line['needle'],
            'retrieval_question': random_line['retrieval_question'],
            'keyword': random_line['arg2']
        }
    else:
        return None


@LOAD_DATASET.register_module()
class NeedleBenchOriginDataset(BaseDataset):

    @staticmethod
    def load(
        path: str,
        length: int,
        depth: int,
        tokenizer_model: str,
        file_list: list[str],
        num_repeats_per_file: int,
        length_buffer: int,
        language: str,
        needle_file_name: str,
        quesiton_position: str = 'End',
    ):
        data = {'prompt': [], 'answer': []}
        tokenizer = tiktoken.encoding_for_model(tokenizer_model)

        def _generate_context(tokens_context, depth_percent, needle):
            tokens_needle = _get_tokens_from_context(needle)
            insertion_point = int(len(tokens_context) * (depth_percent / 100))
            tokens_context = (tokens_context[:insertion_point] +
                              tokens_needle + tokens_context[insertion_point:])
            new_context = _decode_tokens(tokens_context)
            return new_context

        def _get_tokens_from_context(context):
            return tokenizer.encode(context)

        def _decode_tokens(tokens):
            return tokenizer.decode(tokens)

        def _generate_prompt(context, retrieval_question):
            if language == 'Chinese':
                if quesiton_position == 'End':
                    prompt = f'''这是一个长文本能力的测试,你需要首先阅读下面的长文档,然后根据文档中的信息回答最后的问题。
长文档的内容如下

<文档>
{context}
</文档>

根据文档中的信息,现在请问:{retrieval_question}
'''
                elif quesiton_position == 'Start':
                    prompt = f'''这是一个长文本能力的测试,你需要首先阅读下面的问题,然后根据最后长文档中的信息回答下面的问题。
现在请问:{retrieval_question}

长文档的内容如下

<文档>
{context}
</文档>

'''
                else:
                    raise ValueError('Unsupported quesiton_position. '
                                     'Position must be "End" or "Start".')
            elif language == 'English':
                if quesiton_position == 'End':
                    prompt = f'''This is a test of long-text capability. You need to first read the long document below, and then answer the final question based on the information in the document.
The content of the long document is as follows

<Document>
{context}
</Document>

Based on the information in the document, now please answer: {retrieval_question}
'''
                elif quesiton_position == 'Start':
                    prompt = f'''This is a test of long-text capability. You need to first read the question below, and then answer it based on the information in the long document that follows.
Now please answer: {retrieval_question}

The content of the long document is as follows

<Document>
{context}
</Document>

'''
                else:
                    raise ValueError(
                        f'Unsupported quesiton_position {quesiton_position}. '
                        'Position must be "End" or "Start".')
            else:
                raise ValueError(f"Language '{language}' is not supported.")

            return prompt

        file_names = [
            'en_un_asr.jsonl', 'zh_all.jsonl', 'PaulGrahamEssays.jsonl',
            'multi_needle_reasoning_en.json', 'multi_needle_reasoning_zh.json',
            'zh_finance.jsonl', 'zh_game.jsonl', 'zh_general.jsonl',
            'zh_government.jsonl', 'zh_movie.jsonl', 'zh_tech.jsonl'
        ]
        path = get_data_path(path)
        if os.environ.get('DATASET_SOURCE') == 'HF':
            from huggingface_hub import snapshot_download
            path = snapshot_download(repo_id=path, repo_type='dataset')
        needle_file_path = os.path.join(path, needle_file_name)

        for file_name in file_names:
            file_path = os.path.join(path, file_name)
            if file_name not in file_list:
                continue

            with open(file_path, 'r', encoding='utf-8') as f:
                lines_bak = [json.loads(line.strip()) for line in f]
            lines = lines_bak.copy()
            for counter in range(num_repeats_per_file):
                random.seed(counter)
                random.shuffle(lines)
                random_needle = get_random_line_by_language(
                    counter, needle_file_path, language)
                needle = '\n' + random_needle['needle'] + '\n'
                retrieval_question = random_needle['retrieval_question']
                keyword = random_needle['keyword']

                context_length = length - length_buffer
                target_length_per_record = context_length - len(
                    _get_tokens_from_context(needle))
                target_length_per_record = max(target_length_per_record, 0)
                accumulated_tokens = []
                for line in lines:
                    tokens_current_line = _get_tokens_from_context(
                        line['text'])
                    accumulated_tokens.extend(tokens_current_line)

                    if len(accumulated_tokens) >= target_length_per_record:
                        break

                processed_text = _generate_context(
                    accumulated_tokens[:target_length_per_record], depth,
                    needle)

                processed_prompt = _generate_prompt(processed_text,
                                                    retrieval_question)

                data['prompt'].append(processed_prompt)
                data['answer'].append(needle + '*' + keyword)

        dataset = Dataset.from_dict({
            'prompt': data['prompt'],
            'answer': data['answer'],
        })
        return dataset


class NeedleBenchOriginEvaluator(BaseEvaluator):

    def score(self, predictions, gold):
        if len(predictions) != len(gold):
            return {'error': 'predictions and gold have different lengths'}

        total_score = 0
        details = []
        for prediction, reference in zip(predictions, gold):
            keyword = reference.split('*')[1]
            reference = reference.split('*')[0]
            raw_prediction = prediction
            prediction = re.sub(r'\s+', '', prediction)
            reference = re.sub(r'\s+', '', reference)

            # Credit the prediction if the gold keyword appears anywhere in it
            if keyword in raw_prediction:
                score = 100
            else:
                score = 0

            detail = {'pred': prediction, 'answer': reference, 'score': score}
            total_score += score
            details.append(detail)

        average_score = total_score / len(predictions) if predictions else 0
        result = {'score': average_score, 'details': details}
        return result


@TEXT_POSTPROCESSORS.register_module('needlebench')
def needlebench_postprocess(text: str) -> str:
    return text


@TEXT_POSTPROCESSORS.register_module('needlebench_dataset_postprocess')
def needlebench_dataset_postprocess(text: str) -> str:
    return text
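The single-needle splice above reduces to one line of arithmetic; here is a self-contained check with invented numbers:

```python
# A depth_percent of 25 places the needle a quarter of the way into the
# context: offset = int(len(tokens) * 25 / 100). Depth 0 is the very start
# of the haystack, depth 100 the very end.
tokens = list(range(10000))   # stand-in for the encoded haystack tokens
needle = ['N1', 'N2', 'N3']   # stand-in for the encoded needle tokens
insertion_point = int(len(tokens) * (25 / 100))
spliced = tokens[:insertion_point] + needle + tokens[insertion_point:]
assert spliced[2500:2503] == ['N1', 'N2', 'N3']
```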
308 opencompass/datasets/needlebench_v2/parallel.py Normal file
@@ -0,0 +1,308 @@

# flake8: noqa: E501
import json
import os
import random

import tiktoken
from datasets import Dataset

from opencompass.datasets.base import BaseDataset
from opencompass.openicl import BaseEvaluator
from opencompass.registry import LOAD_DATASET
from opencompass.utils import get_data_path


def get_unique_entries(
    file_path,
    n,
    language,
    unique_arg1=False,
    unique_arg2=False,
    unique_combination=False,
):
    seen_arg1 = set()
    seen_arg2 = set()
    seen_combinations = set()
    results = []

    with open(file_path, 'r', encoding='utf-8') as file:
        lines = file.readlines()
        random.shuffle(lines)

    for line in lines:
        try:
            entry = json.loads(line.strip())
        except json.JSONDecodeError:
            continue

        if entry.get('language') != language:
            continue

        key1 = entry.get('arg1', '') if unique_arg1 else ''
        key2 = entry.get('arg2', '') if unique_arg2 else ''
        combination = (key1, key2) if unique_combination else ''

        if ((key1 not in seen_arg1 or not unique_arg1)
                and (key2 not in seen_arg2 or not unique_arg2)
                and (combination not in seen_combinations
                     or not unique_combination)):
            seen_arg1.add(key1)
            seen_arg2.add(key2)
            seen_combinations.add(combination)
            results.append(entry)

        if len(results) == n:
            break

    return results
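# Illustrative note (not part of the original file): with unique_arg1=True,
# unique_arg2=True and unique_combination=True, the loop above keeps at most
# one entry per arg1 value, per arg2 value, and per (arg1, arg2) pair, scanning
# the shuffled lines until n entries in the requested language are collected.
# Since random.shuffle is not reseeded here, the selection depends on any seed
# the caller has already set on the global `random` module.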


@LOAD_DATASET.register_module()
class NeedleBenchParallelDataset(BaseDataset):

    @staticmethod
    def load(
        path: str,
        needle_file_name: str,
        length: int,
        depths: list[int],
        tokenizer_model: str,
        file_list: list[str],
        num_repeats_per_file: int,
        length_buffer: int,
        language: str,
        quesiton_position: str = 'End',
    ):
        data = {'prompt': [], 'answer': []}
        tokenizer = tiktoken.encoding_for_model(tokenizer_model)

        file_names = [
            'PaulGrahamEssays.jsonl',
            'multi_needle_reasoning_en.json',
            'multi_needle_reasoning_zh.json',
            'zh_finance.jsonl',
            'zh_game.jsonl',
            'zh_general.jsonl',
            'zh_government.jsonl',
            'zh_movie.jsonl',
            'zh_tech.jsonl',
        ]
        path = get_data_path(path)
        if os.environ.get('DATASET_SOURCE') == 'HF':
            from huggingface_hub import snapshot_download

            path = snapshot_download(repo_id=path, repo_type='dataset')
        needle_file_path = os.path.join(path, needle_file_name)

        # One needle per requested depth, with unique questions and keywords
        predefined_needles_bak = get_unique_entries(
            needle_file_path,
            len(depths),
            language,
            unique_arg1=True,
            unique_arg2=True,
            unique_combination=True,
        )

        def _generate_context(tokens_context, depths, needles):
            insertion_points = [
                int(len(tokens_context) * (depth / 100)) for depth in depths
            ]

            cumulative_inserted_length = 0

            for i, needle in enumerate(needles):
                needle_tokens = _get_tokens_from_context(needle)
                current_insertion_point = min(
                    insertion_points[i] + cumulative_inserted_length,
                    len(tokens_context),
                )

                tokens_context = (tokens_context[:current_insertion_point] +
                                  needle_tokens +
                                  tokens_context[current_insertion_point:])
                cumulative_inserted_length += len(needle_tokens)

            new_context = _decode_tokens(tokens_context)
            return new_context

        def _get_tokens_from_context(context):
            if isinstance(context, list):
                return [tokenizer.encode(item) for item in context]
            else:
                return tokenizer.encode(context)

        def _decode_tokens(tokens):
            return tokenizer.decode(tokens)

        def _generate_prompt(context, retrieval_question):
            if language == 'Chinese':
                if quesiton_position == 'End':
                    prompt = f'''这是一个长文本能力的测试,你需要首先阅读下面的长文档,然后根据文档中的信息,依次回答最后的问题。
长文档的内容如下

<文档>
{context}
</文档>

根据文档中的信息,现在请问:{retrieval_question}
'''
                elif quesiton_position == 'Start':
                    prompt = f'''这是一个长文本能力的测试,你需要首先阅读下面的问题,然后根据最后长文档中的信息,依次回答下面的问题。
现在请问:{retrieval_question}

长文档的内容如下

<文档>
{context}
</文档>

'''
                else:
                    raise ValueError(
                        f'Unsupported quesiton_position {quesiton_position}. '
                        'Position must be "End" or "Start".')
            elif language == 'English':
                if quesiton_position == 'End':
                    prompt = f'''This is a test of long-text capability. You need to first read the long document below, and then answer the final questions one by one based on the information in the document.
The content of the long document is as follows

<Document>
{context}
</Document>

Based on the information in the document, now please answer: {retrieval_question}
'''
                elif quesiton_position == 'Start':
                    prompt = f'''This is a test of long-text capability. You need to first read the questions below, and then answer them one by one based on the information in the long document that follows.
Now please answer: {retrieval_question}

The content of the long document is as follows

<Document>
{context}
</Document>

'''
                else:
                    raise ValueError(
                        f'Unsupported quesiton_position {quesiton_position}. '
                        'Position must be "End" or "Start".')
            else:
                raise ValueError(f"Language '{language}' is not supported.")

            return prompt

        for file_name in file_names:
            file_path = os.path.join(path, file_name)
            if file_name not in file_list:
                continue

            with open(file_path, 'r', encoding='utf-8') as f:
                lines_bak = [json.loads(line.strip()) for line in f]
            lines = lines_bak.copy()
            for counter in range(num_repeats_per_file):
                random.seed(counter)
                random.shuffle(lines)
                predefined_needles = predefined_needles_bak.copy()
                random.seed(counter)
                random.shuffle(predefined_needles)

                needles = [
                    '\n' + item['needle'] + '\n' for item in predefined_needles
                ]
                keywords = [item['arg2'] for item in predefined_needles]
                if language == 'Chinese':
                    questions = '、'.join([
                        item['retrieval_question'].split('?')[0] + '?'
                        for item in predefined_needles
                    ])

                    answers_format = '、'.join([
                        item['retrieval_question'].split("'")[1].split('。')[0]
                        for item in predefined_needles
                    ])
                    retrieval_question = (questions + "请按照'" + answers_format +
                                          "'的格式回答。")
                elif language == 'English':
                    questions = '、'.join([
                        item['retrieval_question'].split('?')[0] + '?'
                        for item in predefined_needles
                    ])

                    answers_format = '、'.join([
                        item['retrieval_question'].split("'")[1].split('.')[0]
                        for item in predefined_needles
                    ])
                    retrieval_question = (questions +
                                          "Please answer in the format of '" +
                                          answers_format + "'")

                context_length = length - length_buffer
                target_length_per_record = context_length - sum(
                    len(tokens)
                    for tokens in _get_tokens_from_context(needles))
                target_length_per_record = max(target_length_per_record, 0)
                accumulated_tokens = []
                for line in lines:
                    tokens_current_line = _get_tokens_from_context(
                        line['text'])
                    accumulated_tokens.extend(tokens_current_line)

                    if len(accumulated_tokens) >= target_length_per_record:
                        break

                processed_text = _generate_context(
                    accumulated_tokens[:target_length_per_record], depths,
                    needles)

                processed_prompt = _generate_prompt(processed_text,
                                                    retrieval_question)

                data['prompt'].append(processed_prompt)

                data['answer'].append('*'.join(keywords) + '#' +
                                      '*'.join(map(str, depths)))

        dataset = Dataset.from_dict({
            'prompt': data['prompt'],
            'answer': data['answer'],
        })
        return dataset


class NeedleBenchParallelEvaluator(BaseEvaluator):

    def score(self, predictions, gold):
        if len(predictions) != len(gold):
            return {'error': 'predictions and gold have different lengths'}
        print('predictions:', predictions)
        print('gold:', gold)

        details = []
        depths = [int(i) for i in gold[0].split('#')[1].split('*')]
        scores_by_depth = {depth: 0 for depth in depths}

        for prediction, reference in zip(predictions, gold):
            print(reference)
            keywords = reference.split('#')[0].split('*')
            print(keywords)
            for keyword, depth in zip(keywords, depths):
                print('iterating:', keyword, depth)
                if keyword in prediction:
                    print(f'{keyword} at depth {depth} is in {prediction}')
                    scores_by_depth[depth] += 100 / (len(predictions))

        average_score = sum(scores_by_depth.values()) / len(scores_by_depth)

        flattened_scores = {
            'Depth' + str(depth): score
            for depth, score in scores_by_depth.items()
        }

        result = {
            **flattened_scores,
            'details': details,
            'average_score': average_score,
        }
        return result
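The gold string that ties this loader to its evaluator uses a compact `keywords#depths` encoding; here is a minimal round-trip sketch with invented values:

```python
# Loader side: keywords joined by '*', then '#', then the depths joined by '*'.
keywords = ['apple', 'tower', 'river']
depths = [10, 50, 90]
gold = '*'.join(keywords) + '#' + '*'.join(map(str, depths))
assert gold == 'apple*tower*river#10*50*90'

# Evaluator side: split the encoding apart again and credit each keyword
# found in the prediction at its associated depth.
parsed_keywords = gold.split('#')[0].split('*')
parsed_depths = [int(d) for d in gold.split('#')[1].split('*')]
assert parsed_keywords == keywords and parsed_depths == depths
```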
@@ -61,15 +61,28 @@ model_name_mapping = {
    'qwen1.5-4b-chat-hf': 'Qwen-1.5-4B',
    'qwen1.5-14b-chat-hf': 'Qwen-1.5-14B',
    'qwen1.5-72b-chat-hf': 'Qwen-1.5-72B',
    'qwen1.5-1.8b-chat-vllm': 'Qwen-1.5-1.8B',
    'qwen1.5-14b-chat-vllm': 'Qwen-1.5-14B-vLLM',
    'qwen1.5-72b-chat-vllm': 'Qwen-1.5-72B-vLLM',
    'glm4_notools': 'GLM-4',
    'claude-3-opus': 'Claude-3-Opus',
    'glm-4-9b-chat-1m-vllm': 'GLM4-9B-Chat-1M',
    'internlm2_5-7b-chat-1m-turbomind': 'InternLM2.5-7B-Chat-1M',
    'internlm3-8b-instruct-turbomind': 'InternLM3-8B-Instruct',
    'llama-3.1-8b-instruct-vllm': 'LLaMA-3.1-8B',
    'qwen2.5-1.5b-instruct-vllm': 'Qwen-2.5-1.5B',
    'qwen2.5-7b-instruct-vllm': 'Qwen-2.5-7B',
    'qwen2.5-14b-instruct-vllm': 'Qwen-2.5-14B',
    'qwen2.5-32b-instruct-vllm': 'Qwen-2.5-32B',
    'qwen2_5-72b-instruct-vllm': 'Qwen-2.5-72B',
    'gemma-3-4b-it-vllm': 'Gemma-3-4B',
    'gemma-3-12b-it-vllm': 'Gemma-3-12B',
    'gemma-3-27b-it-vllm': 'Gemma-3-27B',
    'glm-4-9b-chat-vllm': 'GLM4-9B-Chat',
    'llama-3.1-70b-instruct-vllm': 'LLaMA-3.1-70B',
    # Add more mappings as necessary
}

dataset_mapping_dict = {}

needle_counts = ['2', '3', '4', '5']
@@ -95,14 +108,19 @@ for t in types:
            dataset_mapping_dict[key] = value


-def calculate_elementwise_average(model_name, merged_df):
+def calculate_elementwise_average(model_name, merged_df, mean=False):
    score_columns = [col for col in merged_df.columns if col != 'dataset']

    origin_columns = [col for col in score_columns if 'origin' in col]
    parallel_columns = [col for col in score_columns if 'parallel' in col]
    multi_columns = [col for col in score_columns if 'needle' in col]

-    if origin_columns and parallel_columns and multi_columns:
+    if origin_columns and parallel_columns and multi_columns and mean:
        origin_avg = merged_df[origin_columns].mean(axis=1)
        parallel_avg = merged_df[parallel_columns].mean(axis=1)
        multi_avg = merged_df[multi_columns].mean(axis=1)
        merged_df[model_name] = (origin_avg + parallel_avg + multi_avg) / 3
    elif origin_columns and parallel_columns and multi_columns and not mean:
        origin_avg = merged_df[origin_columns].mean(axis=1) * 0.4
        parallel_avg = merged_df[parallel_columns].mean(axis=1) * 0.3
        multi_avg = merged_df[multi_columns].mean(axis=1) * 0.3
@@ -185,7 +203,7 @@ def remove_empty_subfolders(plot_path):
        if not os.listdir(folder_path):
            shutil.rmtree(folder_path)

-def save_results_to_plots(txt_results_save_path):
+def save_results_to_plots(txt_results_save_path, mean=False):
    content = read_after_specific_line_except_last(txt_results_save_path, 'raw format', 2)
    parsed_data = parse_model_scores(content)
    model_names = get_dict_model_names(parsed_data)
@@ -228,25 +246,25 @@ def save_results_to_plots(txt_results_save_path):
        overall_dataset_abbrs = multi_dataset_abbrs + origin_dataset_abbrs + parallel_dataset_abbrs
        overall_score_pic_path = os.path.join(plot_path, f'{model_name}_overall.png')
        merged_df = merge_dataframes(model_name, overall_dataset_abbrs, parsed_data)
-        averaged_df = calculate_elementwise_average(model_name, merged_df)
+        averaged_df = calculate_elementwise_average(model_name, merged_df, mean=mean)
        overall_score = visualize(averaged_df, overall_score_pic_path, model_name, 'Overall Score')

        # Single-Retrieval
        single_retrieval_score_pic_path = os.path.join(plot_path, f'{model_name}_single_retrieval_overall.png')
        single_retrieval_merged_df = merge_dataframes(model_name, origin_dataset_abbrs, parsed_data)
-        single_retrieval_averaged_df = calculate_elementwise_average(model_name, single_retrieval_merged_df)
+        single_retrieval_averaged_df = calculate_elementwise_average(model_name, single_retrieval_merged_df, mean=mean)
        single_retrieval_overall_score = visualize(single_retrieval_averaged_df, single_retrieval_score_pic_path, model_name, 'Single-Retrieval Overall Score')

        # Multi-Retrieval
        multi_retrieval_score_pic_path = os.path.join(plot_path, f'{model_name}_multi_retrieval_overall.png')
        multi_retrieval_merged_df = merge_dataframes(model_name, parallel_dataset_abbrs, parsed_data)
-        multi_retrieval_averaged_df = calculate_elementwise_average(model_name, multi_retrieval_merged_df)
+        multi_retrieval_averaged_df = calculate_elementwise_average(model_name, multi_retrieval_merged_df, mean=mean)
        multi_retrieval_overall_score = visualize(multi_retrieval_averaged_df, multi_retrieval_score_pic_path, model_name, 'Multi-Retrieval Overall Score')

        # Multi-Reasoning
        multi_reasoning_score_pic_path = os.path.join(plot_path, f'{model_name}_multi_reasoning_overall.png')
        multi_reasoning_merged_df = merge_dataframes(model_name, multi_dataset_abbrs, parsed_data)
-        multi_reasoning_averaged_df = calculate_elementwise_average(model_name, multi_reasoning_merged_df)
+        multi_reasoning_averaged_df = calculate_elementwise_average(model_name, multi_reasoning_merged_df, mean=mean)
        multi_reasoning_overall_score = visualize(multi_reasoning_averaged_df, multi_reasoning_score_pic_path, model_name, 'Multi-Reasoning Overall Score')

        model_scores[model_name] = averaged_df
@@ -279,7 +297,7 @@ def visualize(df_raw, save_path: str,model_name: str ,dataset_type:str):

    mean_scores = pivot_table.mean().values
    overall_score = mean_scores.mean()
-    plt.figure(figsize=(10, 6))
+    plt.figure(figsize=(7.5, 4.5))
    ax = plt.gca()
    cmap = LinearSegmentedColormap.from_list(
        'custom_cmap', ['#F0496E', '#EBB839', '#0CD79F'])
@@ -541,6 +559,42 @@ class NeedleBenchSummarizer(DefaultSummarizer):
            # plot to show visualize results
            save_results_to_plots(output_path)

class NeedleBenchSummarizerV2(NeedleBenchSummarizer):
    """NeedleBench summarizer V2 in OpenCompass.

    This version calls save_results_to_plots with mean=True.

    Args:
        config (ConfigDict): The configuration object of the evaluation task. It's expected to be filled out at runtime.
        dataset_abbrs (list[str], optional): Dataset abbreviations to be listed in the summary.
        summary_groups (list): The dataset groups whose results need to be averaged out. For example, mmlu. Each item is a dict with
            'name' (str) and 'subsets' (list of dataset abbrs), and optionally
            'weights' if weighted average is needed.
        prompt_db: A deprecated field.
    """

    def summarize(
            self,
            output_path: str = None,
            time_str: str = datetime.now().strftime('%Y%m%d_%H%M%S')):  # noqa

        raw_results, parsed_results, dataset_metrics, dataset_eval_mode = self._pick_up_results()
        raw_results, parsed_results, dataset_metrics, dataset_eval_mode = \
            self._calculate_group_metrics(raw_results, parsed_results, dataset_metrics, dataset_eval_mode)
        table = self._format_table(parsed_results, dataset_metrics, dataset_eval_mode)
        raw_txts = self._format_raw_txt(raw_results)
        print(tabulate.tabulate(table, headers='firstrow'))
        self._output_to_file(output_path, time_str, table, raw_txts)
        if self.lark_reporter:
            content = f'{getpass.getuser()} 的'
            content += f'详细评测汇总已输出至 {osp.abspath(output_path)}'
            self.lark_reporter.post(content)

        if output_path is None:
            output_path = osp.join(self.work_dir, 'summary', f'summary_{time_str}.txt')
        # plot to show visualize results
        save_results_to_plots(output_path, mean=True)
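# Illustrative note (not part of the original file): the two aggregation modes
# differ only in weighting. With per-family averages origin=80, parallel=60,
# multi=70 (invented values):
#     mean=False (default): 80 * 0.4 + 60 * 0.3 + 70 * 0.3 = 71.0  (weighted)
#     mean=True (this V2 summarizer): (80 + 60 + 70) / 3 = 70.0    (plain mean)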


class NeedleBenchATCSummarizer(DefaultSummarizer):
    """NeedleBench-ATC summarizer in OpenCompass.