diff --git a/configs/summarizers/needlebench.py b/configs/summarizers/needlebench.py index ef6ab813..3dc5b34b 100644 --- a/configs/summarizers/needlebench.py +++ b/configs/summarizers/needlebench.py @@ -107,14 +107,14 @@ def create_summarizer(context_lengths, depths, dataset_size, 'Multi-Needle-Reasoning(M-RS)', 'Multi-Needle-Reasoning-EN', 'Multi-Needle-Reasoning-ZH', - '2-Needle-EN-4K', - '2-Needle-ZH-4K', - '3-Needle-EN-4K', - '3-Needle-ZH-4K', - '4-Needle-EN-4K', - '4-Needle-ZH-4K', - '5-Needle-EN-4K', - '5-Needle-ZH-4K', + f'2-Needle-EN-{dataset_size.upper()}', + f'2-Needle-ZH-{dataset_size.upper()}', + f'3-Needle-EN-{dataset_size.upper()}', + f'3-Needle-ZH-{dataset_size.upper()}', + f'4-Needle-EN-{dataset_size.upper()}', + f'4-Needle-ZH-{dataset_size.upper()}', + f'5-Needle-EN-{dataset_size.upper()}', + f'5-Needle-ZH-{dataset_size.upper()}', ] } return summarizer_config diff --git a/docs/en/advanced_guides/needleinahaystack_eval.md b/docs/en/advanced_guides/needleinahaystack_eval.md index a4211231..b75f6394 100644 --- a/docs/en/advanced_guides/needleinahaystack_eval.md +++ b/docs/en/advanced_guides/needleinahaystack_eval.md @@ -60,28 +60,42 @@ pip install -e . We have pre-configured datasets for common text lengths (4k, 8k, 32k, 128k, 200k, 1000k) in `configs/datasets/needlebench`, allowing you to flexibly create datasets that meet your needs by defining related parameters in the configuration files. -### Example of Evaluation +### Evaluation Example -#### Evaluating using the `InternLM2-7B` model deployed with `LMDeploy` +#### Evaluating `InternLM2-7B` Model Deployed Using `LMDeploy` -For instance, to evaluate all tasks in NeedleBench-4K using the `InternLM2-7B` model deployed with `LMDeploy`, use the following command line command that calls the predefined model and dataset configuration files, without needing to write additional configuration files: +For example, to evaluate the `InternLM2-7B` model deployed using `LMDeploy` for all tasks in NeedleBench-4K, you can directly use the following command in the command line. This command calls the pre-defined model and dataset configuration files without needing to write additional configuration files: + +##### Local Evaluation + +If you are evaluating the model locally, the command below will utilize all available GPUs on your machine. You can limit the GPU access for `OpenCompass` by setting the `CUDA_VISIBLE_DEVICES` environment variable. For instance, using `CUDA_VISIBLE_DEVICES=0,1,2,3 python run.py ...` will only expose the first four GPUs to OpenCompass, ensuring that it does not use more than these four GPUs. ```bash -python run.py --dataset needlebench_4k --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 32 --max-partition-size 8000 +# Local Evaluation +python run.py --dataset needlebench_4k --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer ``` -If you only want to test the original Needle In A Haystack task setup, you can change the dataset parameter to `needlebench_single_4k`, such as: +##### Evaluation on a Slurm Cluster + +If using `Slurm`, you can add parameters such as `--slurm -p partition_name -q reserved --max-num-workers 32 --max-partition-size 8000`, as shown below: ```bash -python run.py --dataset needlebench_single_4k --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --sl - -urm -p partition_name -q reserved --max-num-workers 32 --max-partition-size 8000 +# Slurm Evaluation +python run.py --dataset needlebench_4k --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 32 --max-partition-size 8000 ``` -You can also choose sub-datasets, such as changing the `--datasets` parameter to `needlebench_single_4k/needlebench_zh_datasets` for only testing the Chinese version of the single needle task, where the parameter after `/` represents the sub-dataset. You can find the optional sub-dataset variables in the `configs/datasets/needlebench/needlebench_4k/needlebench_single_4k.py`, such as: +##### Evaluating a Subdataset Only + +If you only want to test the original NeedleInAHaystack task setup, you could change the dataset parameter to `needlebench_single_4k`, which corresponds to the single needle version of the NeedleInAHaystack test at 4k length: ```bash -python run.py --dataset needlebench_single_4k/needlebench_zh_datasets --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 32 --max-partition-size 8000 +python run.py --dataset needlebench_single_4k --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 32 --max-partition-size 8000 +``` + +You can also choose to evaluate a specific subdataset, such as changing the `--datasets` parameter to `needlebench_single_4k/needlebench_zh_datasets` for testing just the Chinese version of the single needle 4K length NeedleInAHaystack task. The parameter after `/` represents the subdataset, which can be found in the dataset variable of `configs/datasets/needlebench/needlebench_4k/needlebench_single_4k.py` : + +```bash +python run.py --dataset needlebench_single_4k/needlebench_zh_datasets --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 32 --max-partition-size 8000 ``` Be sure to install the [LMDeploy](https://github.com/InternLM/lmdeploy) tool before starting the evaluation: diff --git a/docs/zh_cn/advanced_guides/needleinahaystack_eval.md b/docs/zh_cn/advanced_guides/needleinahaystack_eval.md index 86b7fa05..72d21ae9 100644 --- a/docs/zh_cn/advanced_guides/needleinahaystack_eval.md +++ b/docs/zh_cn/advanced_guides/needleinahaystack_eval.md @@ -66,17 +66,33 @@ pip install -e . 例如,使用`LMDeploy`部署的 `InternLM2-7B` 模型进行评估NeedleBench-4K的所有任务,可以在命令行中直接使用以下命令,该命令会调用预定义好的模型、数据集配置文件,而无需额外书写配置文件: +##### 本地评估 + +如果您在本地评估模型,下面命令会调用机器的所有可用GPU。您可以通过设置 `CUDA_VISIBLE_DEVICES` 环境变量来限制 `OpenCompass` 的 GPU 访问。例如,使用 `CUDA_VISIBLE_DEVICES=0,1,2,3 python run.py ...` 只会向 OpenCompass 暴露前四个 GPU,确保它同时使用的 GPU 数量不超过这四个。 + ```bash +# 本地评估 +python run.py --dataset needlebench_4k --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer +``` + +##### 在Slurm集群上评估 + +如果使用 `Slurm`,可以添加 `--slurm -p partition_name -q reserved --max-num-workers 32 --max-partition-size 8000`等参数,例如下面: + +```bash +# Slurm评估 python run.py --dataset needlebench_4k --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 32 --max-partition-size 8000 ``` -如果只想测试原始的大海捞针任务设定,可以更换数据集的参数为`needlebench_single_4k`,如: +##### 只评估子数据集 + +如果只想测试原始的大海捞针任务设定,比如可以更换数据集的参数为`needlebench_single_4k`,这对应于4k长度下的单针版本的大海捞针测试: ```bash python run.py --dataset needlebench_single_4k --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 32 --max-partition-size 8000 ``` -您也可以进一步选择子数据集,如更换数据集`--datasets`的参数为`needlebench_single_4k/needlebench_zh_datasets`,仅仅进行中文版本的单针大海捞针任务测试,其中`/`后面的参数代表子数据集,您可以在`configs/datasets/needlebench/needlebench_4k/needlebench_single_4k.py`中找到可选的子数据集变量,如: +您也可以进一步选择子数据集,如更换数据集`--datasets`的参数为`needlebench_single_4k/needlebench_zh_datasets`,仅仅进行中文版本的单针4K长度下的大海捞针任务测试,其中`/`后面的参数代表子数据集,您可以在`configs/datasets/needlebench/needlebench_4k/needlebench_single_4k.py`中找到可选的子数据集变量,如: ```bash python run.py --dataset needlebench_single_4k/needlebench_zh_datasets --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 32 --max-partition-size 8000