[Update] Update method to add dataset in docs (#1827)

* create new branch
* docs new_dataset.md zh
* docs new_dataset.md zh and en
2025-05-30 16:03:24 +08:00 · 2025-01-17 11:07:19 +08:00 · 2025-01-17 11:07:19 +08:00 · 70da9b7776
commit 70da9b7776
parent 531643e771
2 changed files with 75 additions and 1 deletions
--- a/docs/en/advanced_guides/new_dataset.md
+++ b/docs/en/advanced_guides/new_dataset.md
@@ -54,4 +54,40 @@ Although OpenCompass has already included most commonly used datasets, users nee
   ]
   ```
   
+   - To make your dataset easy for other users to obtain, specify where it can be downloaded in the configuration file. First, set the `path` field of the `mydataset_datasets` configuration to a dataset name of your choice; this name is mapped to the actual download path in `opencompass/utils/datasets_info.py`. Here's an example:
+   
+   ```python
+    mmlu_datasets = [
+        dict(
+            ...,
+            path='opencompass/mmlu',
+            ...,
+        )
+   ]
+   ```
+   - Next, add an entry to `opencompass/utils/datasets_info.py` whose key matches the name you provided above. If the dataset is already hosted on HuggingFace or ModelScope, add the entry to the `DATASETS_MAPPING` dictionary and fill in the HuggingFace and ModelScope dataset addresses under the `hf_id` and `ms_id` keys, respectively. You can also specify a default local path under the `local` key. Here's an example:
+   
+   ```python
+   "opencompass/mmlu": {
+        "ms_id": "opencompass/mmlu",
+        "hf_id": "opencompass/mmlu",
+        "local": "./data/mmlu/",
+    }
+   ```
+   
+   - If you want others to be able to fetch the dataset directly from the OpenCompass OSS repository, submit the dataset files along with your Pull Request. We will then upload the dataset to OSS on your behalf and add a new entry to `DATASET_URL`.
+
+   - To keep the data source selectable, extend the `load` method in the dataset script `mydataset.py` so that it switches between download sources according to the environment variable `DATASET_SOURCE`. Note that if `DATASET_SOURCE` is not set, the dataset is downloaded from the OSS repository by default. Here's an example from `opencompass/dataset/cmmlu.py`:
+
+   ```python
+    def load(path: str, name: str, **kwargs):
+        ...
+        if environ.get('DATASET_SOURCE') == 'ModelScope':
+            ...
+        else:
+            ...
+        return dataset
+   ```
+   
+
   For details on dataset configuration files and other required configuration files, see the [Configuration Files](../user_guides/config.md) tutorial. For guides on launching tasks, see the [Quick Start](../get_started/quick_start.md) tutorial.
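
Review note: the name-to-location mapping described in the hunk above can be sketched as follows. This is only an illustration, not OpenCompass's actual lookup code. The helper `resolve_dataset_id` is hypothetical, the `'HF'` value for `DATASET_SOURCE` is an assumption (the diff only shows `'ModelScope'`), and falling back to the `local` path stands in for the default OSS behavior, which is not shown in this commit.

```python
import os

# Mirrors the entry shown in the diff; the real dictionary lives in
# opencompass/utils/datasets_info.py.
DATASETS_MAPPING = {
    "opencompass/mmlu": {
        "ms_id": "opencompass/mmlu",
        "hf_id": "opencompass/mmlu",
        "local": "./data/mmlu/",
    },
}


def resolve_dataset_id(name, source=None):
    """Hypothetical helper: map a config `path` name to a concrete location."""
    if source is None:
        # Read the environment only when no explicit source is given.
        source = os.environ.get("DATASET_SOURCE")
    if name not in DATASETS_MAPPING:
        raise KeyError(f"{name!r} is not registered in DATASETS_MAPPING")
    entry = DATASETS_MAPPING[name]
    if source == "ModelScope":
        return entry["ms_id"]
    if source == "HF":  # assumed value, not confirmed by the diff
        return entry["hf_id"]
    # Assumption: any other setting falls back to the local path.
    return entry["local"]
```

Passing `source` explicitly keeps the lookup easy to test; a config author would normally only set `path='opencompass/mmlu'` and leave source selection to the environment.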
--- a/docs/zh_cn/advanced_guides/new_dataset.md
+++ b/docs/zh_cn/advanced_guides/new_dataset.md
@@ -55,4 +55,42 @@
   ]
   ```
   
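-   For details on dataset configuration files and other required configuration files, see the [Configuration Files](../user_guides/config.md) tutorial; for tutorials on launching tasks, see the [Quick Start](../get_started/quick_start.md) tutorial.
+   - To make a user-provided dataset easier for other users to obtain, the user needs to state in the configuration file where the dataset can be downloaded. Specifically, first fill in a user-chosen dataset name in the `path` field of the `mydataset_datasets` configuration; this name is mapped to the actual download path in `opencompass/utils/datasets_info.py`. An example: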
+   
+   ```python
+    mmlu_datasets = [
+        dict(
+            ...,
+            path='opencompass/mmlu',
+            ...,
+        )
+   ]
+   ```
+   
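+   - Next, create a dictionary entry with the corresponding name in `opencompass/utils/datasets_info.py`. If the dataset is already hosted on HuggingFace or ModelScope, add an entry with that name to the `DATASETS_MAPPING` dictionary and fill in the corresponding ModelScope and HuggingFace dataset addresses under `ms_id` and `hf_id`; you may also specify a default `local` path. An example: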
+   
+   ```python
+   "opencompass/mmlu": {
+        "ms_id": "opencompass/mmlu",
+        "hf_id": "opencompass/mmlu",
+        "local": "./data/mmlu/",
+    }
+   ```
+   
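+   - If you want the provided dataset to be available to other users directly from the official OpenCompass OSS repository, submit the dataset files to us at the Pull Request stage; we will upload the dataset to OSS on your behalf and add a new entry to `DATASET_URL`.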
+
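+   - To keep the data source selectable, complete the `load` method in the dataset script `mydataset.py` according to the types of download path provided for the dataset. Specifically, implement switching between download sources based on the value of the environment variable `DATASET_SOURCE`. Note that if `DATASET_SOURCE` is not set, data is downloaded from the OSS repository by default. An example from `opencompass/dataset/cmmlu.py`: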
+   
+   ```python
+    def load(path: str, name: str, **kwargs):
+        ...
+        if environ.get('DATASET_SOURCE') == 'ModelScope':
+            ...
+        else:
+            ...
+        return dataset
+   ```
+
+   
+
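+  For details on dataset configuration files and other required configuration files, see the [Configuration Files](../user_guides/config.md) tutorial; for tutorials on launching tasks, see the [Quick Start](../get_started/quick_start.md) tutorial.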
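
Review note: the `load` snippet in both hunks elides the branch bodies. A minimal runnable sketch of the same switch is shown below. The two `_load_from_*` helpers are hypothetical placeholders that return a tagged dictionary instead of downloading anything; what a real loader does in each branch is not shown in this commit.

```python
from os import environ


def _load_from_modelscope(path, name):
    # Hypothetical placeholder for a ModelScope download.
    return {"source": "ModelScope", "path": path, "name": name}


def _load_from_default(path, name):
    # Hypothetical placeholder for the default (OSS or local) source.
    return {"source": "default", "path": path, "name": name}


def load(path: str, name: str, **kwargs):
    # Same branching as the snippet in the diff: only 'ModelScope' is
    # special-cased; an unset variable takes the default branch.
    if environ.get("DATASET_SOURCE") == "ModelScope":
        dataset = _load_from_modelscope(path, name)
    else:
        dataset = _load_from_default(path, name)
    return dataset
```

Because `environ.get` returns `None` when the variable is unset, the `else` branch doubles as the default, which matches the behavior the documentation describes.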