[Doc] Update contamination docs (#698)

* update contamination docs

* add citation

* Update contamination_eval.md

* Update contamination_eval.md

---------

Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com>
Fengzhe Zhou 2023-12-13 18:03:39 +08:00 committed by GitHub
parent a94598d921
commit cadab9474f
2 changed files with 210 additions and 70 deletions


@@ -1,56 +1,124 @@
# Data Contamination Assessment

**Data contamination** refers to the phenomenon where data intended for downstream testing tasks appears in the training data of large language models (LLMs), so that performance on those downstream tasks (such as summarization, natural language inference, text classification) is artificially inflated and no longer reflects the model's true generalization ability.

Since the source of data contamination is the training data used by the LLM, the most direct way to detect it is to collide the test data with the training data and report how much of the two overlaps; the classic GPT-3 [paper](https://arxiv.org/pdf/2005.14165.pdf) reports exactly this in Table C.1.

However, today's open-source community usually releases only model weights, not the training dataset. In that setting, how to determine whether data contamination exists, and how severe it is, remains an open problem. OpenCompass offers two possible approaches.
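As a concrete illustration of the collision idea, a minimal n-gram overlap check between a test sample and a training corpus could look like the sketch below. This is illustrative only and not an OpenCompass feature; whitespace tokenization and the 13-gram default follow the spirit of the GPT-3 analysis, and the data in the example is made up:

```python
# Illustrative sketch of n-gram collision between test and training data.
# Not part of OpenCompass; tokenization and data here are placeholders.
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lower-cased word n-grams of a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(test_sample: str, train_docs: Iterable[str], n: int = 13) -> bool:
    """Flag a test sample whose n-grams collide with any training document."""
    test_grams = ngrams(test_sample, n)
    return bool(test_grams) and any(test_grams & ngrams(doc, n) for doc in train_docs)


# Toy usage with a smaller n so the overlap is visible at a glance.
train_docs = ["the quick brown fox jumps over the lazy dog " * 3]
test_set = ["the quick brown fox jumps over the lazy dog and then runs away"]
ratio = sum(is_contaminated(s, train_docs, n=8) for s in test_set) / len(test_set)
print(f"contaminated fraction of the test set: {ratio:.0%}")
```

In practice the training side is web-scale, so real collision checks rely on search indexes or hashing rather than a nested loop, but the decision being made is the same.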
## Contamination Data Annotation Based on Self-Built Co-Distribution Data

Following the method described in Section 5.2 of [Skywork](https://arxiv.org/pdf/2310.19341.pdf), we directly use the dataset [mock_gsm8k_test](https://huggingface.co/datasets/Skywork/mock_gsm8k_test) that Skywork uploaded to HuggingFace.

In this method, the authors used GPT-4 to synthesize data in the style of the original GSM8K, then computed the model's perplexity on the GSM8K training set (train), the GSM8K test set (test), and the GSM8K reference set (ref). Since the reference set is newly generated, the authors regard it as clean, i.e., not part of any model's training data. They posit:

- If the test set's perplexity is significantly lower than the reference set's, the test set may have appeared in the model's training data;
- If the training set's perplexity is significantly lower than the test set's, the training set may have been overfitted by the model.
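Both rules reduce to computing an average perplexity per split and comparing the numbers. The sketch below shows the idea with plain HuggingFace Transformers; it is not the OpenCompass implementation, the `gpt2` checkpoint is only a stand-in for the model under review, and the sample lists are placeholders:

```python
# Standalone sketch of the perplexity comparison behind this method.
# Not the OpenCompass implementation; model and data below are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; substitute the model under review
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()


@torch.no_grad()
def avg_ppl(samples):
    """exp of the mean per-token negative log-likelihood over the samples."""
    losses = []
    for text in samples:
        ids = tokenizer(text, return_tensors="pt").input_ids
        losses.append(model(ids, labels=ids).loss.item())  # mean token NLL
    return float(torch.exp(torch.tensor(losses).mean()))


# Placeholders for the GSM8K train/test splits and the synthesized reference
# set (e.g. Skywork/mock_gsm8k_test); in practice these hold full samples.
train = ["Natalia sold clips to 48 of her friends ..."]
test = ["Janet's ducks lay 16 eggs per day ..."]
ref = ["A bakery sells 24 muffins each morning ..."]

ppl_train, ppl_test, ppl_ref = avg_ppl(train), avg_ppl(test), avg_ppl(ref)
print(ppl_train, ppl_test, ppl_ref)

if ppl_test < 0.9 * ppl_ref:   # the margin here is arbitrary, for illustration
    print("test set may have leaked into the model's training data")
if ppl_train < 0.9 * ppl_test:
    print("training set may have been overfitted")
```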
The following configuration file can be referenced:

```python
from mmengine.config import read_base

with read_base():
    from .datasets.gsm8k_contamination.gsm8k_contamination_ppl_ecdd22 import gsm8k_datasets  # includes training, test, and reference sets
    from .models.qwen.hf_qwen_7b import models as hf_qwen_7b_model  # model under review
    from .models.yi.hf_yi_6b import models as hf_yi_6b_model

datasets = [*gsm8k_datasets]
models = [*hf_qwen_7b_model, *hf_yi_6b_model]
```
An example output is as follows:

```text
dataset version metric mode internlm-7b-hf qwen-7b-hf yi-6b-hf chatglm3-6b-base-hf qwen-14b-hf baichuan2-13b-base-hf internlm-20b-hf aquila2-34b-hf ...
--------------- --------- ----------- ------- ---------------- ------------ ---------- --------------------- ------------- ----------------------- ----------------- ---------------- ...
gsm8k-train-ppl 0b8e46 average_ppl unknown 1.5 0.78 1.37 1.16 0.5 0.76 1.41 0.78 ...
gsm8k-test-ppl 0b8e46 average_ppl unknown 1.56 1.33 1.42 1.3 1.15 1.13 1.52 1.16 ...
gsm8k-ref-ppl f729ba average_ppl unknown 1.55 1.2 1.43 1.35 1.27 1.19 1.47 1.35 ...
```
Currently, this solution only supports the GSM8K dataset. We welcome the community to contribute more datasets.

Consider citing the following papers if you find this method helpful:

```bibtex
@misc{2023opencompass,
title={OpenCompass: A Universal Evaluation Platform for Foundation Models},
author={OpenCompass Contributors},
howpublished = {\url{https://github.com/open-compass/opencompass}},
year={2023}
}
@misc{wei2023skywork,
title={Skywork: A More Open Bilingual Foundation Model},
author={Tianwen Wei and Liang Zhao and Lichang Zhang and Bo Zhu and Lijie Wang and Haihua Yang and Biye Li and Cheng Cheng and Weiwei Lü and Rui Hu and Chenxia Li and Liu Yang and Xilin Luo and Xuejie Wu and Lunan Liu and Wenjun Cheng and Peng Cheng and Jianhao Zhang and Xiaoyu Zhang and Lei Lin and Xiaokun Wang and Yutuan Ma and Chuanhai Dong and Yanqi Sun and Yifu Chen and Yongyi Peng and Xiaojuan Liang and Shuicheng Yan and Han Fang and Yahui Zhou},
year={2023},
eprint={2310.19341},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
## Contamination Data Annotation Based on Classic Pre-trained Sets

Thanks to [Contamination_Detector](https://github.com/liyucheng09/Contamination_Detector) and @liyucheng09 for providing this method.

In this method, the authors search for the test samples of each dataset (such as C-Eval, ARC, HellaSwag, etc.) in the Common Crawl database and via the Bing search engine, then mark each test sample as clean, input contaminated (only the question appears in a matched page), or input-and-label contaminated (both the question and the answer appear in a matched page).

During evaluation, OpenCompass reports the accuracy or perplexity of C-Eval on the subsets formed by these three labels. Generally, accuracy increases from the clean subset, to the input-contaminated subset, to the input-and-label-contaminated subset. The authors posit:

- If the performance on the three subsets is relatively close, the contamination level of the model on that test set is light; otherwise, it is heavy.

The following configuration file can be referenced ([link](https://github.com/open-compass/opencompass/blob/main/configs/eval_contamination.py)):
```python
from mmengine.config import read_base

with read_base():
    from .datasets.ceval.ceval_clean_ppl import ceval_datasets  # ceval dataset with contamination tags
    from .models.yi.hf_yi_6b import models as hf_yi_6b_model  # model under review
    from .models.qwen.hf_qwen_7b import models as hf_qwen_7b_model
    from .summarizers.contamination import ceval_summarizer as summarizer  # output formatting

datasets = [*ceval_datasets]
models = [*hf_yi_6b_model, *hf_qwen_7b_model]
```
An example output is as follows:
```text
dataset version mode yi-6b-hf - - qwen-7b-hf - - ...
---------------------------------------------- --------- ------ ---------------- ----------------------------- --------------------------------------- ---------------- ----------------------------- --------------------------------------- ...
- - - accuracy - clean accuracy - input contaminated accuracy - input-and-label contaminated accuracy - clean accuracy - input contaminated accuracy - input-and-label contaminated ...
...
ceval-humanities - ppl 74.42 75.00 82.14 67.44 50.00 70.54 ...
ceval-stem - ppl 53.70 57.14 85.61 47.41 52.38 67.63 ...
ceval-social-science - ppl 81.60 84.62 83.09 76.00 61.54 72.79 ...
ceval-other - ppl 72.31 73.91 75.00 58.46 39.13 61.88 ...
ceval-hard - ppl 44.35 37.50 70.00 41.13 25.00 30.00 ...
ceval - ppl 67.32 71.01 81.17 58.97 49.28 67.82 ...
```
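The per-subset columns above are simply the same accuracy metric computed after grouping samples by their contamination label. A minimal sketch of that grouping is shown below; it is not the actual `ceval_summarizer`, and the result records are made up:

```python
# Minimal sketch of per-contamination-label accuracy. Not the OpenCompass
# summarizer; the (is_correct, label) records below are made up.
from collections import defaultdict

results = [
    (True, "clean"),
    (False, "clean"),
    (True, "input contaminated"),
    (False, "input contaminated"),
    (True, "input-and-label contaminated"),
    (True, "input-and-label contaminated"),
]

by_label = defaultdict(list)
for is_correct, label in results:
    by_label[label].append(is_correct)

for label, outcomes in by_label.items():
    acc = 100.0 * sum(outcomes) / len(outcomes)
    print(f"accuracy - {label}: {acc:.2f}  (n={len(outcomes)})")
```

A large gap between the clean subset and the input-and-label contaminated subset is what this method treats as a sign of heavier contamination.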
Currently, this solution only supports the C-Eval dataset. [Contamination_Detector](https://github.com/liyucheng09/Contamination_Detector) also includes ARC, CSQA, HellaSwag, MMLU, and WinoGrande, but these have not yet been implemented in OpenCompass. We welcome the community to contribute more datasets.
Consider citing the following papers if you find this method helpful:
```bibtex
@misc{2023opencompass,
title={OpenCompass: A Universal Evaluation Platform for Foundation Models},
author={OpenCompass Contributors},
howpublished = {\url{https://github.com/open-compass/opencompass}},
year={2023}
}
@article{Li2023AnOS,
title={An Open Source Data Contamination Report for Llama Series Models},
author={Yucheng Li},
journal={ArXiv},
year={2023},
volume={abs/2310.17589},
url={https://api.semanticscholar.org/CorpusID:264490711}
}
```


@@ -1,50 +1,122 @@
# Data Contamination Assessment

**Data contamination** refers to the phenomenon where data intended for downstream testing tasks appears in the training data of large language models (LLMs), so that metrics on downstream tasks (e.g., summarization, natural language inference, text classification) are inflated and no longer reflect the model's true generalization ability.

Since the source of data contamination is the training data used by the LLM, the most direct way to detect it is to collide the test data with the training data and report how much of the corpus overlaps; the classic GPT-3 [paper](https://arxiv.org/pdf/2005.14165.pdf) reports this in Table C.1.

However, today's open-source community usually releases only model weights, not the training dataset. In that setting, whether data contamination exists and how severe it is remain questions without a widely accepted answer. OpenCompass offers two possible approaches.
## Contamination Data Annotation Based on Self-Built Co-Distribution Data

Following the method described in Section 5.2 of [Skywork](https://arxiv.org/pdf/2310.19341.pdf), we directly use the dataset [mock_gsm8k_test](https://huggingface.co/datasets/Skywork/mock_gsm8k_test) that Skywork uploaded to HuggingFace.

In this method, the authors used GPT-4 to synthesize data in the style of the original GSM8K, then computed the model's perplexity on the GSM8K training set (train), the GSM8K test set (test), and the GSM8K reference set (ref). Since the reference set is newly generated, the authors regard it as clean, i.e., not part of any model's training data. They posit:

- If the test set's perplexity is significantly lower than the reference set's, the test set may have appeared in the model's training data;
- If the training set's perplexity is significantly lower than the test set's, the training set may have been overfitted by the model.
The following configuration file can be referenced:

```python
from mmengine.config import read_base

with read_base():
    from .datasets.gsm8k_contamination.gsm8k_contamination_ppl_ecdd22 import gsm8k_datasets  # includes training, test, and reference sets
    from .models.qwen.hf_qwen_7b import models as hf_qwen_7b_model  # model under review
    from .models.yi.hf_yi_6b import models as hf_yi_6b_model

datasets = [*gsm8k_datasets]
models = [*hf_qwen_7b_model, *hf_yi_6b_model]
```
An example output is as follows:

```text
dataset version metric mode internlm-7b-hf qwen-7b-hf yi-6b-hf chatglm3-6b-base-hf qwen-14b-hf baichuan2-13b-base-hf internlm-20b-hf aquila2-34b-hf ...
--------------- --------- ----------- ------- ---------------- ------------ ---------- --------------------- ------------- ----------------------- ----------------- ---------------- ...
gsm8k-train-ppl 0b8e46 average_ppl unknown 1.5 0.78 1.37 1.16 0.5 0.76 1.41 0.78 ...
gsm8k-test-ppl 0b8e46 average_ppl unknown 1.56 1.33 1.42 1.3 1.15 1.13 1.52 1.16 ...
gsm8k-ref-ppl f729ba average_ppl unknown 1.55 1.2 1.43 1.35 1.27 1.19 1.47 1.35 ...
```
Currently, this solution only supports the GSM8K dataset. We welcome the community to contribute more datasets.

If you use this method, please cite:

```bibtex
@misc{2023opencompass,
title={OpenCompass: A Universal Evaluation Platform for Foundation Models},
author={OpenCompass Contributors},
howpublished = {\url{https://github.com/open-compass/opencompass}},
year={2023}
}
@misc{wei2023skywork,
title={Skywork: A More Open Bilingual Foundation Model},
author={Tianwen Wei and Liang Zhao and Lichang Zhang and Bo Zhu and Lijie Wang and Haihua Yang and Biye Li and Cheng Cheng and Weiwei Lü and Rui Hu and Chenxia Li and Liu Yang and Xilin Luo and Xuejie Wu and Lunan Liu and Wenjun Cheng and Peng Cheng and Jianhao Zhang and Xiaoyu Zhang and Lei Lin and Xiaokun Wang and Yutuan Ma and Chuanhai Dong and Yanqi Sun and Yifu Chen and Yongyi Peng and Xiaojuan Liang and Shuicheng Yan and Han Fang and Yahui Zhou},
year={2023},
eprint={2310.19341},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
## Contamination Data Annotation Based on Classic Pre-trained Sets

Thanks to [Contamination_Detector](https://github.com/liyucheng09/Contamination_Detector) and @liyucheng09 for providing this method.

In this method, the authors search for the test samples of each dataset (such as C-Eval, ARC, HellaSwag, etc.) in the Common Crawl database and via the Bing search engine, then mark each test sample as clean, input contaminated (only the question appears in a matched page), or input-and-label contaminated (both the question and the answer appear in a matched page).

During evaluation, OpenCompass reports the accuracy or perplexity of C-Eval on the subsets formed by these three labels. Generally, accuracy increases from the clean subset, to the input-contaminated subset, to the input-and-label-contaminated subset. The authors posit:

- If the performance on the three subsets is relatively close, the contamination level of the model on that test set is light; otherwise, it is heavy.

The following configuration file can be referenced ([link](https://github.com/open-compass/opencompass/blob/main/configs/eval_contamination.py)):
```python
from mmengine.config import read_base

with read_base():
    from .datasets.ceval.ceval_clean_ppl import ceval_datasets  # ceval dataset with contamination tags
    from .models.yi.hf_yi_6b import models as hf_yi_6b_model  # model under review
    from .models.qwen.hf_qwen_7b import models as hf_qwen_7b_model
    from .summarizers.contamination import ceval_summarizer as summarizer  # output formatting

datasets = [*ceval_datasets]
models = [*hf_yi_6b_model, *hf_qwen_7b_model]
```
An example output is as follows:
```text
dataset version mode yi-6b-hf - - qwen-7b-hf - - ...
---------------------------------------------- --------- ------ ---------------- ----------------------------- --------------------------------------- ---------------- ----------------------------- --------------------------------------- ...
- - - accuracy - clean accuracy - input contaminated accuracy - input-and-label contaminated accuracy - clean accuracy - input contaminated accuracy - input-and-label contaminated ...
...
ceval-humanities - ppl 74.42 75.00 82.14 67.44 50.00 70.54 ...
ceval-stem - ppl 53.70 57.14 85.61 47.41 52.38 67.63 ...
ceval-social-science - ppl 81.60 84.62 83.09 76.00 61.54 72.79 ...
ceval-other - ppl 72.31 73.91 75.00 58.46 39.13 61.88 ...
ceval-hard - ppl 44.35 37.50 70.00 41.13 25.00 30.00 ...
ceval - ppl 67.32 71.01 81.17 58.97 49.28 67.82 ...
```
Currently, this solution only supports the C-Eval dataset. [Contamination_Detector](https://github.com/liyucheng09/Contamination_Detector) also covers ARC, CSQA, HellaSwag, MMLU, and WinoGrande, but these have not yet been implemented in OpenCompass. We welcome the community to contribute more datasets.

If you use this method, please cite:
```bibtex
@misc{2023opencompass,
title={OpenCompass: A Universal Evaluation Platform for Foundation Models},
author={OpenCompass Contributors},
howpublished = {\url{https://github.com/open-compass/opencompass}},
year={2023}
}
@article{Li2023AnOS,
title={An Open Source Data Contamination Report for Llama Series Models},
author={Yucheng Li},
journal={ArXiv},
year={2023},
volume={abs/2310.17589},
url={https://api.semanticscholar.org/CorpusID:264490711}
}
```