
Contamination Evaluation Guidance
Data contamination, i.e., the presence of test data from downstream tasks in the pre-training data of LLMs, may inflate the performance observed on many of those tasks (e.g., summarization, natural language inference, text classification).
To evaluate LLMs on potentially contaminated data, we use Contamination Detector to generate contamination labels.
Introduction to Detection Tools
Contamination Detector helps identify and analyze such potential contamination without requiring access to the LLMs' training data. It relies on Internet presence verification, which enables even small teams and individuals to conduct robust evaluation.
Method
- Use the Bing Search API to check whether verbatim test examples appear online, which likely indicates inclusion in Common Crawl.
- Specifically verify whether pages containing verbatim test examples were indexed in the 2017-2020 Common Crawl, by searching only the URLs rather than the full page contents.
Construct queries
For example:
Question: The flaw in Anderson's ACT theory was that some considered it ____.
Choices: A: 'Only applicable to a motor system', B: 'Untestable and thus, of uncertain scientific value', C: 'Lacking in definition for its elements', D: 'Overly complex in explaining the operation of cognition'
Answer: B
Query: The flaw in Anderson's ACT theory was that some considered it untestable and thus, of uncertain scientific value.
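The query construction in the example above can be sketched as follows. This is a minimal illustration, not the tool's actual code: the function name `build_query`, the blank-detection regex, and the fallback for questions without a blank are all assumptions.

```python
import re

# Assumption: a blank in the question is written as two or more underscores.
BLANK = re.compile(r"_{2,}")


def build_query(question: str, choices: dict[str, str], answer: str) -> str:
    """Merge a multiple-choice question and its correct choice into one search query.

    Hypothetical helper: fills the blank with the answer text, or appends the
    answer text when the question has no blank.
    """
    answer_text = choices[answer].strip()
    if BLANK.search(question):
        # Lower-case the first letter so the answer reads naturally mid-sentence.
        filled = answer_text[:1].lower() + answer_text[1:]
        # A function replacement avoids backslash interpretation in the answer text.
        return BLANK.sub(lambda _match: filled, question, count=1)
    return f"{question.strip()} {answer_text}"


choices = {
    "A": "Only applicable to a motor system",
    "B": "Untestable and thus, of uncertain scientific value",
    "C": "Lacking in definition for its elements",
    "D": "Overly complex in explaining the operation of cognition",
}
question = "The flaw in Anderson's ACT theory was that some considered it ____."
query = build_query(question, choices, "B")
print(query)
```

With the inputs above, the printed query matches the one shown in the example.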
Improve Matching
To avoid potential false positives, the method is configured with two key settings:
- an order penalty (gamma of 0.8) for METEOR ensures matches respect sequence;
- matching is constrained to a window up to 2x the query length, preventing partial or out-of-context matches.
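To illustrate how the two settings interact, here is a minimal sketch. It uses a simplified, exact-match-only stand-in for METEOR (no stemming or synonym matching), so it is not the scorer the tool uses. The names `simplified_meteor` and `best_window_score`, and the defaults `alpha=0.9` and `beta=3.0`, are assumptions; only `gamma=0.8` and the 2x window come from the text.

```python
def simplified_meteor(reference: list[str], hypothesis: list[str],
                      alpha: float = 0.9, beta: float = 3.0,
                      gamma: float = 0.8) -> float:
    """Exact-match approximation of METEOR with a fragmentation (order) penalty."""
    used = [False] * len(reference)
    alignment = []  # (hypothesis index, reference index) pairs
    for h_idx, token in enumerate(hypothesis):
        for r_idx, ref_token in enumerate(reference):
            if not used[r_idx] and ref_token == token:
                used[r_idx] = True
                alignment.append((h_idx, r_idx))
                break
    matches = len(alignment)
    if matches == 0:
        return 0.0
    precision = matches / len(hypothesis)
    recall = matches / len(reference)
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    # A chunk is a run of matches that is contiguous in both sequences.
    chunks = 1
    for (h0, r0), (h1, r1) in zip(alignment, alignment[1:]):
        if h1 != h0 + 1 or r1 != r0 + 1:
            chunks += 1
    penalty = gamma * (chunks / matches) ** beta  # out-of-order matches raise this
    return f_mean * (1 - penalty)


def best_window_score(query: str, page: str, gamma: float = 0.8,
                      window_factor: int = 2) -> float:
    """Best score of the query against any page window of up to 2x the query length."""
    q_tokens = query.lower().split()
    p_tokens = page.lower().split()
    if not q_tokens or not p_tokens:
        return 0.0
    width = window_factor * len(q_tokens)
    return max(
        simplified_meteor(q_tokens, p_tokens[start:start + width], gamma=gamma)
        for start in range(len(p_tokens))
    )


query = "alpha beta gamma delta epsilon"
in_order = best_window_score(query, "intro words alpha beta gamma delta epsilon outro")
reversed_order = best_window_score(query, "intro words epsilon delta gamma beta alpha outro")
print(in_order > reversed_order)
```

Both pages contain every query word, but the page with the words in reverse order is split into many chunks, so the order penalty pulls its score far below the in-order page.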
Contamination Type
- input contamination, where only the question appears in the matched pages, not the answer;
- input-and-label contamination, where both the question and the answer appear in the matched pages.
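The two types above can be expressed as a small labeling rule. This is a hedged sketch: the function name `contamination_type` and the `"clean"` label for unmatched examples are assumptions for illustration, not labels confirmed by the tool.

```python
def contamination_type(question_matched: bool, answer_matched: bool) -> str:
    """Map match results for one test example to a contamination label."""
    if question_matched and answer_matched:
        return "input-and-label contamination"
    if question_matched:
        return "input contamination"
    # Assumption: examples whose question is not found online are treated as clean.
    return "clean"


print(contamination_type(question_matched=True, answer_matched=False))
```

The answer alone matching a page is not counted as contamination here, since short answers can appear online without the question.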
Data Preparation
To be completed
Evaluation Configuration
To be completed