The Needle In A Haystack test, inspired by [NeedleInAHaystack](https://github.com/gkamradt/LLMTest_NeedleInAHaystack/blob/main/LLMNeedleHaystackTester.py), is a method to evaluate the long-text information extraction ability of Large Language Models (LLMs). It involves randomly inserting key information at various points in a long text to form a prompt for LLMs. This test assesses the fundamental ability of LLMs to understand long texts by extracting critical information from them.
The `Skywork/ChineseDomainModelingEval` dataset includes high-quality Chinese articles published between September and October 2023, covering multiple domains. These articles ensure a fair and challenging benchmark for testing.
In the latest version, datasets are no longer generated by running scripts; instead, they are defined and loaded dynamically through configuration files. Users specify the dataset parameters in the configuration file according to their needs, which offers greater flexibility and customization.
Here is an example of dataset configuration, showing how to define a dataset in the `configs/datasets/cdme/cdme8k.py` configuration file. This example demonstrates a Chinese dataset configuration with a length of 8000 tokens:
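The file itself also wires in reader, inference, and evaluation settings; the condensed sketch below shows only the core dataset definition. Treat the import path, field names, and values as illustrative of the pattern rather than a verbatim copy of `cdme8k.py`.

```python
from opencompass.datasets.cdme.cdme import CDMEDataset

# Condensed sketch of an 8000-token Chinese Needle In A Haystack dataset.
# The real cdme8k.py also attaches reader_cfg / infer_cfg / eval_cfg blocks,
# which are omitted here for brevity; values below are illustrative.
context_length = 8000                  # target prompt length in tokens
depth_percents = [10, 30, 50, 70, 90]  # how deep in the text the needle is placed (%)

cdme_datasets = [
    dict(
        abbr=f'CDME_Length{context_length}Depth{depth}',
        type=CDMEDataset,
        path='./data/CDME',            # directory holding the source corpus (assumed)
        length=context_length,
        depth=depth,
        language='Chinese',
        needle='\n小明最喜欢的实习地点就是上海人工智能实验室。\n',
        retrieval_question='小明最喜欢的实习地点是哪里？',
    )
    for depth in depth_percents
]
```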
By defining these parameters in the configuration file, you can flexibly create datasets that suit your needs. Configuration files offer a highly customizable and scalable way to manage the generation and use of datasets.
### Multi-Needle Needle In A Haystack Test
The latest version introduces the multi-needle Needle In A Haystack test, allowing multiple different needles (text snippets) to be inserted into the same dataset. These needles are inserted in sequence according to a given depth parameter. Compared to the single-needle test, the multi-needle test provides a more complex data processing scenario.
#### Multi-Needle Dataset Configuration Example
Here is an example of configuring a multi-needle dataset, showing how to define a multi-needle dataset in the `configs/datasets/cdme/multi_needle/cdme8k_cot3_italy.py` configuration file. This example demonstrates a dataset configuration with three needles:
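The sketch below mirrors the single-needle example and highlights the three parameters that are new in the multi-needle setting (`needles`, `diff`, `keyword`), described right after it. The loader class, import path, and the Italy-themed needle texts are assumptions made for illustration, not a verbatim copy of `cdme8k_cot3_italy.py`.

```python
# The loader class and import path are assumptions based on the
# opencompass/datasets/cdme/cdme_multi.py module referenced below.
from opencompass.datasets.cdme.cdme_multi import CDMEDataset

context_length = 8000
depth_percents = [10, 30, 50, 70, 90]

cdme_multi_datasets = [
    dict(
        abbr=f'CDME_Length{context_length}Depth{depth}_3needle',
        type=CDMEDataset,
        path='./data/CDME',
        length=context_length,
        depth=depth,                   # depth of the first needle
        diff=10,                       # depth increment applied to each subsequent needle
        language='Chinese',
        needles=[                      # the three needles, inserted in sequence
            '\n意大利的佛罗伦萨被称为“文艺复兴的摇篮”。\n',
            '\n意大利的威尼斯以运河和贡多拉闻名。\n',
            '\n意大利的罗马有着两千多年的历史。\n',
        ],
        retrieval_question='关于意大利，文中插入了哪些信息？',
        keyword='佛罗伦萨',             # keyword used for score correction during evaluation
    )
    for depth in depth_percents
]
```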
In this configuration, in addition to the standard parameters, the main new parameters include:
- `needles`: A list containing multiple strings, each representing a needle to be inserted.
- `diff`: Defines the depth increment for subsequent needles relative to the first needle.
- `keyword`: A keyword used for score correction during the evaluation process.
#### Change in Scoring Mechanism
In the source code of `opencompass/datasets/cdme/cdme_multi.py`, the scoring mechanism for multi-needle datasets differs from the single-needle case. The following code segment has been added to adjust the score according to the `keyword` found in the predictions:
```python
if keyword in prediction:
    print(f'{keyword} is in {prediction}')
    score = 100
else:
    print(f'{keyword} is not in {prediction}')
    score = 0.2 * score
```
This means that if the keyword is present in the prediction, it is awarded the full score of 100; if not, the score is heavily penalized, retaining only 20% of its original value. This mechanism places more emphasis on keyword accuracy, supplementing the traditional edit-distance-based scoring.
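With the dataset and model configured, the evaluation can be launched through OpenCompass's `run.py` entry point on a Slurm cluster, for example (the configuration file name here is illustrative):

```bash
python run.py configs/eval_needleinahaystack.py \
    --slurm -p partition_name -q auto --max-num-workers 32
```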
This command initiates the evaluation process, where the model attempts to find the specified "needle" in the generated dataset. The parameters `-p partition_name -q auto` and `--max-num-workers 32` specify the Slurm queue and the maximum number of worker processes, respectively.
When evaluating especially long texts (e.g., 200k tokens), conventional methods might lead to memory overload. In such cases, quantized models can be used for evaluation. This can be achieved using the `LMDeploy` tool ([LMDeploy](https://github.com/InternLM/lmdeploy)).
Detailed information about installing and configuring `LMDeploy` can be found on its GitHub page. Once installed, the `TurboMindModel` defined in the `configs/eval_needleinahaystack_turbomind.py` configuration file can be used for evaluation.
Below is an example configuration in the `configs/eval_needleinahaystack_turbomind.py` file:
```python
from opencompass.models.turbomind import TurboMindModel
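# The remainder of this block is a condensed sketch rather than the verbatim
# file contents: the model name, paths, sequence lengths, and batch size below
# are illustrative and should be adapted to the model actually being evaluated.
models = [
    dict(
        type=TurboMindModel,
        abbr='internlm-chat-20b-turbomind',   # display name used in result tables (assumed)
        path='./turbomind',                   # path to the converted TurboMind model (assumed)
        max_out_len=100,
        max_seq_len=201000,                   # large enough to hold a ~200k-token haystack
        batch_size=8,
        run_cfg=dict(num_gpus=1, num_procs=1),
    )
]
```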
In this configuration, `TurboMindModel` integrates the inference backend of `LMDeploy`, making it suitable for handling large-scale text datasets while effectively reducing memory usage.
In the `CDMEEvaluator` class, two main methods are used to calculate scores: `levenshtein_distance` and `score`. Both are explained in detail below.
Levenshtein distance is a measure of the difference between two strings. It represents the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other.
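A standard single-row dynamic-programming implementation looks as follows; the `levenshtein_distance` method in `CDMEEvaluator` follows the same idea, though its exact code may differ in detail.

```python
def levenshtein_distance(s1: str, s2: str) -> int:
    """Minimum number of single-character edits turning s1 into s2."""
    if len(s1) < len(s2):            # make sure s1 is the longer string
        return levenshtein_distance(s2, s1)
    if len(s2) == 0:
        return len(s1)

    previous_row = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1      # insert c2 into s1
            deletions = current_row[j] + 1            # delete c1 from s1
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    return previous_row[-1]
```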
The `score` method accepts two lists, one of predictions and one of references, and computes an edit distance and a score for each prediction-reference pair.
It first removes all whitespace characters from both the prediction and the reference, then calculates the Levenshtein distance between them. Each score is 100 minus a percentage penalty derived from the edit distance. Finally, the method returns the detailed score for each prediction along with the overall average.
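A sketch of this logic, reusing the `levenshtein_distance` function above, is shown below. Normalizing the edit distance by the length of the longer string is an assumption about how the percentage penalty is computed; consult `CDMEEvaluator.score` for the authoritative formula.

```python
import re

def score(predictions, references):
    """Per-sample edit-distance scores plus their average (sketch)."""
    details = []
    for prediction, reference in zip(predictions, references):
        # Ignore whitespace so formatting differences are not penalized.
        prediction = re.sub(r'\s+', '', prediction)
        reference = re.sub(r'\s+', '', reference)

        edit_distance = levenshtein_distance(prediction, reference)
        max_len = max(len(prediction), len(reference))
        # 100 minus the edit distance expressed as a percentage of the longer string (assumed).
        sample_score = 100 * (1 - edit_distance / max_len) if max_len else 100
        details.append({'pred': prediction, 'answer': reference,
                        'edit_distance': edit_distance, 'score': sample_score})

    average = sum(d['score'] for d in details) / len(details) if details else 0
    return {'average_score': average, 'details': details}
```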
The `tools_needleinahaystack.py` script can be used to visualize CSV files. This script supports specifying one or more CSV file paths through the `--path` parameter and can use the `--dataset_length` parameter to specify the length of the dataset.
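A typical invocation might look like the following, assuming the script lives in the repository's `tools/` directory; the CSV path is a placeholder for a summary file produced by a previous run.

```bash
python tools/tools_needleinahaystack.py \
    --path outputs/default/<timestamp>/summary/summary.csv \
    --dataset_length 8000
```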