Mirror of https://github.com/open-compass/opencompass.git (synced 2025-05-30 16:03:24 +08:00)
Commit c0d1190e39 (parent 0965baf785): add needlebench intro, fix summarizer

New file: opencompass/datasets/needlebench/readme.md (36 lines)

# Needlebench: A Benchmark for Advanced Language Model Evaluation

English | [简体中文](readme_zh-CN.md)

## Overview

Needlebench is a comprehensive benchmark suite designed to rigorously evaluate the information retrieval and reasoning capabilities of large language models (LLMs). Drawing inspiration from the NeedleInAHaystack experiment, Needlebench broadens the scope to a variety of tasks, each aimed at testing a different facet of an LLM's ability to process, recall, and reason over information embedded in lengthy texts.

### Directory Structure

```
opencompass/datasets/needlebench/
├── atc.py       # Ancestral Trace Challenge
├── multi.py     # Multi-Needles Reasoning
├── origin.py    # Single-Needle Retrieval
├── parallel.py  # Multi-Needles Retrieval
└── readme.md
```

## Task Descriptions

### Single-Needle Retrieval (`origin.py`)

The Single-Needle Retrieval task is foundational to the Needlebench suite, focusing on the LLM's ability to recall a single piece of crucial information from a haystack text of a specific length. This task mirrors the original NeedleInAHaystack test, assessing the model's precision in identifying and recalling specific information within large text bodies.
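
To make the setup concrete, here is a minimal sketch of how such a sample can be assembled: one "needle" sentence is spliced into a truncated haystack at a chosen relative depth, and the model is asked to recall it. The function and field names below are illustrative only and are not the actual API of `origin.py`.

```python
def build_single_needle_sample(haystack: str, needle: str,
                               depth_ratio: float, context_len: int) -> dict:
    """Splice `needle` into a truncated haystack at a relative depth."""
    context = haystack[:context_len]
    insert_at = int(len(context) * depth_ratio)  # 0.0 = start, 1.0 = end
    body = context[:insert_at] + " " + needle + " " + context[insert_at:]
    question = "What is the special piece of information hidden in the text above?"
    return {"prompt": body + "\n\n" + question, "reference": needle}


sample = build_single_needle_sample(
    haystack="Lorem ipsum dolor sit amet. " * 2000,
    needle="The secret passcode is 7216.",
    depth_ratio=0.5,      # bury the needle in the middle of the context
    context_len=8000,
)
```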

### Multi-Needles Retrieval (`parallel.py`)

The Multi-Needles Retrieval task challenges the LLM's ability to identify and extract multiple key pieces of information from extensive texts. It simulates real-world scenarios in which several data points, facts, or figures must be retrieved from documents or reports, evaluating how efficiently the model navigates dense text and extracts the relevant items.
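
The parallel variant can be pictured as the same construction with several needles placed at evenly spaced depths, all of which must be returned. Again, this is a sketch with hypothetical names, not the code in `parallel.py`:

```python
def build_multi_needle_sample(haystack: str, needles: list[str], context_len: int) -> dict:
    """Distribute several needles at evenly spaced depths in the context."""
    context = haystack[:context_len]
    step = len(context) // (len(needles) + 1)
    pieces = []
    for i, needle in enumerate(needles, start=1):
        pieces.append(context[(i - 1) * step:i * step])
        pieces.append(needle)
    pieces.append(context[len(needles) * step:])
    question = "List every special fact hidden in the text above."
    return {"prompt": " ".join(pieces) + "\n\n" + question, "references": needles}
```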

### Multi-Needles Reasoning (`multi.py`)

Building on the retrieval tasks, the Multi-Needles Reasoning task emphasizes the LLM's capacity for complex reasoning with the retrieved information. The model must not only recall multiple pieces of information but also reason logically over them, synthesizing answers that reflect an understanding of how the different pieces of information relate to one another.

### Ancestral Trace Challenge (ATC) (`atc.py`)

The Ancestral Trace Challenge is the most complex task in the Needlebench suite, requiring models to recall and analyze every detail in long texts to solve problems that demand an understanding of complex relationships, such as genealogical inquiries or detailed case analysis. This task highlights the need for models to process and reason over information at a fine-grained level, mirroring the demands of sophisticated real-world analytical tasks.
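
As a toy illustration of the kind of chained-relationship question this task targets (the actual prompt construction in `atc.py` may differ), one can generate a shuffled chain of kinship statements and ask for the person at the top of the chain:

```python
import random


def build_kinship_chain(names: list[str]) -> dict:
    """Toy generator for a chained-kinship question; all names are hypothetical."""
    statements = [f"{parent} is the parent of {child}."
                  for parent, child in zip(names, names[1:])]
    random.shuffle(statements)  # the model has to reassemble the chain itself
    question = (f"Based only on the statements above, "
                f"who is the eldest ancestor of {names[-1]}?")
    return {"prompt": "\n".join(statements) + "\n\n" + question,
            "reference": names[0]}


sample = build_kinship_chain(["Avery", "Blake", "Casey", "Devon", "Emery"])
```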

New file: opencompass/datasets/needlebench/readme_zh-CN.md (36 lines, Simplified Chinese version of readme.md)

# Needlebench: A Needle-in-a-Haystack Evaluation Benchmark

[English](readme.md) | 简体中文

## Overview

Needlebench is a comprehensive benchmark designed to rigorously evaluate the information retrieval and reasoning capabilities of large language models (LLMs). Inspired by the NeedleInAHaystack experiment, Needlebench broadens the scope to a variety of tasks, each testing a different aspect of an LLM's ability to process, recall, and reason over information embedded in long texts.

### Directory Structure

```
opencompass/datasets/needlebench/
├── atc.py       # Ancestral Trace Challenge
├── multi.py     # Multi-Needles Reasoning
├── origin.py    # Single-Needle Retrieval
├── parallel.py  # Multi-Needles Retrieval
└── readme.md
```

## Task Descriptions

### Single-Needle Retrieval (`origin.py`)

The Single-Needle Retrieval task is the foundation of the Needlebench suite, focusing on the LLM's ability to recall a single important piece of information from a haystack text of a given length. It mirrors the goal of the original NeedleInAHaystack test, assessing the model's precision in identifying and recalling specific information within a large body of text.

### Multi-Needles Retrieval (`parallel.py`)

The Multi-Needles Retrieval task challenges the LLM's ability to identify and extract multiple key pieces of information from extensive texts. It simulates real-world scenarios in which several data points, facts, or figures must be retrieved from documents or reports, evaluating how efficiently the model navigates dense text and extracts the relevant items.

### Multi-Needles Reasoning (`multi.py`)

Building on the retrieval tasks, the Multi-Needles Reasoning task emphasizes the LLM's ability to perform complex reasoning with the retrieved information. The model must not only recall multiple pieces of information but also reason logically over them, synthesizing answers that reflect an understanding of the relationships between the different pieces of information.

### Ancestral Trace Challenge (ATC) (`atc.py`)

The Ancestral Trace Challenge is the most complex task in the Needlebench suite, requiring models to recall and analyze every detail in long texts to solve problems that demand an understanding of complex relationships, such as genealogical inquiries or detailed case analysis. This task highlights the need for models to process and reason over information at a fine-grained level, reflecting the demands of sophisticated real-world analytical tasks.

Changes to the summarizer:

@@ -241,6 +241,8 @@ def save_results_to_plots(txt_results_save_path):
     # Process and visualize the overall score
     overall_score_pic_path = os.path.join(plot_path, f'{model_name}_overall.png')
     merged_df = merge_dataframes(model_name, dataset_abbrs, parsed_data)
+
+    print(merged_df)
     averaged_df = calculate_elementwise_average(merged_df)

     # Assume visualize returns the average score for the overall visualization

@@ -277,13 +279,16 @@ def merge_dataframes(model_name, dataset_abbrs, parsed_data):
         dfs.append(df)

-    # Merge the DataFrames along the column axis
-    merged_df = pd.concat(dfs, axis=1)
+    # Use reduce with pd.merge to join all DataFrames on the 'dataset' column
+    from functools import reduce
+    merged_df = reduce(lambda left, right: pd.merge(left, right, on='dataset', how='outer'), dfs)

+    # merged_df.to_csv("dropbefore.csv")
     # Check for NaN values and filter out rows with NaN
     if merged_df.isnull().any().any():
         print('Warning: Some rows were filtered out due to NaN values. This is often due to mismatched row counts among DataFrames.')
         merged_df = merged_df.dropna()
+        # merged_df.to_csv("dropafter.csv")
     return merged_df


 def calculate_elementwise_average(merged_df):
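
For context on the summarizer change above: `pd.concat(dfs, axis=1)` pairs rows purely by position, so per-model score tables with differing row counts or orderings can be silently misaligned, whereas reducing over `pd.merge(..., on='dataset', how='outer')` aligns rows by dataset name and surfaces missing entries as NaN, which the warning-and-`dropna` block then handles. A small self-contained illustration with toy data (not the summarizer's real inputs):

```python
from functools import reduce

import pandas as pd

# Toy per-model score tables; row order and row counts deliberately differ.
df_a = pd.DataFrame({"dataset": ["4k", "8k", "32k"], "model_a": [0.90, 0.80, 0.60]})
df_b = pd.DataFrame({"dataset": ["8k", "4k"], "model_b": [0.70, 0.95]})

# Outer-merging on 'dataset' keeps rows aligned by name; datasets missing from
# one frame appear as NaN and can be warned about and dropped afterwards.
aligned = reduce(lambda left, right: pd.merge(left, right, on="dataset", how="outer"),
                 [df_a, df_b])
print(aligned)
# Three rows (4k, 8k, 32k); model_b is NaN for 32k because df_b has no such row.
```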