Needlebench: A Benchmark for Needle-In-A-Haystack Evaluations

English | 简体中文

Overview

Needlebench is a comprehensive benchmark designed to rigorously assess the information retrieval and reasoning capabilities of large language models (LLMs). Drawing inspiration from the NeedleInAHaystack experiment, Needlebench broadens the scope to a variety of tasks, each aimed at testing a different facet of an LLM's ability to process, recall, and reason over information embedded in lengthy texts.

Directory Structure

configs/datasets/needlebench/
├── atc
├── needlebench_4k
├── needlebench_8k
├── needlebench_32k
├── needlebench_128k
├── needlebench_200k
├── needlebench.py
├── readme.md
└── readme_zh-CN.md

Each configuration directory (e.g., needlebench_4k) contains scripts tailored to that specific length setting:

needlebench_4k/
├── needlebench_multi_reasoning.py
├── needlebench_multi_retrieval.py
├── needlebench.py
└── needlebench_single.py
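
A configuration from this tree can be pulled into a top-level OpenCompass evaluation config via read_base(). The sketch below is illustrative only: the exported dataset variable (assumed here to be needlebench_datasets) and the example model config path are assumptions that should be checked against the actual scripts.

```python
# eval_needlebench.py -- a minimal sketch of an OpenCompass entry config.
from mmengine.config import read_base

with read_base():
    # Module paths and variable names are assumptions; check the actual
    # scripts for the dataset lists they export.
    from .datasets.needlebench.needlebench_4k.needlebench_single import (
        needlebench_datasets,
    )
    from .models.hf_internlm.hf_internlm2_chat_7b import models

datasets = needlebench_datasets

# Run with: python run.py configs/eval_needlebench.py
```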

Task Descriptions and Length Configurations

Needlebench offers its tasks in several length configurations (4k, 8k, 32k, 128k, and 200k) to accommodate different scales of evaluation. Each length configuration provides specialized test scripts for the following tasks:

Single-Needle Retrieval (needlebench_single.py)

The Single-Needle Retrieval task evaluates the LLM's ability to recall a single piece of crucial information from a haystack text of a specific length. This task mirrors the original NeedleInAHaystack test's objective, assessing the model's precision in identifying and recalling specific information within large bodies of text.
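
The construction behind this task follows the general needle-in-a-haystack recipe. The sketch below illustrates that recipe and is not the OpenCompass implementation: it places one needle sentence at a chosen depth inside filler text trimmed to the target length (counting characters rather than tokens for simplicity), and haystack.txt and the needle content are hypothetical.

```python
def build_single_needle_prompt(haystack: str, needle: str,
                               context_len: int, depth_pct: float) -> str:
    """Insert `needle` at roughly `depth_pct` (0-100) of a context
    trimmed to `context_len` characters."""
    context = haystack[:context_len]
    pos = int(len(context) * depth_pct / 100)
    # Snap back to a sentence boundary so the needle isn't spliced mid-sentence.
    pos = context.rfind(".", 0, pos) + 1
    return context[:pos] + " " + needle + " " + context[pos:]


needle = "The hidden passcode mentioned in the meeting notes is 'aurora-42'."
prompt = build_single_needle_prompt(open("haystack.txt").read(), needle,
                                    context_len=4000, depth_pct=50.0)
question = "What is the hidden passcode mentioned in the meeting notes?"
```

Sweeping the insertion depth and the context length is what produces the familiar retrieval heat map from the original NeedleInAHaystack experiment.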

Multi-Needle Retrieval (needlebench_multi_retrieval.py)

The Multi-Needle Retrieval task challenges the LLM's ability to identify and extract multiple key information points from extensive texts. It simulates real-world scenarios where multiple data points, facts, or figures need to be retrieved from documents or reports, evaluating the model's efficiency in navigating and extracting relevant information from dense texts.
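
As a rough illustration (again, not the actual implementation), the multi-needle variant of the same recipe scatters several needles at evenly spaced depths and asks a question that touches all of them:

```python
def build_multi_needle_prompt(haystack: str, needles: list[str],
                              context_len: int) -> str:
    """Scatter several needles at evenly spaced depths in the context."""
    context = haystack[:context_len]
    n = len(needles)
    # Compute positions on the original text, snapped to sentence boundaries,
    # then insert from the deepest position backwards so earlier insertions
    # don't shift the later ones.
    positions = [
        context.rfind(".", 0, int(len(context) * (i + 1) / (n + 1))) + 1
        for i in range(n)
    ]
    for pos, needle in sorted(zip(positions, needles), reverse=True):
        context = context[:pos] + " " + needle + " " + context[pos:]
    return context


needles = [
    "Alice transferred 300 euros on Monday.",
    "Bob transferred 450 euros on Tuesday.",
    "Carol transferred 150 euros on Wednesday.",
]
# e.g. "List every transfer mentioned in the text, with its amount and day."
```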

Multi-Needle Reasoning (needlebench_multi_reasoning.py)

Building on the retrieval tasks, the Multi-Needle Reasoning task emphasizes the LLM's capacity for complex reasoning with the retrieved information. The model must not only recall multiple pieces of information but also engage in logical reasoning, synthesizing answers that reflect an understanding of the intricate relationships between various information points.
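
Reusing the hypothetical build_multi_needle_prompt sketch from above, the only change for the reasoning variant is the nature of the needles and the question: each fact is useless on its own, and the answer requires combining all of them.

```python
# Needles that only yield an answer when combined (illustrative only).
reasoning_needles = [
    "The red box is heavier than the blue box.",
    "The blue box is heavier than the green box.",
    "The green box is heavier than the yellow box.",
]
prompt = build_multi_needle_prompt(open("haystack.txt").read(),
                                   reasoning_needles, context_len=8000)
# A question no single needle answers:
# "Rank the four boxes from heaviest to lightest."
```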

Ancestral Trace Challenge (ATC)

The Ancestral Trace Challenge is Needlebench's most complex task. It requires models to recall and analyze every detail in a long text in order to solve problems that demand an understanding of complex relationships, such as genealogical inquiries or detailed case analysis. This task highlights the need for models to process and reason over information at a granular level, mirroring the demands of sophisticated real-world analytical tasks.
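
To make the format concrete, here is a hedged sketch of the kind of item the genealogical variant involves (an illustration of the task shape, not the actual generator): kinship statements forming a chain are shuffled into the context, and naming the most distant ancestor requires traversing every link.

```python
import random


def build_atc_item(names: list[str]) -> tuple[str, str, str]:
    """Build a shuffled chain of parent-child statements; the answer
    requires following every link, so no single sentence suffices."""
    statements = [f"{parent} is the parent of {child}."
                  for parent, child in zip(names, names[1:])]
    random.shuffle(statements)  # ordering gives no hint about the chain
    question = (f"Based only on the text, who is the most distant "
                f"ancestor of {names[-1]}?")
    return " ".join(statements), question, names[0]


context, question, answer = build_atc_item(
    ["Avery", "Blake", "Casey", "Dana", "Ellis"])
# answer == "Avery"; the model must chain Ellis -> Dana -> Casey -> Blake -> Avery.
```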