## 🌾 About SeedBench
SeedBench is the first multi-task benchmark designed to evaluate large language models (LLMs) in seed science, focusing on seed breeding. This repository includes the dataset, evaluation code, and documentation to support research in this domain.
SeedBench assesses LLMs across three core seed breeding stages:
- Gene Information Retrieval
- Gene Function and Regulation Analysis
- Variety Breeding with Agronomic Trait Optimization
Developed in collaboration with domain experts, SeedBench comprises 2,264 expert-validated questions across 11 task types and 10 subcategories, initially targeting rice breeding. Future releases will extend coverage to other crops such as maize, soybean, and wheat.
## 🔎 Dataset Details
- **Corpus**: 308,727 publications, cleaned down to 1.1 billion tokens; 279 segments drawn from 113 documents.
- **Questions**: 2,264 across 11 task types, bilingual (English/Chinese), expert-validated (an illustrative record sketch follows this list).
- **Focus**: Rice breeding as a representative case.
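This README does not pin down the item schema. Purely for orientation, a question record might look like the sketch below; every field name here is hypothetical, so check the released data files for the actual layout.

```python
# Hypothetical SeedBench item -- all field names are illustrative guesses,
# not the dataset's actual schema; inspect the released files for the real one.
example_item = {
    "type_id": "QA-1",        # one of the 11 task types (see the table below)
    "subcategory": "C3",      # one of the 10 subcategories
    "language": "en",         # bilingual dataset: English ("en") or Chinese ("zh")
    "question": "Which rice gene ...?",
    "options": {"A": "...", "B": "...", "C": "...", "D": "..."},
    "answer": "A",            # expert-validated gold label
}
```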
Types and metrics:

Type | ID | Question Type | Metric | Count |
---|---|---|---|---|
Q&A | QA-1 | Multiple Choice | Accuracy | 200 |
Q&A | QA-2 | Multiple Answer | Macro-F1 | 187 |
Q&A | QA-3 | Fill-in-the-Blank | ROUGE-L | 224 |
Q&A | QA-4 | Generation | ROUGE-L | 242 |
Summarization | SUM-1 | Simple Summarization | ROUGE-L | 225 |
Summarization | SUM-2 | Key Information Extraction | ROUGE-L | 225 |
Reading Comprehension | RC-1 | Multiple Choice | Accuracy | 113 |
Reading Comprehension | RC-2 | Multiple Answer | Macro-F1 | 108 |
Reading Comprehension | RC-3 | Fill-in-the-Blank | ROUGE-L | 221 |
Reading Comprehension | RC-4 | Generation | ROUGE-L | 240 |
Reading Comprehension | RC-5 | Subcategory Classification | Accuracy | 279 |
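As a rough guide to how the three metrics behave, here is a minimal scoring sketch. The helper names and I/O conventions are ours, not the repo's API (`seedbench_gen.py` holds the actual evaluation code), and the ROUGE-L tokenization shown covers English only; the Chinese split would need a suitable tokenizer.

```python
# Minimal sketch of the three SeedBench metrics; helper names are illustrative.
# pip install rouge-score scikit-learn numpy
import numpy as np
from rouge_score import rouge_scorer
from sklearn.metrics import f1_score

def accuracy(preds: list[str], golds: list[str]) -> float:
    # QA-1 / RC-1 / RC-5: exact match on the predicted option or class label.
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def macro_f1(preds: list[set], golds: list[set], labels: str = "ABCD") -> float:
    # QA-2 / RC-2 (multiple answer): binarize each option, macro-average F1.
    y_true = np.array([[int(l in g) for l in labels] for g in golds])
    y_pred = np.array([[int(l in p) for l in labels] for p in preds])
    return f1_score(y_true, y_pred, average="macro", zero_division=0)

def rouge_l(preds: list[str], golds: list[str]) -> float:
    # QA-3/QA-4, SUM-1/SUM-2, RC-3/RC-4: mean ROUGE-L F-measure.
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    return sum(scorer.score(g, p)["rougeL"].fmeasure
               for g, p in zip(golds, preds)) / len(golds)

if __name__ == "__main__":
    print(accuracy(["A", "B"], ["A", "C"]))              # 0.5
    print(macro_f1([{"A", "B"}], [{"A", "C"}]))          # 0.25
    print(rouge_l(["the gene regulates plant height"],
                  ["this gene regulates plant height"]))  # 0.8
```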
## 📂 Dataset Links
## ☀️ Key Results
We evaluated 26 LLMs, including proprietary, open-source, and domain-specific models. Highlights:
### Performance by Task Type

- Top Performers: DeepSeek-V3 (68.37), GPT-4 (67.88).
Model | QA-1 | QA-2 | QA-3 | QA-4 | SUM-1 | SUM-2 | RC-1 | RC-2 | RC-3 | RC-4 | RC-5 | Avg |
---|---|---|---|---|---|---|---|---|---|---|---|---|
GPT-4 | 60.50 | 73.87 | 21.35 | 36.07 | 58.73 | 62.89 | 100.00 | 96.44 | 87.86 | 62.29 | 86.74 | 67.88 |
DeepSeek-V3 | 72.50 | 79.84 | 29.29 | 40.63 | 48.06 | 54.67 | 100.00 | 97.22 | 87.89 | 55.19 | 86.74 | 68.37 |
Qwen2-72B | 59.50 | 75.98 | 19.55 | 31.62 | 31.08 | 63.09 | 99.12 | 94.24 | 72.20 | 51.58 | 89.96 | 62.54 |
### Performance by Subcategory
Model | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10 | Avg |
---|---|---|---|---|---|---|---|---|---|---|---|
GPT-4 | 59.59 | 60.55 | 76.32 | 61.16 | 56.34 | 59.35 | 63.67 | 64.74 | 60.65 | 67.66 | 62.06 |
DeepSeek-V3-671B | 56.03 | 62.42 | 74.81 | 63.17 | 55.23 | 58.84 | 68.23 | 69.04 | 66.46 | 68.48 | 63.30 |
Qwen2-72B | 51.16 | 58.10 | 74.07 | 59.72 | 51.58 | 57.76 | 58.85 | 61.63 | 56.69 | 59.11 | 57.62 |
- Top Performers: DeepSeek-V3-671B (63.30), GPT-4 (62.06).
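For reference, the Avg column of the task-type table appears to be an unweighted mean over the 11 task types; this is an assumption on our part, but it reproduces the DeepSeek-V3 row exactly:

```python
# Sanity check: recompute DeepSeek-V3's Avg from its per-task scores,
# assuming an unweighted mean over the 11 task types (our assumption,
# not confirmed by the repo).
deepseek_v3 = [72.50, 79.84, 29.29, 40.63, 48.06, 54.67,
               100.00, 97.22, 87.89, 55.19, 86.74]
print(round(sum(deepseek_v3) / len(deepseek_v3), 2))  # 68.37, matching the table
```

The subcategory Avg does not reduce to a simple mean (GPT-4's C1–C10 scores average 63.00, not the reported 62.06), so that column is presumably weighted, likely by question count; consult the paper for the exact aggregation.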