[Feature] add lveval benchmark (#914)

* add lveval benchmark

* add LVEval readme file

* update LVEval readme file

* Update configs/eval_bluelm_32k_lveval.py

* Update configs/eval_llama2_7b_lveval.py

---------

Co-authored-by: yuantao <yuantao@infini-ai.com>
Co-authored-by: Mo Li <82895469+DseidLi@users.noreply.github.com>
yuantao2108 2024-03-04 11:22:03 +08:00 committed by GitHub
parent 8142f399a8
commit bbec7d8733
42 changed files with 1880 additions and 0 deletions

View File

@ -0,0 +1,165 @@
# LVEval
## Introduction
The following introduction is taken from the [LVEval](https://github.com/infinigence/LVEval) repository:
```
LV-Eval is a challenging long-context benchmark with five length levels (16k, 32k, 64k, 128k, and 256k), reaching up to 256k words. The average number of words is 102,380, and the min/max number of words is 11,896/387,406. LV-Eval features two main tasks, single-hop QA and multi-hop QA, comprising 11 bilingual datasets. The design of LV-Eval incorporates three key techniques: confusing facts insertion (CFI) to raise difficulty, keyword and phrase replacement (KPR) to reduce knowledge leakage, and keyword-recall-based metrics (AK, short for metrics with Answer Keywords and a word blacklist) to make the scores more objective. Together they provide a challenging, leakage-mitigated, and more accurate evaluation of the long-context capability of LLMs. We anticipate that LV-Eval will serve as a valuable resource for supporting future research on long-context LLMs.
```
## Official link
### Paper
[_LV_-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K](https://arxiv.org/abs/2402.05136)
### Repository
[LVEval](https://github.com/infinigence/LVEval)
## Use cases
In an evaluation script, add the LVEval datasets in the same way as other datasets:
```
from .datasets.lveval.lveval import LVEval_datasets as datasets
```
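A minimal evaluation config built on this import might look like the sketch below; it mirrors `configs/eval_bluelm_32k_lveval.py` shown later in this commit, and the model and summarizer imports are just one possible choice:
```
from mmengine.config import read_base

with read_base():
    from .datasets.lveval.lveval import LVEval_datasets as datasets
    from .models.bluelm.hf_bluelm_7b_chat_32k import models
    from .summarizers.lveval import summarizer
```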
## Examples
Input example I (from the lic_mixup dataset):
```
请根据下面给定的文章回答问题,问题和答案只与其中一篇文章有关。
文章:......文章 9\n\n标题腐质酸\n内容腐植酸是自然界中广泛存在的大分子有机物质广泛应用于农林牧、石油、化工、建材、医药卫生、环保等各个领域。横跨几十个行业。特别是眼下提倡生态农业建设、无公害农业生产、绿色食品、无污染环保产品等更使\"腐植酸\"备受推崇,事实证明,人类的生活和生存离不开腐植酸,它的确是一个发展中的有希望的朝阳产业,属于一个新型的特殊行业......
请现在基于上述文章回答下面的问题,问题和答案只与其中一篇文章有关。
问题:中国的文学受到印度哪些方面的影响?
回答:
```
Output example I (from chatglm3-6b-32k):
```
中国文学自印度文学大量吸收营养,在佛教东流之后,从语汇到修辞,从题材到体裁,即便审美取向也深受佛教与印度文学的感染。
```
Input example II (from the factrecall_zh dataset):
```
请基于给定的文章回答下述问题。
文章:......庚子年间,贝多芬,乃一德裔美籍学士,研究于物理理学。彼其良图,探求相对论、量子力学,尤有大进。质能等价公式 E=mc²千古独步声名于当世。诺贝尔物理学奖、以资尊荣兹矣荣耀之大典。论其学术涉时空能量影响深远以其义非常人广为当世所知声名播于天下实乃现代物理学之奠基者......
现在请基于上述文章回答下面的问题。
问题:被世人广泛推崇为现代物理学奠基人的科学家叫什么名字?
回答:
```
Output example II (from chatglm3-6b-32k):
```
贝多芬
```
## Evaluation results
```
dataset version metric mode bluelm-7b-chat-32k-hf
----------------------------------------- --------- ------------- ------ -----------------------
---------------------------------------- - - - -
--------- LVEval All --------- - - - -
---------------------------------------- - - - -
LVEval_qa - naive_average gen 12.00
---------------------------------------- - - - -
--------- LVEval Tasks All --------- - - - -
---------------------------------------- - - - -
LVEval_single_hop_qa - naive_average gen 15.11
LVEval_single_hop_cqa - naive_average gen 9.21
LVEval_multi_hop_qa - naive_average gen 6.99
LVEval_multi_hop_cqa - naive_average gen 9.90
LVEval_factrecall_cqa - naive_average gen 21.28
---------------------------------------- - - - -
--------- LVEval Datasets All --------- - - - -
---------------------------------------- - - - -
LVEval_loogle_SD_mixup - naive_average gen 12.81
LVEval_cmrc_mixup - naive_average gen 17.41
LVEval_multifieldqa_en_mixup - naive_average gen 7.10
LVEval_multifieldqa_zh_mixup - naive_average gen 11.31
LVEval_dureader_mixup - naive_average gen 13.19
LVEval_loogle_CR_mixup - naive_average gen 5.17
LVEval_loogle_MIR_mixup - naive_average gen 2.60
LVEval_hotpotwikiqa_mixup - naive_average gen 10.20
LVEval_lic_mixup - naive_average gen 9.60
LVEval_factrecall_en - naive_average gen 23.67
LVEval_factrecall_zh - naive_average gen 18.90
---------------------------------------- - - - -
--------- LVEval Single_Hop QA --------- - - - -
---------------------------------------- - - - -
LVEval_loogle_SD_mixup_16k 83bc25 LVEval_f1 gen 35.05
LVEval_loogle_SD_mixup_32k 83bc25 LVEval_f1 gen 13.37
LVEval_loogle_SD_mixup_64k 83bc25 LVEval_f1 gen 6.32
LVEval_loogle_SD_mixup_128k 83bc25 LVEval_f1 gen 5.28
LVEval_loogle_SD_mixup_256k 83bc25 LVEval_f1 gen 4.00
---------------------------------------- - - - -
LVEval_cmrc_mixup_16k 8bac4e LVEval_f1 gen 46.45
LVEval_cmrc_mixup_32k 8bac4e LVEval_f1 gen 19.41
LVEval_cmrc_mixup_64k 8bac4e LVEval_f1 gen 11.10
LVEval_cmrc_mixup_128k 8bac4e LVEval_f1 gen 5.89
LVEval_cmrc_mixup_256k 8bac4e LVEval_f1 gen 4.22
---------------------------------------- - - - -
--------- LVEval Single_Hop CQA --------- - - - -
---------------------------------------- - - - -
LVEval_multifieldqa_en_mixup_16k 83bc25 LVEval_f1 gen 12.28
LVEval_multifieldqa_en_mixup_32k 83bc25 LVEval_f1 gen 4.64
LVEval_multifieldqa_en_mixup_64k 83bc25 LVEval_f1 gen 8.30
LVEval_multifieldqa_en_mixup_128k 83bc25 LVEval_f1 gen 5.63
LVEval_multifieldqa_en_mixup_256k 83bc25 LVEval_f1 gen 4.64
---------------------------------------- - - - -
LVEval_multifieldqa_zh_mixup_16k ac4a0d LVEval_f1 gen 22.30
LVEval_multifieldqa_zh_mixup_32k ac4a0d LVEval_f1 gen 17.46
LVEval_multifieldqa_zh_mixup_64k ac4a0d LVEval_f1 gen 6.27
LVEval_multifieldqa_zh_mixup_128k ac4a0d LVEval_f1 gen 5.84
LVEval_multifieldqa_zh_mixup_256k ac4a0d LVEval_f1 gen 4.71
---------------------------------------- - - - -
--------- LVEval Multi_Hop QA --------- - - - -
---------------------------------------- - - - -
LVEval_dureader_mixup_16k 8bac4e LVEval_rouge gen 18.04
LVEval_dureader_mixup_32k 8bac4e LVEval_rouge gen 18.33
LVEval_dureader_mixup_64k 8bac4e LVEval_rouge gen 12.56
LVEval_dureader_mixup_128k 8bac4e LVEval_rouge gen 10.33
LVEval_dureader_mixup_256k 8bac4e LVEval_rouge gen 6.69
---------------------------------------- - - - -
LVEval_loogle_CR_mixup_16k 83bc25 LVEval_f1 gen 9.35
LVEval_loogle_CR_mixup_32k 83bc25 LVEval_f1 gen 7.42
LVEval_loogle_CR_mixup_64k 83bc25 LVEval_f1 gen 3.18
LVEval_loogle_CR_mixup_128k 83bc25 LVEval_f1 gen 2.65
LVEval_loogle_CR_mixup_256k 83bc25 LVEval_f1 gen 3.27
---------------------------------------- - - - -
LVEval_loogle_MIR_mixup_16k 83bc25 LVEval_f1 gen 4.50
LVEval_loogle_MIR_mixup_32k 83bc25 LVEval_f1 gen 3.19
LVEval_loogle_MIR_mixup_64k 83bc25 LVEval_f1 gen 2.34
LVEval_loogle_MIR_mixup_128k 83bc25 LVEval_f1 gen 1.76
LVEval_loogle_MIR_mixup_256k 83bc25 LVEval_f1 gen 1.20
---------------------------------------- - - - -
--------- LVEval Multi_Hop CQA --------- - - - -
---------------------------------------- - - - -
LVEval_hotpotwikiqa_mixup_16k e3c368 LVEval_f1 gen 19.80
LVEval_hotpotwikiqa_mixup_32k e3c368 LVEval_f1 gen 12.59
LVEval_hotpotwikiqa_mixup_64k e3c368 LVEval_f1 gen 7.33
LVEval_hotpotwikiqa_mixup_128k e3c368 LVEval_f1 gen 7.85
LVEval_hotpotwikiqa_mixup_256k e3c368 LVEval_f1 gen 3.42
---------------------------------------- - - - -
LVEval_lic_mixup_16k fdd540 LVEval_f1 gen 21.36
LVEval_lic_mixup_32k fdd540 LVEval_f1 gen 12.92
LVEval_lic_mixup_64k fdd540 LVEval_f1 gen 4.62
LVEval_lic_mixup_128k fdd540 LVEval_f1 gen 4.25
LVEval_lic_mixup_256k fdd540 LVEval_f1 gen 4.85
---------------------------------------- - - - -
--------- LVEval Factrecall CQA --------- - - - -
---------------------------------------- - - - -
LVEval_factrecall_en_16k fba966 f1 gen 58.33
LVEval_factrecall_en_32k fba966 f1 gen 32.17
LVEval_factrecall_en_64k fba966 f1 gen 15.33
LVEval_factrecall_en_128k fba966 f1 gen 8.50
LVEval_factrecall_en_256k fba966 f1 gen 4.00
---------------------------------------- - - - -
LVEval_factrecall_zh_16k ef3320 f1 gen 20.00
LVEval_factrecall_zh_32k ef3320 f1 gen 38.00
LVEval_factrecall_zh_64k ef3320 f1 gen 20.50
LVEval_factrecall_zh_128k ef3320 f1 gen 11.00
LVEval_factrecall_zh_256k ef3320 f1 gen 5.00
```

View File

@ -0,0 +1,38 @@
from mmengine.config import read_base
with read_base():
from .lvevalcmrc_mixup.lveval_cmrc_mixup_gen import (
LVEval_cmrc_mixup_datasets,
)
from .lvevaldureader_mixup.lveval_dureader_mixup_gen import (
LVEval_dureader_mixup_datasets,
)
from .lvevalfactrecall_en.lveval_factrecall_en_gen import (
LVEval_factrecall_en_datasets,
)
from .lvevalfactrecall_zh.lveval_factrecall_zh_gen import (
LVEval_factrecall_zh_datasets,
)
from .lvevalhotpotwikiqa_mixup.lveval_hotpotwikiqa_mixup_gen import (
LVEval_hotpotwikiqa_mixup_datasets,
)
from .lvevallic_mixup.lveval_lic_mixup_gen import LVEval_lic_mixup_datasets
from .lvevalloogle_CR_mixup.lveval_loogle_CR_mixup_gen import (
LVEval_loogle_CR_mixup_datasets,
)
from .lvevalloogle_MIR_mixup.lveval_loogle_MIR_mixup_gen import (
LVEval_loogle_MIR_mixup_datasets,
)
from .lvevalloogle_SD_mixup.lveval_loogle_SD_mixup_gen import (
LVEval_loogle_SD_mixup_datasets,
)
from .lvevalmultifieldqa_en_mixup.lveval_multifieldqa_en_mixup_gen import (
LVEval_multifieldqa_en_mixup_datasets,
)
from .lvevalmultifieldqa_zh_mixup.lveval_multifieldqa_zh_mixup_gen import (
LVEval_multifieldqa_zh_mixup_datasets,
)
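# Collect every LVEval_*_datasets list imported above into a single flat list.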
LVEval_datasets = sum(
(v for k, v in locals().items() if k.endswith("_datasets")), []
)

View File

@ -0,0 +1,6 @@
from mmengine.config import read_base
with read_base():
from .lveval_cmrc_mixup_gen_465823 import (
LVEval_cmrc_mixup_datasets,
) # noqa: F401, F403

View File

@ -0,0 +1,54 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import LVEvalOPTF1Evaluator, LVEvalcmrcDataset
LVEval_cmrc_mixup_reader_cfg = dict(
input_columns=["context", "input"],
output_column="answers",
train_split="test",
test_split="test",
)
LVEval_cmrc_mixup_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(
role="HUMAN",
prompt="请根据下面给定的文章回答问题,问题和答案只与其中一篇文章有关。\n\n文章:{context}\n\n现在请基于上述文章回答下面的问题,问题和答案只与其中一篇文章有关。\n\n问题:{input}\n回答:",
),
],
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer, max_out_len=64),
)
LVEval_cmrc_mixup_eval_cfg = dict(
evaluator=dict(type=LVEvalOPTF1Evaluator, language="zh"), pred_role="BOT"
)
DATASET_LENGTH_LEVEL = ["16k", "32k", "64k", "128k", "256k"]
def get_dataset_names(dataset_name, length_levels):
datasets = []
for length in length_levels:
datasets.append(f"{dataset_name}_{length}")
return datasets
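# Build one dataset config per length level (cmrc_mixup_16k ... cmrc_mixup_256k); they all share the reader/infer/eval configs above.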
LVEval_cmrc_mixup_datasets = [
dict(
type=LVEvalcmrcDataset,
abbr="LVEval_" + name_len,
path="Infinigence/LVEval",
name=name_len,
reader_cfg=LVEval_cmrc_mixup_reader_cfg,
infer_cfg=LVEval_cmrc_mixup_infer_cfg,
eval_cfg=LVEval_cmrc_mixup_eval_cfg,
)
for name_len in get_dataset_names("cmrc_mixup", DATASET_LENGTH_LEVEL)
]

View File

@ -0,0 +1,6 @@
from mmengine.config import read_base
with read_base():
from .lveval_dureader_mixup_gen_465823 import (
LVEval_dureader_mixup_datasets,
) # noqa: F401, F403

View File

@ -0,0 +1,55 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import LVEvalOPTRougeEvaluator, LVEvaldureaderDataset
LVEval_dureader_mixup_reader_cfg = dict(
input_columns=["context", "input"],
output_column="answers",
train_split="test",
test_split="test",
)
LVEval_dureader_mixup_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(
role="HUMAN",
prompt="请根据下面给定的文章回答问题,问题和答案只与其中一篇文章有关。\n\n文章:{context}\n\n现在请基于上述文章回答下面的问题,问题和答案只与其中一篇文章有关。\n\n问题:{input}\n回答:",
),
],
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer, max_out_len=64),
)
LVEval_dureader_mixup_eval_cfg = dict(
evaluator=dict(type=LVEvalOPTRougeEvaluator, language="zh"),
pred_role="BOT",
)
DATASET_LENGTH_LEVEL = ["16k", "32k", "64k", "128k", "256k"]
def get_dataset_names(dataset_name, length_levels):
datasets = []
for length in length_levels:
datasets.append(f"{dataset_name}_{length}")
return datasets
LVEval_dureader_mixup_datasets = [
dict(
type=LVEvaldureaderDataset,
abbr="LVEval_" + name_len,
path="Infinigence/LVEval",
name=name_len,
reader_cfg=LVEval_dureader_mixup_reader_cfg,
infer_cfg=LVEval_dureader_mixup_infer_cfg,
eval_cfg=LVEval_dureader_mixup_eval_cfg,
)
for name_len in get_dataset_names("dureader_mixup", DATASET_LENGTH_LEVEL)
]

View File

@ -0,0 +1,6 @@
from mmengine.config import read_base
with read_base():
from .lveval_factrecall_en_gen_9a836f import (
LVEval_factrecall_en_datasets,
) # noqa: F401, F403

View File

@ -0,0 +1,54 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import LVEvalF1Evaluator, LVEvalfactrecallenDataset
LVEval_factrecall_en_reader_cfg = dict(
input_columns=["context", "input"],
output_column="answers",
train_split="test",
test_split="test",
)
LVEval_factrecall_en_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(
role="HUMAN",
prompt="Please answer the following questions based on the given article.\n\nArticle: {context}\n\nPlease answer the following questions based on the above article.\n\nQuestion: {input}\nAnswer:",
),
],
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer, max_out_len=16),
)
LVEval_factrecall_en_eval_cfg = dict(
evaluator=dict(type=LVEvalF1Evaluator, language="en"), pred_role="BOT"
)
DATASET_LENGTH_LEVEL = ["16k", "32k", "64k", "128k", "256k"]
def get_dataset_names(dataset_name, length_levels):
datasets = []
for length in length_levels:
datasets.append(f"{dataset_name}_{length}")
return datasets
LVEval_factrecall_en_datasets = [
dict(
type=LVEvalfactrecallenDataset,
abbr="LVEval_" + name_len,
path="Infinigence/LVEval",
name=name_len,
reader_cfg=LVEval_factrecall_en_reader_cfg,
infer_cfg=LVEval_factrecall_en_infer_cfg,
eval_cfg=LVEval_factrecall_en_eval_cfg,
)
for name_len in get_dataset_names("factrecall_en", DATASET_LENGTH_LEVEL)
]

View File

@ -0,0 +1,6 @@
from mmengine.config import read_base
with read_base():
from .lveval_factrecall_zh_gen_dbee70 import (
LVEval_factrecall_zh_datasets,
) # noqa: F401, F403

View File

@ -0,0 +1,54 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import LVEvalF1Evaluator, LVEvalfactrecallzhDataset
LVEval_factrecall_zh_reader_cfg = dict(
input_columns=["context", "input"],
output_column="answers",
train_split="test",
test_split="test",
)
LVEval_factrecall_zh_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(
role="HUMAN",
prompt="请基于给定的文章回答下述问题。\n\n文章:{context}\n\n现在请基于上述文章回答下面的问题。\n\n问题:{input}\n回答:",
),
],
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer, max_out_len=16),
)
LVEval_factrecall_zh_eval_cfg = dict(
evaluator=dict(type=LVEvalF1Evaluator, language="zh"), pred_role="BOT"
)
DATASET_LENGTH_LEVEL = ["16k", "32k", "64k", "128k", "256k"]
def get_dataset_names(dataset_name, length_levels):
datasets = []
for length in length_levels:
datasets.append(f"{dataset_name}_{length}")
return datasets
LVEval_factrecall_zh_datasets = [
dict(
type=LVEvalfactrecallzhDataset,
abbr="LVEval_" + name_len,
path="Infinigence/LVEval",
name=name_len,
reader_cfg=LVEval_factrecall_zh_reader_cfg,
infer_cfg=LVEval_factrecall_zh_infer_cfg,
eval_cfg=LVEval_factrecall_zh_eval_cfg,
)
for name_len in get_dataset_names("factrecall_zh", DATASET_LENGTH_LEVEL)
]

View File

@ -0,0 +1,6 @@
from mmengine.config import read_base
with read_base():
from .lveval_hotpotwikiqa_mixup_gen_77ce82 import (
LVEval_hotpotwikiqa_mixup_datasets,
) # noqa: F401, F403

View File

@ -0,0 +1,59 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import (
LVEvalOPTF1Evaluator,
LVEvalhotpotwikiqaDataset,
)
LVEval_hotpotwikiqa_mixup_reader_cfg = dict(
input_columns=["context", "input"],
output_column="answers",
train_split="test",
test_split="test",
)
LVEval_hotpotwikiqa_mixup_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(
role="HUMAN",
prompt="Answer the question based on the given passages. Questions and answers are only relevant to some passages. Only give me the answer and do not output any other explanation and evidence.\n\nArticle: {context}\n\nPlease answer the following question based on the above passages. Questions and answers are only relevant to some passages. Only give me the answer and do not output any other explanation and evidence.\n\nQuestion: {input}\nAnswer:",
),
],
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer, max_out_len=64),
)
LVEval_hotpotwikiqa_mixup_eval_cfg = dict(
evaluator=dict(type=LVEvalOPTF1Evaluator, language="en"), pred_role="BOT"
)
DATASET_LENGTH_LEVEL = ["16k", "32k", "64k", "128k", "256k"]
def get_dataset_names(dataset_name, length_levels):
datasets = []
for length in length_levels:
datasets.append(f"{dataset_name}_{length}")
return datasets
LVEval_hotpotwikiqa_mixup_datasets = [
dict(
type=LVEvalhotpotwikiqaDataset,
abbr="LVEval_" + name_len,
path="Infinigence/LVEval",
name=name_len,
reader_cfg=LVEval_hotpotwikiqa_mixup_reader_cfg,
infer_cfg=LVEval_hotpotwikiqa_mixup_infer_cfg,
eval_cfg=LVEval_hotpotwikiqa_mixup_eval_cfg,
)
for name_len in get_dataset_names(
"hotpotwikiqa_mixup", DATASET_LENGTH_LEVEL
)
]

View File

@ -0,0 +1,6 @@
from mmengine.config import read_base
with read_base():
from .lveval_lic_mixup_gen_01eb0c import (
LVEval_lic_mixup_datasets,
) # noqa: F401, F403

View File

@ -0,0 +1,54 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import LVEvalOPTF1Evaluator, LVEvallicDataset
LVEval_lic_mixup_reader_cfg = dict(
input_columns=["context", "input"],
output_column="answers",
train_split="test",
test_split="test",
)
LVEval_lic_mixup_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(
role="HUMAN",
prompt="请根据下面给定的文章回答问题,问题和答案只与其中一篇文章有关。\n\n文章:{context}\n\n请现在基于上述文章回答下面的问题,问题和答案只与其中一篇文章有关。\n\n问题:{input}\n回答:",
),
],
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer, max_out_len=64),
)
LVEval_lic_mixup_eval_cfg = dict(
evaluator=dict(type=LVEvalOPTF1Evaluator, language="zh"), pred_role="BOT"
)
DATASET_LENGTH_LEVEL = ["16k", "32k", "64k", "128k", "256k"]
def get_dataset_names(dataset_name, length_levels):
datasets = []
for length in length_levels:
datasets.append(f"{dataset_name}_{length}")
return datasets
LVEval_lic_mixup_datasets = [
dict(
type=LVEvallicDataset,
abbr="LVEval_" + name_len,
path="Infinigence/LVEval",
name=name_len,
reader_cfg=LVEval_lic_mixup_reader_cfg,
infer_cfg=LVEval_lic_mixup_infer_cfg,
eval_cfg=LVEval_lic_mixup_eval_cfg,
)
for name_len in get_dataset_names("lic_mixup", DATASET_LENGTH_LEVEL)
]

View File

@ -0,0 +1,6 @@
from mmengine.config import read_base
with read_base():
from .lveval_loogle_CR_mixup_gen_d7ea36 import (
LVEval_loogle_CR_mixup_datasets,
) # noqa: F401, F403

View File

@ -0,0 +1,54 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import LVEvalOPTF1Evaluator, LVEvallooglecrDataset
LVEval_loogle_CR_mixup_reader_cfg = dict(
input_columns=["context", "input"],
output_column="answers",
train_split="test",
test_split="test",
)
LVEval_loogle_CR_mixup_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(
role="HUMAN",
prompt="Please answer the following question based on the given passages. Questions and answers are only relevant to one passage. Only give me the answer and do not output any other explanation and evidence.\n\nArticle: {context}\n\nPlease answer the following question based on the above passages. Questions and answers are only relevant to one passage. Only give me the answer and do not output any other explanation and evidence.\n\nQuestion: {input}\nAnswer:",
),
],
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer, max_out_len=64),
)
LVEval_loogle_CR_mixup_eval_cfg = dict(
evaluator=dict(type=LVEvalOPTF1Evaluator, language="en"), pred_role="BOT"
)
DATASET_LENGTH_LEVEL = ["16k", "32k", "64k", "128k", "256k"]
def get_dataset_names(dataset_name, length_levels):
datasets = []
for length in length_levels:
datasets.append(f"{dataset_name}_{length}")
return datasets
LVEval_loogle_CR_mixup_datasets = [
dict(
type=LVEvallooglecrDataset,
abbr="LVEval_" + name_len,
path="Infinigence/LVEval",
name=name_len,
reader_cfg=LVEval_loogle_CR_mixup_reader_cfg,
infer_cfg=LVEval_loogle_CR_mixup_infer_cfg,
eval_cfg=LVEval_loogle_CR_mixup_eval_cfg,
)
for name_len in get_dataset_names("loogle_CR_mixup", DATASET_LENGTH_LEVEL)
]

View File

@ -0,0 +1,6 @@
from mmengine.config import read_base
with read_base():
from .lveval_loogle_MIR_mixup_gen_d7ea36 import (
LVEval_loogle_MIR_mixup_datasets,
) # noqa: F401, F403

View File

@ -0,0 +1,54 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import LVEvalOPTF1Evaluator, LVEvallooglemirDataset
LVEval_loogle_MIR_mixup_reader_cfg = dict(
input_columns=["context", "input"],
output_column="answers",
train_split="test",
test_split="test",
)
LVEval_loogle_MIR_mixup_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(
role="HUMAN",
prompt="Please answer the following question based on the given passages. Questions and answers are only relevant to one passage. Only give me the answer and do not output any other explanation and evidence.\n\nArticle: {context}\n\nPlease answer the following question based on the above passages. Questions and answers are only relevant to one passage. Only give me the answer and do not output any other explanation and evidence.\n\nQuestion: {input}\nAnswer:",
),
],
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer, max_out_len=64),
)
LVEval_loogle_MIR_mixup_eval_cfg = dict(
evaluator=dict(type=LVEvalOPTF1Evaluator, language="en"), pred_role="BOT"
)
DATASET_LENGTH_LEVEL = ["16k", "32k", "64k", "128k", "256k"]
def get_dataset_names(dataset_name, length_levels):
datasets = []
for length in length_levels:
datasets.append(f"{dataset_name}_{length}")
return datasets
LVEval_loogle_MIR_mixup_datasets = [
dict(
type=LVEvallooglemirDataset,
abbr="LVEval_" + name_len,
path="Infinigence/LVEval",
name=name_len,
reader_cfg=LVEval_loogle_MIR_mixup_reader_cfg,
infer_cfg=LVEval_loogle_MIR_mixup_infer_cfg,
eval_cfg=LVEval_loogle_MIR_mixup_eval_cfg,
)
for name_len in get_dataset_names("loogle_MIR_mixup", DATASET_LENGTH_LEVEL)
]

View File

@ -0,0 +1,6 @@
from mmengine.config import read_base
with read_base():
from .lveval_loogle_SD_mixup_gen_d7ea36 import (
LVEval_loogle_SD_mixup_datasets,
) # noqa: F401, F403

View File

@ -0,0 +1,54 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import LVEvalOPTF1Evaluator, LVEvallooglesdDataset
LVEval_loogle_SD_mixup_reader_cfg = dict(
input_columns=["context", "input"],
output_column="answers",
train_split="test",
test_split="test",
)
LVEval_loogle_SD_mixup_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(
role="HUMAN",
prompt="Please answer the following question based on the given passages. Questions and answers are only relevant to one passage. Only give me the answer and do not output any other explanation and evidence.\n\nArticle: {context}\n\nPlease answer the following question based on the above passages. Questions and answers are only relevant to one passage. Only give me the answer and do not output any other explanation and evidence.\n\nQuestion: {input}\nAnswer:",
),
],
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer, max_out_len=64),
)
LVEval_loogle_SD_mixup_eval_cfg = dict(
evaluator=dict(type=LVEvalOPTF1Evaluator, language="en"), pred_role="BOT"
)
DATASET_LENGTH_LEVEL = ["16k", "32k", "64k", "128k", "256k"]
def get_dataset_names(dataset_name, length_levels):
datasets = []
for length in length_levels:
datasets.append(f"{dataset_name}_{length}")
return datasets
LVEval_loogle_SD_mixup_datasets = [
dict(
type=LVEvallooglesdDataset,
abbr="LVEval_" + name_len,
path="Infinigence/LVEval",
name=name_len,
reader_cfg=LVEval_loogle_SD_mixup_reader_cfg,
infer_cfg=LVEval_loogle_SD_mixup_infer_cfg,
eval_cfg=LVEval_loogle_SD_mixup_eval_cfg,
)
for name_len in get_dataset_names("loogle_SD_mixup", DATASET_LENGTH_LEVEL)
]

View File

@ -0,0 +1,6 @@
from mmengine.config import read_base
with read_base():
from .lveval_multifieldqa_en_mixup_gen_d7ea36 import (
LVEval_multifieldqa_en_mixup_datasets,
) # noqa: F401, F403

View File

@ -0,0 +1,59 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import (
LVEvalOPTF1Evaluator,
LVEvalmultifieldqaenDataset,
)
LVEval_multifieldqa_en_mixup_reader_cfg = dict(
input_columns=["context", "input"],
output_column="answers",
train_split="test",
test_split="test",
)
LVEval_multifieldqa_en_mixup_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(
role="HUMAN",
prompt="Please answer the following question based on the given passages. Questions and answers are only relevant to one passage. Only give me the answer and do not output any other explanation and evidence.\n\nArticle: {context}\n\nPlease answer the following question based on the above passages. Questions and answers are only relevant to one passage. Only give me the answer and do not output any other explanation and evidence.\n\nQuestion: {input}\nAnswer:",
),
],
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer, max_out_len=64),
)
LVEval_multifieldqa_en_mixup_eval_cfg = dict(
evaluator=dict(type=LVEvalOPTF1Evaluator, language="en"), pred_role="BOT"
)
DATASET_LENGTH_LEVEL = ["16k", "32k", "64k", "128k", "256k"]
def get_dataset_names(dataset_name, length_levels):
datasets = []
for length in length_levels:
datasets.append(f"{dataset_name}_{length}")
return datasets
LVEval_multifieldqa_en_mixup_datasets = [
dict(
type=LVEvalmultifieldqaenDataset,
abbr="LVEval_" + name_len,
path="Infinigence/LVEval",
name=name_len,
reader_cfg=LVEval_multifieldqa_en_mixup_reader_cfg,
infer_cfg=LVEval_multifieldqa_en_mixup_infer_cfg,
eval_cfg=LVEval_multifieldqa_en_mixup_eval_cfg,
)
for name_len in get_dataset_names(
"multifieldqa_en_mixup", DATASET_LENGTH_LEVEL
)
]

View File

@ -0,0 +1,6 @@
from mmengine.config import read_base
with read_base():
from .lveval_multifieldqa_zh_mixup_gen_0fbdad import (
LVEval_multifieldqa_zh_mixup_datasets,
) # noqa: F401, F403

View File

@ -0,0 +1,59 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import (
LVEvalOPTF1Evaluator,
LVEvalmultifieldqazhDataset,
)
LVEval_multifieldqa_zh_mixup_reader_cfg = dict(
input_columns=["context", "input"],
output_column="answers",
train_split="test",
test_split="test",
)
LVEval_multifieldqa_zh_mixup_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(
role="HUMAN",
prompt="请阅读以下文章并用中文回答问题,问题和答案只与其中一篇文章有关。只需要直接给出问题的答案,不要输出其他任何解释和证据。\n\n文章:{context}\n\n请基于上面的文章回答下面的问题,问题和答案只与其中一篇文章有关。只需要直接给出问题的答案,不要输出其他任何解释和证据。\n\n问题:{input}\n回答:",
),
],
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer, max_out_len=64),
)
LVEval_multifieldqa_zh_mixup_eval_cfg = dict(
evaluator=dict(type=LVEvalOPTF1Evaluator, language="zh"), pred_role="BOT"
)
DATASET_LENGTH_LEVEL = ["16k", "32k", "64k", "128k", "256k"]
def get_dataset_names(dataset_name, length_levels):
datasets = []
for length in length_levels:
datasets.append(f"{dataset_name}_{length}")
return datasets
LVEval_multifieldqa_zh_mixup_datasets = [
dict(
type=LVEvalmultifieldqazhDataset,
abbr="LVEval_" + name_len,
path="Infinigence/LVEval",
name=name_len,
reader_cfg=LVEval_multifieldqa_zh_mixup_reader_cfg,
infer_cfg=LVEval_multifieldqa_zh_mixup_infer_cfg,
eval_cfg=LVEval_multifieldqa_zh_mixup_eval_cfg,
)
for name_len in get_dataset_names(
"multifieldqa_zh_mixup", DATASET_LENGTH_LEVEL
)
]

View File

@ -0,0 +1,16 @@
from mmengine.config import read_base
with read_base():
from .datasets.lveval.lveval import LVEval_datasets as datasets
from .models.bluelm.hf_bluelm_7b_chat_32k import models
from .summarizers.lveval import summarizer
models[0][
"path"
] = "/path/to/your/huggingface_models/BlueLM-7B-Chat-32K"
models[0][
"tokenizer_path"
] = "/path/to/your/huggingface_models/BlueLM-7B-Chat-32K"
models[0]["max_seq_len"] = 32768
models[0]["generation_kwargs"] = dict(do_sample=False)
models[0]["mode"] = "mid" # truncate in the middle

View File

@ -0,0 +1,16 @@
from mmengine.config import read_base
with read_base():
from .datasets.lveval.lveval import LVEval_datasets as datasets
from .models.hf_llama.hf_llama2_7b_chat import models
from .summarizers.lveval import summarizer
models[0][
"path"
] = "/path/to/your/huggingface_models/Llama-2-7b-chat-hf"
models[0][
"tokenizer_path"
] = "/path/to/your/huggingface_models/Llama-2-7b-chat-hf"
models[0]["max_seq_len"] = 4096
models[0]["generation_kwargs"] = dict(do_sample=False)
models[0]["mode"] = "mid" # truncate in the middle

View File

@ -0,0 +1,110 @@
len_levels = ["16k", "32k", "64k", "128k", "256k"]
subsets_lveval_loogle_SD_mixup = [
"LVEval_loogle_SD_mixup" + "_" + len_level for len_level in len_levels
]
subsets_lveval_cmrc_mixup = [
"LVEval_cmrc_mixup" + "_" + len_level for len_level in len_levels
]
subsets_lveval_multifieldqa_en_mixup = [
"LVEval_multifieldqa_en_mixup" + "_" + len_level
for len_level in len_levels
]
subsets_lveval_multifieldqa_zh_mixup = [
"LVEval_multifieldqa_zh_mixup" + "_" + len_level
for len_level in len_levels
]
subsets_lveval_dureader_mixup = [
"LVEval_dureader_mixup" + "_" + len_level for len_level in len_levels
]
subsets_lveval_loogle_CR_mixup = [
"LVEval_loogle_CR_mixup" + "_" + len_level for len_level in len_levels
]
subsets_lveval_loogle_MIR_mixup = [
"LVEval_loogle_MIR_mixup" + "_" + len_level for len_level in len_levels
]
subsets_lveval_hotpotwikiqa_mixup = [
"LVEval_hotpotwikiqa_mixup" + "_" + len_level for len_level in len_levels
]
subsets_lveval_lic_mixup = [
"LVEval_lic_mixup" + "_" + len_level for len_level in len_levels
]
subsets_lveval_factrecall_en = [
"LVEval_factrecall_en" + "_" + len_level for len_level in len_levels
]
subsets_lveval_factrecall_zh = [
"LVEval_factrecall_zh" + "_" + len_level for len_level in len_levels
]
subsets_lveval_single_hop_qa = (
subsets_lveval_loogle_SD_mixup + subsets_lveval_cmrc_mixup
)
subsets_lveval_single_hop_cqa = (
subsets_lveval_multifieldqa_en_mixup + subsets_lveval_multifieldqa_zh_mixup
)
subsets_lveval_multi_hop_qa = (
subsets_lveval_dureader_mixup
+ subsets_lveval_loogle_CR_mixup
+ subsets_lveval_loogle_MIR_mixup
)
subsets_lveval_multi_hop_cqa = (
subsets_lveval_hotpotwikiqa_mixup + subsets_lveval_lic_mixup
)
subsets_lveval_factrecall_cqa = (
subsets_lveval_factrecall_en + subsets_lveval_factrecall_zh
)
subsets_lveval_qa = (
subsets_lveval_single_hop_qa
+ subsets_lveval_single_hop_cqa
+ subsets_lveval_multi_hop_qa
+ subsets_lveval_multi_hop_cqa
+ subsets_lveval_factrecall_cqa
)
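# Summary groups consumed by the LVEval summarizer: per-dataset, per-task, and overall QA averages over the per-length subsets above.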
lveval_summary_groups = [
{
"name": "LVEval_loogle_SD_mixup",
"subsets": subsets_lveval_loogle_SD_mixup,
},
{"name": "LVEval_cmrc_mixup", "subsets": subsets_lveval_cmrc_mixup},
{
"name": "LVEval_multifieldqa_en_mixup",
"subsets": subsets_lveval_multifieldqa_en_mixup,
},
{
"name": "LVEval_multifieldqa_zh_mixup",
"subsets": subsets_lveval_multifieldqa_zh_mixup,
},
{
"name": "LVEval_dureader_mixup",
"subsets": subsets_lveval_dureader_mixup,
},
{
"name": "LVEval_loogle_CR_mixup",
"subsets": subsets_lveval_loogle_CR_mixup,
},
{
"name": "LVEval_loogle_MIR_mixup",
"subsets": subsets_lveval_loogle_MIR_mixup,
},
{
"name": "LVEval_hotpotwikiqa_mixup",
"subsets": subsets_lveval_hotpotwikiqa_mixup,
},
{"name": "LVEval_lic_mixup", "subsets": subsets_lveval_lic_mixup},
{"name": "LVEval_factrecall_en", "subsets": subsets_lveval_factrecall_en},
{"name": "LVEval_factrecall_zh", "subsets": subsets_lveval_factrecall_zh},
{"name": "LVEval_single_hop_qa", "subsets": subsets_lveval_single_hop_qa},
{
"name": "LVEval_single_hop_cqa",
"subsets": subsets_lveval_single_hop_cqa,
},
{"name": "LVEval_multi_hop_qa", "subsets": subsets_lveval_multi_hop_qa},
{"name": "LVEval_multi_hop_cqa", "subsets": subsets_lveval_multi_hop_cqa},
{
"name": "LVEval_factrecall_cqa",
"subsets": subsets_lveval_factrecall_cqa,
},
{"name": "LVEval_qa", "subsets": subsets_lveval_qa},
]

View File

@ -0,0 +1,114 @@
from mmengine.config import read_base
with read_base():
from .groups.lveval import lveval_summary_groups
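# The dashed strings below carry no scores of their own; they render as separator / category rows in the summary table (see the README results).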
summarizer = dict(
dataset_abbrs=[
"----------------------------------------",
"--------- LVEval All ---------", # category
"----------------------------------------",
"LVEval_qa",
"----------------------------------------",
"--------- LVEval Tasks All ---------", # category
"----------------------------------------",
"LVEval_single_hop_qa",
"LVEval_single_hop_cqa",
"LVEval_multi_hop_qa",
"LVEval_multi_hop_cqa",
"LVEval_factrecall_cqa",
"----------------------------------------",
"--------- LVEval Datasets All ---------", # category
"----------------------------------------",
"LVEval_loogle_SD_mixup",
"LVEval_cmrc_mixup",
"LVEval_multifieldqa_en_mixup",
"LVEval_multifieldqa_zh_mixup",
"LVEval_dureader_mixup",
"LVEval_loogle_CR_mixup",
"LVEval_loogle_MIR_mixup",
"LVEval_hotpotwikiqa_mixup",
"LVEval_lic_mixup",
"LVEval_factrecall_en",
"LVEval_factrecall_zh",
"----------------------------------------",
"--------- LVEval Single_Hop QA ---------", # category
"----------------------------------------",
"LVEval_loogle_SD_mixup_16k",
"LVEval_loogle_SD_mixup_32k",
"LVEval_loogle_SD_mixup_64k",
"LVEval_loogle_SD_mixup_128k",
"LVEval_loogle_SD_mixup_256k",
"----------------------------------------",
"LVEval_cmrc_mixup_16k",
"LVEval_cmrc_mixup_32k",
"LVEval_cmrc_mixup_64k",
"LVEval_cmrc_mixup_128k",
"LVEval_cmrc_mixup_256k",
"----------------------------------------",
"--------- LVEval Single_Hop CQA ---------", # category
"----------------------------------------",
"LVEval_multifieldqa_en_mixup_16k",
"LVEval_multifieldqa_en_mixup_32k",
"LVEval_multifieldqa_en_mixup_64k",
"LVEval_multifieldqa_en_mixup_128k",
"LVEval_multifieldqa_en_mixup_256k",
"----------------------------------------",
"LVEval_multifieldqa_zh_mixup_16k",
"LVEval_multifieldqa_zh_mixup_32k",
"LVEval_multifieldqa_zh_mixup_64k",
"LVEval_multifieldqa_zh_mixup_128k",
"LVEval_multifieldqa_zh_mixup_256k",
"----------------------------------------",
"--------- LVEval Multi_Hop QA ---------", # category
"----------------------------------------",
"LVEval_dureader_mixup_16k",
"LVEval_dureader_mixup_32k",
"LVEval_dureader_mixup_64k",
"LVEval_dureader_mixup_128k",
"LVEval_dureader_mixup_256k",
"----------------------------------------",
"LVEval_loogle_CR_mixup_16k",
"LVEval_loogle_CR_mixup_32k",
"LVEval_loogle_CR_mixup_64k",
"LVEval_loogle_CR_mixup_128k",
"LVEval_loogle_CR_mixup_256k",
"----------------------------------------",
"LVEval_loogle_MIR_mixup_16k",
"LVEval_loogle_MIR_mixup_32k",
"LVEval_loogle_MIR_mixup_64k",
"LVEval_loogle_MIR_mixup_128k",
"LVEval_loogle_MIR_mixup_256k",
"----------------------------------------",
"--------- LVEval Multi_Hop CQA ---------", # category
"----------------------------------------",
"LVEval_hotpotwikiqa_mixup_16k",
"LVEval_hotpotwikiqa_mixup_32k",
"LVEval_hotpotwikiqa_mixup_64k",
"LVEval_hotpotwikiqa_mixup_128k",
"LVEval_hotpotwikiqa_mixup_256k",
"----------------------------------------",
"LVEval_lic_mixup_16k",
"LVEval_lic_mixup_32k",
"LVEval_lic_mixup_64k",
"LVEval_lic_mixup_128k",
"LVEval_lic_mixup_256k",
"----------------------------------------",
"--------- LVEval Factrecall CQA ---------", # category
"----------------------------------------",
"LVEval_factrecall_en_16k",
"LVEval_factrecall_en_32k",
"LVEval_factrecall_en_64k",
"LVEval_factrecall_en_128k",
"LVEval_factrecall_en_256k",
"----------------------------------------",
"LVEval_factrecall_zh_16k",
"LVEval_factrecall_zh_32k",
"LVEval_factrecall_zh_64k",
"LVEval_factrecall_zh_128k",
"LVEval_factrecall_zh_256k",
],
summary_groups=sum(
[v for k, v in locals().items() if k.endswith("_summary_groups")], []
),
)

View File

@ -58,6 +58,7 @@ from .lawbench import * # noqa: F401, F403
from .lcsts import * # noqa: F401, F403
from .leval import * # noqa: F401, F403
from .longbench import * # noqa: F401, F403
from .lveval import * # noqa: F401, F403
from .mastermath2024v1 import * # noqa: F401, F403
from .math import * # noqa: F401, F403
from .math401 import * # noqa: F401, F403

View File

@ -0,0 +1,14 @@
from .evaluators import LVEvalF1Evaluator # noqa: F401, F403
from .evaluators import LVEvalOPTF1Evaluator # noqa: F401, F403
from .evaluators import LVEvalOPTRougeEvaluator # noqa: F401, F403
from .lveval_cmrc_mixup import * # noqa: F401, F403
from .lveval_dureader_mixup import * # noqa: F401, F403
from .lveval_factrecall_en import * # noqa: F401, F403
from .lveval_factrecall_zh import * # noqa: F401, F403
from .lveval_hotpotwikiqa_mixup import * # noqa: F401, F403
from .lveval_lic_mixup import * # noqa: F401, F403
from .lveval_loogle_CR_mixup import * # noqa: F401, F403
from .lveval_loogle_MIR_mixup import * # noqa: F401, F403
from .lveval_loogle_SD_mixup import * # noqa: F401, F403
from .lveval_multifieldqa_en_mixup import * # noqa: F401, F403
from .lveval_multifieldqa_zh_mixup import * # noqa: F401, F403

View File

@ -0,0 +1,409 @@
"""Functions for computing metrics.
Part of the following code is modified from https://github.com/THUDM/LongBench
"""
import re
import string
from collections import Counter
from typing import List
import jieba
from rouge import Rouge
from opencompass.openicl.icl_evaluator import BaseEvaluator
from opencompass.registry import ICL_EVALUATORS
ABANDON_WORDS_EN = [
'and',
'to',
'of',
'in',
'her',
'was',
'with',
'for',
'it',
'from',
'is',
'that',
'his',
'he',
'by',
'she',
'they',
'or',
'at',
'because',
'be',
'on',
'are',
'their',
'what',
'as',
'had',
'were',
'about',
'being',
'this',
'who',
'but',
'have',
'has',
'when',
'which',
'does',
]
ABANDON_WORDS_ZH = [
'',
'',
'',
'',
'',
'',
'可以',
'',
'',
'',
'',
'',
'一种',
'',
'c',
'',
'',
'',
'进行',
'',
'',
'',
'×',
'根据',
'',
'',
'',
'',
'',
'',
'',
'',
'',
'主要',
'以及',
'通过',
'首先',
'',
'然后',
'',
'',
'',
'',
'',
'包括',
'',
'',
'',
'',
'',
'方面',
'因素',
'位于',
'',
'',
'',
'一定',
'用于',
'',
'使用',
'',
'具有',
'',
'亿元',
'万元',
'',
'',
'基于',
'',
'',
'',
'',
'其他',
'',
'或者',
'变得',
'',
'',
'',
'使',
'',
'',
'已经',
'',
'',
]
def normalize_answer(s):
"""Lower text and remove punctuation, articles and extra whitespace."""
def remove_articles(text):
return re.sub(r'\b(a|an|the)\b', ' ', text)
def white_space_fix(text):
return ' '.join(text.split())
def remove_punc(text):
exclude = set(string.punctuation)
return ''.join(ch for ch in text if ch not in exclude)
def lower(text):
return text.lower()
return white_space_fix(remove_articles(remove_punc(lower(s))))
def normalize_zh_answer(s):
"""Lower text and remove punctuation, extra whitespace."""
def white_space_fix(text):
return ''.join(text.split())
def remove_punc(text):
cn_punctuation = '!?。。"#$%&'()*+,-/:;<=>@[\]^_`{|}~⦅⦆「」、、〃》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—''‛""„‟…‧﹏.'
all_punctuation = set(string.punctuation + cn_punctuation)
return ''.join(ch for ch in text if ch not in all_punctuation)
def lower(text):
return text.lower()
return white_space_fix(remove_punc(lower(s)))
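# Token-level F1 against the reference answer (used by the factrecall_en/zh datasets).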
@ICL_EVALUATORS.register_module()
class LVEvalF1Evaluator(BaseEvaluator):
def __init__(self, language: str = 'en') -> None:
super().__init__()
assert language in ['en', 'zh']
self.language = language
def score(self, predictions: List, references: List) -> dict:
def f1_score(prediction, reference, **kwargs):
common = Counter(prediction) & Counter(reference)
num_same = sum(common.values())
if num_same == 0:
return 0
precision = 1.0 * num_same / len(prediction)
recall = 1.0 * num_same / len(reference)
f1 = (2 * precision * recall) / (precision + recall)
return f1
score = 0.0
for i in range(len(predictions)):
prediction = predictions[i]
reference_list = references[i]
task_score = 0.0
for reference in reference_list:
if self.language == 'en':
normalized_prediction = normalize_answer(prediction)
normalized_reference = normalize_answer(reference)
prediction_tokens = normalized_prediction.split()
reference_tokens = normalized_reference.split()
else:
prediction_tokens = list(
jieba.cut(prediction, cut_all=False))
reference_tokens = list(jieba.cut(reference,
cut_all=False))
prediction_tokens = [
normalize_zh_answer(token)
for token in prediction_tokens
]
reference_tokens = [
normalize_zh_answer(token)
for token in reference_tokens
]
prediction_tokens = [
token for token in prediction_tokens if len(token) > 0
]
reference_tokens = [
token for token in reference_tokens if len(token) > 0
]
task_score = max(task_score,
f1_score(prediction_tokens, reference_tokens))
break
score += task_score
score = score / len(predictions) * 100
return {'f1': score}
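# F1 gated by answer-keyword recall: if the prediction recalls too few blacklist-filtered answer keywords, the sample scores 0 (threshold 0.2 for English, 0.4 for Chinese).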
@ICL_EVALUATORS.register_module()
class LVEvalOPTF1Evaluator(BaseEvaluator):
def __init__(self, language: str = 'en') -> None:
super().__init__()
assert language in ['en', 'zh']
self.language = language
def score(self, predictions: List, references: List) -> dict:
def f1_score(prediction, reference, **kwargs):
common = Counter(prediction) & Counter(reference)
num_same = sum(common.values())
if num_same == 0:
return 0
precision = 1.0 * num_same / len(prediction)
recall = 1.0 * num_same / len(reference)
f1 = (2 * precision * recall) / (precision + recall)
return f1
score = 0.0
for i in range(len(predictions)):
prediction = predictions[i]
reference_list = references[i]
answer_keyword = reference_list[-1]
task_score = 0.0
for reference in reference_list:
if self.language == 'en':
normalized_prediction = normalize_answer(prediction)
normalized_reference = normalize_answer(reference)
prediction_tokens = normalized_prediction.split()
reference_tokens = normalized_reference.split()
# answer keywords recall
if answer_keyword:
answer_keyword_tokens = normalize_answer(
answer_keyword)
answer_keyword_tokens = answer_keyword_tokens.split()
common = Counter(prediction_tokens) & Counter(
answer_keyword_tokens)
filtered_common = {
key: value
for key, value in common.items()
if key not in ABANDON_WORDS_EN
}
num_same = sum(filtered_common.values())
recall = 1.0 * num_same / len(answer_keyword_tokens)
if recall < 0.2:
break
else:
prediction_tokens = list(
jieba.cut(prediction, cut_all=False))
reference_tokens = list(jieba.cut(reference,
cut_all=False))
prediction_tokens = [
normalize_zh_answer(token)
for token in prediction_tokens
]
reference_tokens = [
normalize_zh_answer(token)
for token in reference_tokens
]
prediction_tokens = [
token for token in prediction_tokens if len(token) > 0
]
reference_tokens = [
token for token in reference_tokens if len(token) > 0
]
if not answer_keyword:
answer_keyword = reference
if answer_keyword:
answer_keyword_tokens = list(
jieba.cut(answer_keyword, cut_all=False))
answer_keyword_tokens = [
normalize_zh_answer(token)
for token in answer_keyword_tokens
]
answer_keyword_tokens = [
token for token in answer_keyword_tokens
if len(token) > 0
]
common = Counter(prediction_tokens) & Counter(
answer_keyword_tokens)
filtered_common = {
key: value
for key, value in common.items()
if key not in ABANDON_WORDS_ZH
}
num_same = sum(filtered_common.values())
recall = 1.0 * num_same / len(answer_keyword_tokens)
if recall < 0.4:
break
task_score = max(task_score,
f1_score(prediction_tokens, reference_tokens))
break
score += task_score
score = score / len(predictions) * 100
return {'LVEval_f1': score}
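# ROUGE-L F1 over blacklist-filtered tokens (used by the dureader_mixup datasets).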
@ICL_EVALUATORS.register_module()
class LVEvalOPTRougeEvaluator(BaseEvaluator):
def __init__(self, language: str = 'en') -> None:
super().__init__()
assert language in ['en', 'zh']
self.language = language
def score(self, predictions: List, references: List) -> dict:
score = 0.0
for i in range(len(predictions)):
prediction = predictions[i]
reference_list = references[i]
task_score = 0.0
for reference in reference_list:
if self.language == 'zh':
word_blacklist = ABANDON_WORDS_ZH
prediction_tokens = list(
jieba.cut(prediction, cut_all=False))
reference_tokens = list(jieba.cut(reference,
cut_all=False))
prediction_tokens = [
normalize_zh_answer(token)
for token in prediction_tokens
]
reference_tokens = [
normalize_zh_answer(token)
for token in reference_tokens
]
else:
word_blacklist = ABANDON_WORDS_EN
prediction_tokens = normalize_answer(prediction)
reference_tokens = normalize_answer(reference)
prediction_tokens = prediction_tokens.split()
reference_tokens = reference_tokens.split()
filtered_prediction_tokens = [
i for i in prediction_tokens if i not in word_blacklist
]
filtered_reference_tokens = [
i for i in reference_tokens if i not in word_blacklist
]
prediction = ' '.join(filtered_prediction_tokens)
reference = ' '.join(filtered_reference_tokens)
rouge = Rouge()
try:
cur_score = rouge.get_scores([prediction], [reference],
avg=True)['rouge-l']['f']
except Exception:
cur_score = 0.0
task_score = max(task_score, cur_score)
break
score += task_score
score = score / len(predictions) * 100
return {'LVEval_rouge': score}

View File

@ -0,0 +1,28 @@
from datasets import Dataset, load_dataset
from opencompass.registry import LOAD_DATASET
from ..base import BaseDataset
@LOAD_DATASET.register_module()
class LVEvalcmrcDataset(BaseDataset):
@staticmethod
def load(**kwargs):
dataset = load_dataset(**kwargs)
split = 'test'
raw_data = []
for i in range(len(dataset[split])):
question = dataset[split]['input'][i]
context = dataset[split]['context'][i]
answers = dataset[split]['answers'][i]
confusing_facts = dataset[split]['confusing_facts'][i]
raw_data.append({
'input': question,
'context': context,
'answers': answers,
'confusing_facts': confusing_facts,
})
dataset[split] = Dataset.from_list(raw_data)
return dataset

View File

@ -0,0 +1,26 @@
from datasets import Dataset, load_dataset
from opencompass.registry import LOAD_DATASET
from ..base import BaseDataset
@LOAD_DATASET.register_module()
class LVEvaldureaderDataset(BaseDataset):
@staticmethod
def load(**kwargs):
dataset = load_dataset(**kwargs)
split = 'test'
raw_data = []
for i in range(len(dataset[split])):
question = dataset[split]['input'][i]
context = dataset[split]['context'][i]
answers = dataset[split]['answers'][i]
raw_data.append({
'input': question,
'context': context,
'answers': answers,
})
dataset[split] = Dataset.from_list(raw_data)
return dataset

View File

@ -0,0 +1,28 @@
from datasets import Dataset, load_dataset
from opencompass.registry import LOAD_DATASET
from ..base import BaseDataset
@LOAD_DATASET.register_module()
class LVEvalfactrecallenDataset(BaseDataset):
@staticmethod
def load(**kwargs):
dataset = load_dataset(**kwargs)
split = 'test'
raw_data = []
for i in range(len(dataset[split])):
question = dataset[split]['input'][i]
context = dataset[split]['context'][i]
answers = dataset[split]['answers'][i]
confusing_facts = dataset[split]['confusing_facts'][i]
raw_data.append({
'input': question,
'context': context,
'answers': answers,
'confusing_facts': confusing_facts,
})
dataset[split] = Dataset.from_list(raw_data)
return dataset

View File

@ -0,0 +1,28 @@
from datasets import Dataset, load_dataset
from opencompass.registry import LOAD_DATASET
from ..base import BaseDataset
@LOAD_DATASET.register_module()
class LVEvalfactrecallzhDataset(BaseDataset):
@staticmethod
def load(**kwargs):
dataset = load_dataset(**kwargs)
split = 'test'
raw_data = []
for i in range(len(dataset[split])):
question = dataset[split]['input'][i]
context = dataset[split]['context'][i]
answers = dataset[split]['answers'][i]
confusing_facts = dataset[split]['confusing_facts'][i]
raw_data.append({
'input': question,
'context': context,
'answers': answers,
'confusing_facts': confusing_facts,
})
dataset[split] = Dataset.from_list(raw_data)
return dataset

View File

@ -0,0 +1,31 @@
from datasets import Dataset, load_dataset
from opencompass.registry import LOAD_DATASET
from ..base import BaseDataset
@LOAD_DATASET.register_module()
class LVEvalhotpotwikiqaDataset(BaseDataset):
@staticmethod
def load(**kwargs):
dataset = load_dataset(**kwargs)
split = 'test'
raw_data = []
for i in range(len(dataset[split])):
question = dataset[split]['input'][i]
context = dataset[split]['context'][i]
answers = dataset[split]['answers'][i]
confusing_facts = dataset[split]['confusing_facts'][i]
answer_keywords = dataset[split]['answer_keywords'][i]
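# Append the answer keywords as the last reference entry; LVEvalOPTF1Evaluator reads them back via reference_list[-1].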
answers_with_ak = answers + [answer_keywords]
raw_data.append({
'input': question,
'context': context,
'answers': answers_with_ak,
'confusing_facts': confusing_facts,
'answer_keywords': answer_keywords,
})
dataset[split] = Dataset.from_list(raw_data)
return dataset

View File

@ -0,0 +1,31 @@
from datasets import Dataset, load_dataset
from opencompass.registry import LOAD_DATASET
from ..base import BaseDataset
@LOAD_DATASET.register_module()
class LVEvallicDataset(BaseDataset):
@staticmethod
def load(**kwargs):
dataset = load_dataset(**kwargs)
split = 'test'
raw_data = []
for i in range(len(dataset[split])):
question = dataset[split]['input'][i]
context = dataset[split]['context'][i]
answers = dataset[split]['answers'][i]
confusing_facts = dataset[split]['confusing_facts'][i]
answer_keywords = dataset[split]['answer_keywords'][i]
answers_with_ak = answers + [answer_keywords]
raw_data.append({
'input': question,
'context': context,
'answers': answers_with_ak,
'confusing_facts': confusing_facts,
'answer_keywords': answer_keywords,
})
dataset[split] = Dataset.from_list(raw_data)
return dataset

View File

@ -0,0 +1,29 @@
from datasets import Dataset, load_dataset
from opencompass.registry import LOAD_DATASET
from ..base import BaseDataset
@LOAD_DATASET.register_module()
class LVEvallooglecrDataset(BaseDataset):
@staticmethod
def load(**kwargs):
dataset = load_dataset(**kwargs)
split = 'test'
raw_data = []
for i in range(len(dataset[split])):
question = dataset[split]['input'][i]
context = dataset[split]['context'][i]
answers = dataset[split]['answers'][i]
answer_keywords = dataset[split]['answer_keywords'][i]
answers_with_ak = answers + [answer_keywords]
raw_data.append({
'input': question,
'context': context,
'answers': answers_with_ak,
'answer_keywords': answer_keywords,
})
dataset[split] = Dataset.from_list(raw_data)
return dataset

View File

@ -0,0 +1,29 @@
from datasets import Dataset, load_dataset
from opencompass.registry import LOAD_DATASET
from ..base import BaseDataset
@LOAD_DATASET.register_module()
class LVEvallooglemirDataset(BaseDataset):
@staticmethod
def load(**kwargs):
dataset = load_dataset(**kwargs)
split = 'test'
raw_data = []
for i in range(len(dataset[split])):
question = dataset[split]['input'][i]
context = dataset[split]['context'][i]
answers = dataset[split]['answers'][i]
answer_keywords = dataset[split]['answer_keywords'][i]
answers_with_ak = answers + [answer_keywords]
raw_data.append({
'input': question,
'context': context,
'answers': answers_with_ak,
'answer_keywords': answer_keywords,
})
dataset[split] = Dataset.from_list(raw_data)
return dataset

View File

@ -0,0 +1,29 @@
from datasets import Dataset, load_dataset
from opencompass.registry import LOAD_DATASET
from ..base import BaseDataset
@LOAD_DATASET.register_module()
class LVEvallooglesdDataset(BaseDataset):
@staticmethod
def load(**kwargs):
dataset = load_dataset(**kwargs)
split = 'test'
raw_data = []
for i in range(len(dataset[split])):
question = dataset[split]['input'][i]
context = dataset[split]['context'][i]
answers = dataset[split]['answers'][i]
answer_keywords = dataset[split]['answer_keywords'][i]
answers_with_ak = answers + [answer_keywords]
raw_data.append({
'input': question,
'context': context,
'answers': answers_with_ak,
'answer_keywords': answer_keywords,
})
dataset[split] = Dataset.from_list(raw_data)
return dataset

View File

@ -0,0 +1,31 @@
from datasets import Dataset, load_dataset
from opencompass.registry import LOAD_DATASET
from ..base import BaseDataset
@LOAD_DATASET.register_module()
class LVEvalmultifieldqaenDataset(BaseDataset):
@staticmethod
def load(**kwargs):
dataset = load_dataset(**kwargs)
split = 'test'
raw_data = []
for i in range(len(dataset[split])):
question = dataset[split]['input'][i]
context = dataset[split]['context'][i]
answers = dataset[split]['answers'][i]
confusing_facts = dataset[split]['confusing_facts'][i]
answer_keywords = dataset[split]['answer_keywords'][i]
answers_with_ak = answers + [answer_keywords]
raw_data.append({
'input': question,
'context': context,
'answers': answers_with_ak,
'confusing_facts': confusing_facts,
'answer_keywords': answer_keywords,
})
dataset[split] = Dataset.from_list(raw_data)
return dataset

View File

@ -0,0 +1,31 @@
from datasets import Dataset, load_dataset
from opencompass.registry import LOAD_DATASET
from ..base import BaseDataset
@LOAD_DATASET.register_module()
class LVEvalmultifieldqazhDataset(BaseDataset):
@staticmethod
def load(**kwargs):
dataset = load_dataset(**kwargs)
split = 'test'
raw_data = []
for i in range(len(dataset[split])):
question = dataset[split]['input'][i]
context = dataset[split]['context'][i]
answers = dataset[split]['answers'][i]
confusing_facts = dataset[split]['confusing_facts'][i]
answer_keywords = dataset[split]['answer_keywords'][i]
answers_with_ak = answers + [answer_keywords]
raw_data.append({
'input': question,
'context': context,
'answers': answers_with_ak,
'confusing_facts': confusing_facts,
'answer_keywords': answer_keywords,
})
dataset[split] = Dataset.from_list(raw_data)
return dataset