mirror of
https://github.com/open-compass/opencompass.git
synced 2025-05-30 16:03:24 +08:00
[Fix] Fix some sc errors (#177)
* Update sc

* Update sc doc

* Apply suggestions from code review

Co-authored-by: Hubert <42952108+yingfhu@users.noreply.github.com>

---------

Co-authored-by: liuhongwei <liuhongwei@pjlab.org.cn>
Co-authored-by: Hubert <42952108+yingfhu@users.noreply.github.com>
parent 2931f3dcb8
commit ed248af136
@@ -52,6 +52,7 @@ We always welcome *PRs* and *Issues* for the betterment of OpenCompass.

   :caption: Prompt

   prompt/few_shot.md
   prompt/chain_of_thought.md
   prompt/prompt_template.md
   prompt/meta_template.md
@@ -49,13 +49,14 @@ Question: {question}\nLet's think step by step:\n{answer}

## 3. Self-Consistency
The SC (Self-Consistency) method was proposed in [this paper](https://arxiv.org/abs/2203.11171). It samples multiple reasoning paths for a question and takes a majority vote over the answers the LLM generates. This method shows remarkable accuracy on reasoning tasks, but may consume more time and resources during inference because of the majority-voting strategy. In OpenCompass, you can implement the SC method by replacing `GenInferencer` with `SCInferencer` in the dataset configuration and setting the corresponding parameters, for example:
```python
# This SC gsm8k config can be found at: opencompass.configs.datasets.gsm8k.gsm8k_gen_a3e34a.py
gsm8k_infer_cfg = dict(
    inferencer=dict(
        type=SCInferencer,  # Replace GenInferencer with SCInferencer.
        generation_kwargs=dict(do_sample=True, temperature=0.7, top_k=40),  # Set sampling parameters to make sure the model generates diverse outputs; currently only effective for models loaded from HuggingFace.
        infer_type='SC',
        sc_size=SAMPLE_SIZE
    )
)
gsm8k_eval_cfg = dict(sc_size=SAMPLE_SIZE)
```
```{note}
OpenCompass defaults to using argmax to sample the next token. Therefore, if the sampling parameters are not specified, the model's inference results will be exactly the same each time, and multiple rounds of evaluation will be ineffective.
```
Here `SAMPLE_SIZE` is the number of reasoning paths in Self-Consistency; a higher value usually yields higher performance. The following figure from the original SC paper shows the relation between the number of reasoning paths and performance on several reasoning tasks:

As the figure shows, performance on these reasoning tasks tends to improve as the number of reasoning paths increases. For some tasks, however, the gains saturate, and adding further paths brings little additional improvement. It is therefore worth experimenting on the specific task to find the number of reasoning paths that suits it best.
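Stripped to its essentials, self-consistency is "sample N answers, return the mode". A minimal, library-free sketch of that voting step (the function name and return shape here are illustrative, not OpenCompass APIs):

```python
from collections import Counter

def self_consistency_vote(sampled_answers):
    """Return the most common answer among sampled reasoning paths,
    along with the fraction of paths that agreed on it."""
    counter = Counter(sampled_answers)
    answer, votes = counter.most_common(1)[0]
    return answer, votes / len(sampled_answers)

# Five sampled reasoning paths for one question; three agree on "18".
paths = ["18", "20", "18", "18", "26"]
answer, agreement = self_consistency_vote(paths)
print(answer, agreement)  # -> 18 0.6
```

The agreement ratio is a useful side product: low agreement often signals a question where more reasoning paths (a larger `SAMPLE_SIZE`) would still help.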
@@ -44,6 +44,7 @@ OpenCompass getting-started roadmap

   :caption: Prompt

   prompt/few_shot.md
   prompt/chain_of_thought.md
   prompt/prompt_template.md
   prompt/meta_template.md
@@ -49,13 +49,14 @@ Question: {question}\nLet's think step by step:\n{answer}

## 3. Self-Consistency
The SC (Self-Consistency) method was proposed in [this paper](https://arxiv.org/abs/2203.11171). It generates multiple different reasoning paths for a question and takes a majority vote over the generated answers. The method shows remarkable ability on complex reasoning tasks, but may consume considerable time and resources because it needs to run inference several times to sample multiple reasoning chains. In OpenCompass, you can implement the SC method by replacing `GenInferencer` with `SCInferencer` in the dataset configuration and setting the corresponding parameters, for example:
```python
# This SC version of the gsm8k config can be found at: opencompass.configs.datasets.gsm8k.gsm8k_gen_a3e34a.py
gsm8k_infer_cfg = dict(
    inferencer=dict(
        type=SCInferencer,  # Replace GenInferencer with SCInferencer.
        generation_kwargs=dict(do_sample=True, temperature=0.7, top_k=40),  # Set sampling parameters to ensure the model generates diverse outputs; currently only effective for models loaded from HuggingFace.
        infer_type='SC',
        sc_size=SAMPLE_SIZE
    )
)
gsm8k_eval_cfg = dict(sc_size=SAMPLE_SIZE)
```
```{note}
Note that OpenCompass defaults to using argmax to sample the next token. Therefore, if the sampling parameters are not specified, the model's inference results will be exactly the same each time, and multiple rounds of evaluation will be ineffective.
```
Here `SAMPLE_SIZE` is the number of reasoning paths; a higher value usually brings higher performance. The original SC paper shows the relation between the number of reasoning paths and performance across different reasoning tasks:

As the figure shows, performance on different reasoning tasks tends to grow as the number of reasoning paths increases. For some tasks, however, the gains may reach a limit, and adding more paths brings little further improvement. It is therefore necessary to experiment on the specific task to find the number of reasoning paths that suits it best.
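The diversity that `do_sample=True`, `temperature`, and `top_k` provide is what makes the vote meaningful: with greedy (argmax) decoding, every sampled path would be identical, as the note above warns. A small, self-contained sketch of temperature scaling over a toy next-token logit vector (illustrative only, not OpenCompass or HuggingFace code):

```python
import math
import random

def sample_with_temperature(logits, temperature=0.7, rng=random):
    """Softmax over temperature-scaled logits, then sample one token index."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

# With temperature > 0, repeated calls can return different indices;
# argmax decoding would always pick index 2 (the largest logit).
logits = [1.0, 2.0, 3.0]
samples = {sample_with_temperature(logits) for _ in range(100)}
```

Lower temperatures concentrate probability mass on the argmax token; as temperature approaches zero, sampling degenerates into greedy decoding and self-consistency loses its benefit.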
@@ -135,7 +135,6 @@ class SCInferencer(BaseInferencer):
                sc_results.append(results)
            sc_prediction = list(map(list, zip(*sc_results)))
            generated = sc_prediction

            # 5-3. Save current output
            for prompt, prediction in zip(parsed_entries, generated):
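The `list(map(list, zip(*sc_results)))` line above transposes a num_rounds-by-num_questions list of result lists into per-question answer lists, which is the shape the downstream voting step expects. A standalone illustration with toy data:

```python
# Three sampling rounds over two questions.
sc_results = [
    ["18", "7"],   # round 1 answers
    ["18", "9"],   # round 2 answers
    ["20", "7"],   # round 3 answers
]
# Transpose: group all sampled answers for each question together.
sc_prediction = list(map(list, zip(*sc_results)))
print(sc_prediction)  # -> [['18', '18', '20'], ['7', '9', '7']]
```

The inner `map(list, ...)` converts the tuples produced by `zip` back into lists, keeping the prediction structure uniform.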
@@ -133,12 +133,17 @@ class OpenICLEvalTask(BaseTask):
                proc = TEXT_POSTPROCESSORS.get(
                    self.eval_cfg['pred_postprocessor']['type'])
                if sc_size is not None:
                    pred_strs = [[proc(s) for s in preds]
                                 for preds in pred_strs]
                else:
                    pred_strs = [proc(s) for s in pred_strs]

            # Get majority voting predictions if use self-consistency
            if sc_size is not None:
                pred_strs = [
                    Counter(s).most_common(1)[0][0] for s in pred_strs
                ]

            icl_evaluator = ICL_EVALUATORS.build(self.eval_cfg['evaluator'])
            result = icl_evaluator.score(
                predictions=pred_strs, references=test_set[self.output_column])
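The evaluation flow first post-processes every sampled prediction and only then takes the majority, so superficially different strings (`"18"`, `" 18 "`) are counted as the same answer when voting. A standalone sketch of that ordering with a toy postprocessor (`str.strip` stands in for a registered postprocessor; this is not the OpenCompass implementation itself):

```python
from collections import Counter

def postprocess_then_vote(pred_strs, proc):
    # Step 1: post-process each sampled prediction per question...
    cleaned = [[proc(s) for s in preds] for preds in pred_strs]
    # Step 2: ...then take the majority vote over the cleaned answers.
    return [Counter(preds).most_common(1)[0][0] for preds in cleaned]

proc = str.strip  # toy stand-in for a TEXT_POSTPROCESSORS entry
preds = [["18", " 18 ", "20"], [" 7", "9", "7 "]]
print(postprocess_then_vote(preds, proc))  # -> ['18', '7']
```

Voting before post-processing would split the `"18"` votes across `"18"` and `" 18 "`, which is exactly the kind of error this commit's reordering avoids.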
@@ -186,14 +191,6 @@ class OpenICLEvalTask(BaseTask):

        return s[start:end]


def parse_args():
    parser = argparse.ArgumentParser(description='Score Calculator')