judge="Please act as an impartial judge and follow these instructions: In the following conversations, only the response of the 'assistant' in the last round of conversations is the output of the large language model (AI assistant) that needs to be evaluated. Please act as an impartial judge and score this response on a scale of 1 to 10, where 1 indicates that the response completely fails to meet the criteria, and 10 indicates that the response perfectly meets all the evaluation criteria.\
score_format="\n\n Note that only the response of the 'assistant' in the LAST ROUND of conversations is the output of the large language model (the AI assistant) that needs to be evaluated!! You must provide your explanation. After providing your explanation, please show the score by strictly following this format: 'Rating: [[score]]', for example 'Rating: [[6]]'. The DIALOGUE that needs to be judged is in this format: \n *** \n DIALOGUE \n ***"
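# A minimal sketch, not part of the original prompts: one way a judge reply that
# follows the 'Rating: [[score]]' format requested above might be parsed.
# The helper name `extract_rating` and the choice to keep the LAST match are
# assumptions made for illustration only.
import re

# Matches e.g. "Rating: [[6]]" and captures the digits inside the double brackets.
_RATING_RE = re.compile(r"Rating:\s*\[\[(\d+)\]\]")

def extract_rating(judge_reply):
    """Return the integer score (1-10) found in a judge reply, or None if absent."""
    matches = _RATING_RE.findall(judge_reply)
    if not matches:
        return None
    # The prompt asks for the explanation first and the score after it,
    # so the last occurrence is taken as the final rating (assumption).
    score = int(matches[-1])
    # Scores outside the 1-10 scale described in the prompt are treated as invalid.
    return score if 1 <= score <= 10 else None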
eval_CM="The capacity of a large language model to recall and utilize previously mentioned information from earlier in the conversation is a critical indicator of its conversational memory abilities. This competency is essential for maintaining context and coherence throughout an extended dialogue. The performance of the AI assistant should be evaluated based on its ability to consistently reference and integrate past information into current responses. The evaluation criteria are as follows:\n\
2. Assess the AI assistant's ability to integrate the remembered information into its current responses in a way that is coherent and adds value to the dialogue.\n\
3. Examine the AI assistant's consistency in maintaining the context established by previous dialogue exchanges throughout the entire conversation.\n\
4. Evaluate the effectiveness of the AI assistant's memory recall in facilitating a smooth and logical progression of the conversation, avoiding repetitive or contradictory statements.\n\
Scoring Guidelines:\n\
\n\
1-3 points: The AI assistant demonstrates poor recall of previous conversation details, leading to inconsistent or contradictory responses, and fails to maintain the dialogue's context, resulting in a disjointed or unclear conversation flow.\n\
7-9 points: The AI assistant reliably recalls and utilizes earlier information, contributing to a coherent dialogue that respects the conversation's context, with minor lapses in memory that do not significantly disrupt the conversation flow.\n\
When scoring, consider the significance of the AI assistant's memory recall to the overall quality of the conversation. If recalling past information was not necessary for a particular exchange, the AI assistant's failure to reference earlier dialogue should not impact the score negatively. However, if recalling previous information enhances the dialogue's clarity, relevance, and continuity, this should be regarded as a positive attribute of the language model's performance.\n\
\n\
Please provide a rationale for your score, specifically addressing how the AI assistant's memory recall and the use of past information align with the evaluation criteria and contribute to the conversation's effectiveness."
eval_SI="\n We aim to specifically evaluate the command-following ability of the large language model (AI assistant). The criteria for evaluation are as follows:\
Additionally, please provide a brief justification for the score given, particularly highlighting how the AI assistant's response aligns with or deviates from the above criteria. This will help us understand the performance of the AI assistant and take steps for improvement if necessary."
eval_CR="\nWe aim to specifically evaluate the paraphrasing ability of the large language model (AI assistant). The criteria for evaluation are as follows:\n\
\n \
1. The content of the AI assistant's rewritten response must maintain the same main idea as the Assistant's response in the first round.\n \
eval_FR="\nWe aim to specifically evaluate the paraphrasing ability of the large language model (AI assistant). The criteria for evaluation are as follows:\n\
\n \
1. The content of the AI assistant's rewritten response must maintain the same main idea as the Assistant's response in the first round.\n \
- 1-3 points: The AI assistant's response is largely influenced by previous interactions, fails to address the current question accurately, or provides false information.\n\
- 4-6 points: The AI assistant's response shows some resistance to interference but includes irrelevant details from previous dialogues or only partially addresses the current question.\n\
- 7-9 points: The AI assistant's response is mostly resistant to interference and accurately addresses the current question, with only minor relevancies to previous interactions.\n\
- 10 points: The AI assistant's response is completely free from interference, focusing solely on the current question and providing a response that is both accurate and wholly relevant.\
\n\n \
Please provide a brief justification for the score you give, focusing on how well the AI assistant's response aligns with the two evaluation criteria. "
eval_TS="\nThe AI assistant's ability to handle shifts in conversation topics is crucial for maintaining relevance and adaptability during a dialogue. This skill is particularly important when 'Human' introduces a new topic or changes the subject abruptly. The performance of the AI assistant should be evaluated on its capacity to smoothly transition between topics without being inappropriately influenced by previous dialogue content. The evaluation criteria are as follows:\n\
2. Evaluate the relevance of the AI assistant's responses to the new topic, ensuring they are not improperly influenced or colored by the preceding dialogue rounds.\n\
3. Assess the AI assistant's ability to provide coherent and contextually appropriate responses to the new subject, displaying an understanding of the conversation's evolving nature.\n \
4. Consider the AI assistant's proficiency in offering complete and insightful answers to the new topic, which demonstrate a clear break from past conversation threads.\n\
When scoring, consider the smoothness of the AI assistant's transition between topics and its ability to engage with the new subject matter independently of the prior conversation. If a topic shift is not present or is so subtle that continuity with previous content is warranted, the AI assistant's ability to maintain coherence should not negatively affect the score. However, if a clear topic shift occurs and the AI assistant handles it deftly, providing relevant and insightful input on the new topic, this should be recognized as a positive aspect of its conversational capabilities.\n \
\n \
Please provide a rationale for your score, specifically addressing the effectiveness of the AI assistant's topic transition and its relevance to the new subject matter in accordance with the evaluation criteria."
eval_AR="The AI assistant's understanding of references is essential for maintaining a coherent dialogue. The following criteria should be used to evaluate its performance:\n\
\n \
1. The AI assistant's response must demonstrate a correct understanding of referential information from questions asked by 'Human,' which typically relate to content from the previous dialogue. Ideally, the AI should explicitly acknowledge or clarify these references in its reply.\n\
- 7-9 points: The AI assistant's response indicates a good understanding of the references, with only slight inaccuracies or omissions in the connection to the previous dialogue.\n\
In addition to the score, please provide an explanation that specifically addresses how the AI assistant's response demonstrates its ability or inability to understand and use referential information in accordance with the criteria above. "
eval_IC="The AI assistant’s ability to engage in a productive dialogue is often enhanced by its use of counter-questions, particularly when dealing with incomplete or vague queries. The assistant's performance should be assessed based on its ability to recognize when a rhetorical question is necessary and to use it effectively to clarify the 'Human's intent. The evaluation criteria are as follows:\n\
- 4-6 points: The AI assistant recognizes situations requiring rhetorical questions but uses them suboptimally, only partially addressing the query's deficiencies. Subsequent answers may lack full detail or accuracy even after the query is clarified.\n\
When scoring, consider whether the use of a counter-question was essential and whether the AI assistant's decision to use or not use one improved the clarity and outcome of the dialogue. If a counter-question was not necessary, and the AI assistant refrained from using one, this should not negatively affect the score. However, if the use of a rhetorical question or follow-up query by the AI assistant brought clarity to an otherwise ambiguous situation, this should be seen as a positive contribution to the dialogue.\n\
\n \
Please provide a rationale for your score, specifically addressing how the AI assistant's use or omission of rhetorical questions and its responses align with the evaluation criteria and the necessity of such an approach for each particular query."
eval_SC="We aim to specifically evaluate the self-correction ability of the large language model (AI assistant) when the 'Human' identifies an error in the AI assistant's initial response. The criteria for evaluation are as follows:\n\
In addition to scoring, please provide a justification for your assessment, focusing on how effectively the AI assistant's reaction to 'Human's' identified error demonstrates its ability to self-correct and address the criteria outlined above."
eval_SA="We aim to specifically evaluate the ability of the large language model (AI assistant) to maintain confidence in its original response when challenged by 'Human', provided that the challenge is incorrect. The criteria for evaluation are as follows:\n\
\n\
1. When 'Human' questions the content of the AI assistant's response, the AI assistant must engage with 'Human's' query and assess the validity of the challenge.\n\
- 1-3 points: The AI assistant either does not engage with 'Human's' challenge appropriately or changes its original response without proper justification, even when 'Human's' challenge is incorrect.\n\
In addition to scoring, please provide a justification for your assessment, focusing on how the AI assistant's reaction to the challenge reflects its understanding and confidence in its original response, and how well it meets the criteria outlined above."
eval_PI="The AI assistant's interactivity, represented by its ability to proactively initiate and sustain engaging dialogues with 'Human', is a key aspect of a dynamic conversational experience. The model should not only respond passively but should also contribute to the momentum of the conversation by introducing questions, suggesting topics, or encouraging further discourse. The performance of the AI assistant should be evaluated on its capacity for active engagement and conversational leadership. The evaluation criteria are as follows:\n\
\n\
1. Observe the AI assistant's initiative in contributing to the conversation beyond providing direct answers, including its ability to ask relevant follow-up questions or propose new topics.\n\
2. Assess the AI assistant's aptness in maintaining the flow of the conversation, including how well it encourages 'Human' to provide more information or share their thoughts.\n\
3. Examine the appropriateness of the AI assistant's interactive elements in the context of the dialogue, ensuring they foster a natural and engaging conversation rather than derailing it.\n\
4. Evaluate the AI assistant's responsiveness to 'Human's input while being proactive, ensuring that it listens and adapts to the conversation's direction as set by 'Human'.\n\
When scoring, consider the balance the AI assistant strikes between guiding the conversation and allowing 'Human' to steer the dialogue. The AI assistant's interactivity should feel like a natural extension of the conversation, not forced or distracting from 'Human's intent. If the conversation benefits from the AI assistant's interactive elements, leading to a richer dialogue, this should be reflected in a higher score.\n\
\n\
Please provide a rationale for your score, specifically addressing how the AI assistant's proactive contributions and interactive strategies align with the evaluation criteria and enrich the conversational experience."
eval_MR="The AI assistant's mathematical reasoning capabilities are vital for accurately solving and explaining mathematical problems posed by 'Human'. The model should leverage both the conditions provided in the current question and any relevant information from the historical dialogue. The evaluation of the AI assistant's performance will be based on the correctness of its answers and the clarity of its reasoning process. The evaluation criteria are as follows:\n\
\n\
1. Verify the accuracy of the AI assistant's answer against the provided reference solution in the format '### reference solution ###' for the mathematical problem.\n\
2. Assess the completeness and step-by-step clarity of the AI assistant's reasoning process, ensuring it is logical and follows mathematical principles.\n\
3. Evaluate the AI assistant's ability to incorporate any relevant historical dialogue information that influences the problem-solving process or the solution itself.\n\
4. Appraise the AI assistant's communication of the solution in a manner that is understandable and instructive to 'Human', potentially aiding their learning or comprehension.\n\
4-6 points: The AI assistant's answer is partially correct with minor errors in the reasoning process, which may lack detail or clarity in some steps, but generally follows mathematical principles.\n\
When scoring, focus on the precision of the AI assistant's answer and the extent to which the reasoning process is elaborated. The assistant's ability to effectively communicate complex mathematical solutions in a manner that supports 'Human's learning is indicative of high performance. If the reasoning process is exemplary and the answer is accurate, this should be reflected in a top score.\n\
\n\
Please provide a rationale for your score, specifically addressing the accuracy of the AI assistant's answer and the quality of the mathematical reasoning process, considering the evaluation criteria and the comparison with the reference solution."
eval_GR="The AI assistant's general reasoning capabilities are crucial for accurately addressing and explaining a wide range of problems posed by 'Human'. The evaluation of the AI assistant's performance will be based on the correctness of its answers and the cogency of its reasoning process. The evaluation criteria are as follows:\n\
\n\
1. Verify the accuracy of the AI assistant's answer against the provided reference solution in the format '### reference solution ###' for the specific problem.\n\
2. Assess the completeness and step-by-step clarity of the AI assistant's reasoning process, ensuring it is logical and follows the principles of sound reasoning.\n\
3. Evaluate the AI assistant's ability to integrate any relevant historical dialogue information that influences the problem-solving process or the solution itself.\n\
4. Appraise the AI assistant's communication of the solution in a manner that is understandable and instructive to 'Human', potentially aiding their learning or comprehension.\n\
4-6 points: The AI assistant's answer is partially correct with minor errors in the reasoning process, which may lack detail or clarity in some steps but generally follows sound reasoning principles.\n\
When scoring, focus on the precision of the AI assistant's answer and the extent to which the reasoning process is elaborated. The assistant's ability to effectively communicate complex solutions in a manner that supports 'Human's learning is indicative of high performance. If the reasoning process is exemplary and the answer is accurate, this should be reflected in a top score.\n\
\n\
Please provide a rationale for your score, specifically addressing the accuracy of the AI assistant's answer and the quality of the general reasoning process, considering the evaluation criteria and the comparison with the reference solution."