Fofo
Introduction
This paper presents FoFo, a pioneering benchmark for evaluating large language models' (LLMs) ability to follow complex, domain-specific formats, a crucial yet underexamined capability for their application as AI agents. Despite LLMs' advancements, existing benchmarks fail to assess their format-following proficiency adequately. FoFo fills this gap with a diverse range of real-world formats and instructions, developed through an AI-Human collaborative method. Our evaluation across both open-source (e.g., Llama 2, WizardLM) and closed-source (e.g., GPT-4, PALM2, Gemini) LLMs highlights three key findings: open-source models significantly lag behind closed-source ones in format adherence; LLMs' format-following performance is independent of their content generation quality; and LLMs' format proficiency varies across different domains. These insights suggest the need for specialized tuning for format-following skills and highlight FoFo's role in guiding the selection of domain-specific AI agents.
Official link
https://github.com/SalesforceAIResearch/FoFo/tree/main
Paper
https://arxiv.org/abs/2402.18667
Examples
Input example I:
Create a detailed medical diagnostic report in JSON format for a hypothetical patient based on the following clinical scenario and laboratory results.

**Clinical Scenario:**
- Patient Identifier: 12345X
- Gender: Female
- Age: 40 years
- Presenting Complaint: Acute onset of sharp, right lower quadrant abdominal pain that began approximately 6 hours ago
- Past Medical History: Hypertension, well-controlled on medication; no known allergies; nonsmoker; nulliparous
- Recent Labs: Slight leukocytosis, normal hemoglobin, elevated C-reactive protein
- Imaging: Ultrasound indicates a thickened wall of the appendix with peri-appendiceal fluid collection
- Surgery: The patient underwent an emergency laparoscopic appendectomy
- Pathology Report: Confirmed acute appendicitis with peri-appendiceal abscess formation, no malignancy noted

Using the sample data provided above, generate a JSON object with the following structure:

1. Basic Patient Information:
   - PatientID
   - Gender
   - Age

2. Encounter details:
   - Presentation
   - Symptoms
   - DurationOfSymptoms
   - PastMedicalHistory
   - RecentLaboratoryFindings (nested object with LabName as key and the value as another object containing Result and ReferenceRange)
   - ImagingResults (array of objects with ImagingType, Findings, and Conclusion)

3. Surgical Intervention:
   - SurgeryType
   - Urgency (Elective or Emergency)
   - Outcome (Success or Complication, followed by a brief description)

4. Pathology Report:
   - Diagnosis
   - Findings
   - Implications

5. Recommendations:
   - Follow up actions
   - Medications (array of objects with DrugName, Dosage, Frequency, and Duration)
   - Warnings (e.g., warning signs of complications to watch for post-operation)

You are allowed to make up any additional relevant data, such as reference ranges for the lab results or specific medication names and dosages for the follow-up recommendations.
The JSON must strictly adhere to proper JSON formatting rules, including the use of curly brackets for objects, square brackets for arrays, and double quotes for strings. Keep key names succinct and ensure they accurately reflect their corresponding data. The output should be directly usable as a comprehensive digital diagnostic report for the hypothetical patient described.
Output example I (from GPT-4):
{
  "BasicPatientInformation": {
    "PatientID": "12345X",
    "Gender": "Female",
    "Age": 40
  },
  "EncounterDetails": {
    "Presentation": "Acute onset of sharp, right lower quadrant abdominal pain",
    "Symptoms": "Sharp abdominal pain",
    "DurationOfSymptoms": "6 hours",
    "PastMedicalHistory": "Hypertension, well-controlled on medication; no known allergies; nonsmoker; nulliparous",
    "RecentLaboratoryFindings": {
      "LeukocyteCount": {
        "Result": "Slightly elevated",
        "ReferenceRange": "4.5-11.0 x 10^9/L"
      },
      "Hemoglobin": {
        "Result": "Normal",
        "ReferenceRange": "12.0-16.0 g/dL"
      },
      "CReactiveProtein": {
        "Result": "Elevated",
        "ReferenceRange": "0-5 mg/L"
      }
    },
    "ImagingResults": [
      {
        "ImagingType": "Ultrasound",
        "Findings": "Thickened wall of the appendix with peri-appendiceal fluid collection",
        "Conclusion": "Appendix inflammation indicative of acute appendicitis"
      }
    ]
  },
  "SurgicalIntervention": {
    "SurgeryType": "Laparoscopic Appendectomy",
    "Urgency": "Emergency",
    "Outcome": "Success, no post-operative complications noted"
  },
  "PathologyReport": {
    "Diagnosis": "Acute appendicitis with peri-appendiceal abscess formation",
    "Findings": "Inflammation and abscess formation in the appendix, no malignancy",
    "Implications": "Appendix successfully removed, no malignancy found"
  },
  "Recommendations": {
    "FollowUpActions": "Patient should rest and recover, and attend a follow-up appointment in 2 weeks",
    "Medications": [
      {
        "DrugName": "Ibuprofen",
        "Dosage": "400 mg",
        "Frequency": "Every 6 hours",
        "Duration": "1 week"
      },
      {
        "DrugName": "Paracetamol",
        "Dosage": "500 mg",
        "Frequency": "Every 4 hours",
        "Duration": "1 week"
      }
    ],
    "Warnings": "Contact healthcare provider if pain persists or worsens, if fever develops, or if there are any signs of infection at the surgical site"
  }
}
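For a quick local look at what "format following" means for this example, a response can be checked structurally before any scoring. The sketch below is only an illustration and is not the benchmark's scoring procedure. The `check_report` function and the `REQUIRED_KEYS` table are hypothetical names, and the section names are assumed from the GPT-4 output in Output example I rather than mandated by the prompt.

```python
import json

# Hypothetical schema: section names assumed from Output example I,
# required keys taken from the structure requested in Input example I.
REQUIRED_KEYS = {
    "BasicPatientInformation": ["PatientID", "Gender", "Age"],
    "SurgicalIntervention": ["SurgeryType", "Urgency", "Outcome"],
    "PathologyReport": ["Diagnosis", "Findings", "Implications"],
}


def check_report(text):
    """Return a list of format problems; an empty list means all checks passed."""
    try:
        report = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(report, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    for section, keys in REQUIRED_KEYS.items():
        body = report.get(section)
        if not isinstance(body, dict):
            problems.append(f"missing or non-object section: {section}")
            continue
        for key in keys:
            if key not in body:
                problems.append(f"missing key: {section}.{key}")

    # The prompt asks for Medications to be an array of objects.
    recs = report.get("Recommendations")
    meds = recs.get("Medications") if isinstance(recs, dict) else None
    if not isinstance(meds, list) or not all(isinstance(m, dict) for m in meds):
        problems.append("Recommendations.Medications is not an array of objects")
    return problems
```

Calling `check_report(response_text)` on a model response returns an empty list when the response parses as JSON and carries the assumed sections. A check like this covers syntax and key presence only; it says nothing about whether the content is medically sensible.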
Reference
@article{xia2024fofo,
title={FOFO: A Benchmark to Evaluate LLMs' Format-Following Capability},
author={Xia, Congying and Xing, Chen and Du, Jiangshu and Yang, Xinyi and Feng, Yihao and Xu, Ran and Yin, Wenpeng and Xiong, Caiming},
journal={arXiv preprint arXiv:2402.18667},
year={2024}
}