👋 Join us on Discord and WeChat
> [!IMPORTANT]
>
> **Star us** to receive all release notifications from GitHub without delay ⭐️

## 📣 OpenCompass 2.0

We are thrilled to introduce OpenCompass 2.0, an advanced suite featuring three key components: [CompassKit](https://github.com/open-compass), [CompassHub](https://hub.opencompass.org.cn/home), and [CompassRank](https://rank.opencompass.org.cn/home).

**CompassRank** has been significantly enhanced into a set of leaderboards that now incorporate both open-source and proprietary benchmarks. This upgrade allows for a more comprehensive evaluation of models across the industry.

**CompassHub** presents a pioneering benchmark browser interface, designed to simplify and speed up the exploration and use of an extensive array of benchmarks for researchers and practitioners alike. To enhance the visibility of your own benchmark within the community, we warmly invite you to contribute it to CompassHub. You can start the submission process by clicking [here](https://hub.opencompass.org.cn/dataset-submit).

**CompassKit** is a powerful collection of evaluation toolkits specifically tailored for Large Language Models and Large Vision-language Models. It provides an extensive set of tools to assess and measure the performance of these complex models effectively. You are welcome to try our toolkits in your research and products.

| Language | Knowledge | Reasoning | Examination |
| --- | --- | --- | --- |
| **Word Definition**: WiC, SummEdits<br>**Idiom Learning**: CHID<br>**Semantic Similarity**: AFQMC, BUSTM<br>**Coreference Resolution**: CLUEWSC, WSC, WinoGrande<br>**Translation**: Flores, IWSLT2017<br>**Multi-language Question Answering**: TyDi-QA, XCOPA<br>**Multi-language Summary**: XLSum | **Knowledge Question Answering**: BoolQ, CommonSenseQA, NaturalQuestions, TriviaQA | **Textual Entailment**: CMNLI, OCNLI, OCNLI_FC, AX-b, AX-g, CB, RTE, ANLI<br>**Commonsense Reasoning**: StoryCloze, COPA, ReCoRD, HellaSwag, PIQA, SIQA<br>**Mathematical Reasoning**: MATH, GSM8K<br>**Theorem Application**: TheoremQA, StrategyQA, SciBench<br>**Comprehensive Reasoning**: BBH | **Junior High, High School, University, Professional Examinations**: C-Eval, AGIEval, MMLU, GAOKAO-Bench, CMMLU, ARC, Xiezhi<br>**Medical Examinations**: CMB |

| Understanding | Long Context | Safety | Code |
| --- | --- | --- | --- |
| **Reading Comprehension**: C3, CMRC, DRCD, MultiRC, RACE, DROP, OpenBookQA, SQuAD2.0<br>**Content Summary**: CSL, LCSTS, XSum, SummScreen<br>**Content Analysis**: EPRSTMT, LAMBADA, TNEWS | **Long Context Understanding**: LEval, LongBench, GovReports, NarrativeQA, Qasper | **Safety**: CivilComments, CrowsPairs, CValues, JigsawMultilingual, TruthfulQA<br>**Robustness**: AdvGLUE | **Code**: HumanEval, HumanEvalX, MBPP, APPs, DS1000 |

| Open-source Models | API Models |
| --- | --- |
| [Alpaca](https://github.com/tatsu-lab/stanford_alpaca)<br>[Baichuan](https://github.com/baichuan-inc)<br>[BlueLM](https://github.com/vivo-ai-lab/BlueLM)<br>[ChatGLM2](https://github.com/THUDM/ChatGLM2-6B)<br>[ChatGLM3](https://github.com/THUDM/ChatGLM3-6B)<br>[Gemma](https://huggingface.co/google/gemma-7b)<br>[InternLM](https://github.com/InternLM/InternLM)<br>[LLaMA](https://github.com/facebookresearch/llama)<br>[LLaMA3](https://github.com/meta-llama/llama3)<br>[Qwen](https://github.com/QwenLM/Qwen)<br>[TigerBot](https://github.com/TigerResearch/TigerBot)<br>[Vicuna](https://github.com/lm-sys/FastChat)<br>[WizardLM](https://github.com/nlpxucan/WizardLM)<br>[Yi](https://github.com/01-ai/Yi)<br>…… | OpenAI<br>Gemini<br>Claude<br>ZhipuAI (ChatGLM)<br>Baichuan<br>ByteDance (YunQue)<br>Huawei (PanGu)<br>360<br>Baidu (ERNIEBot)<br>MiniMax (ABAB-Chat)<br>SenseTime (nova)<br>Xunfei (Spark)<br>…… |