IFEval

python3 run.py --models hf_internlm2_chat_7b --datasets IFEval_gen_3321a3 --debug

Chat Models

model	Prompt-level-strict-accuracy	Inst-level-strict-accuracy	Prompt-level-loose-accuracy	Inst-level-loose-accuracy
qwen1.5-0.5b-chat-hf	13.12	23.26	15.71	26.38
qwen1.5-1.8b-chat-hf	16.08	26.26	18.30	29.02
qwen1.5-4b-chat-hf	25.51	35.97	28.84	39.81
qwen1.5-7b-chat-hf	38.82	50.00	42.70	53.48
qwen1.5-14b-chat-hf	42.51	54.20	49.17	59.95
qwen1.5-32b-chat-hf	49.54	60.43	53.97	64.39
qwen1.5-72b-chat-hf	51.02	61.99	57.12	67.27
qwen1.5-110b-chat-hf	55.08	65.59	61.18	70.86
internlm2-chat-1.8b-hf	18.30	28.78	21.44	32.01
internlm2-chat-1.8b-sft-hf	18.67	31.18	19.78	32.85
internlm2-chat-7b-hf	34.75	46.28	40.48	51.44
internlm2-chat-7b-sft-hf	39.19	50.12	42.33	52.76
internlm2-chat-20b-hf	36.41	48.68	40.67	53.24
internlm2-chat-20b-sft-hf	44.55	55.64	46.77	58.03
llama-3-8b-instruct-hf	68.02	76.74	75.42	82.85
llama-3-70b-instruct-hf	78.00	84.65	84.29	89.21
llama-3-8b-instruct-lmdeploy	69.13	77.46	77.26	83.93
llama-3-70b-instruct-lmdeploy	75.97	82.97	83.18	88.37
mistral-7b-instruct-v0.1-hf	40.30	50.96	41.96	53.48
mistral-7b-instruct-v0.2-hf	49.17	60.43	51.94	64.03
mixtral-8x7b-instruct-v0.1-hf	50.09	60.67	55.64	65.83
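
The four columns above are percentages. As a rough guide to reading them: prompt-level accuracy counts a prompt as correct only if every instruction attached to it was followed, while instruction-level accuracy counts each instruction separately, so it is never lower than the prompt-level figure. The strict and loose variants use the same aggregation but differ in how leniently a response is checked. The sketch below illustrates only the aggregation step. The function name `ifeval_accuracies` and the input layout (one list of booleans per prompt) are assumptions made for illustration, not OpenCompass's actual evaluator interface.

```python
from typing import Dict, List


def ifeval_accuracies(results: List[List[bool]]) -> Dict[str, float]:
    """Aggregate per-instruction verdicts into two accuracies, in percent.

    results[i][j] is True if instruction j of prompt i was followed.
    This layout is a hypothetical one chosen for the example.
    """
    n_prompts = len(results)
    n_instructions = sum(len(verdicts) for verdicts in results)

    # A prompt counts only if all of its instructions were followed.
    prompt_hits = sum(all(verdicts) for verdicts in results)
    # Each instruction counts on its own.
    inst_hits = sum(sum(verdicts) for verdicts in results)

    return {
        "prompt_level": 100.0 * prompt_hits / n_prompts,
        "inst_level": 100.0 * inst_hits / n_instructions,
    }


# Made-up verdicts for four prompts (not taken from any model above).
example = [
    [True, True],   # both instructions followed
    [True, False],  # one of two followed
    [False],        # single instruction, not followed
    [True],         # single instruction, followed
]
scores = ifeval_accuracies(example)
print(scores)
```

In this toy input, two of four prompts are fully satisfied while four of six instructions are followed, which is why the instruction-level figure comes out higher. The same function would be applied once to strict verdicts and once to loose verdicts to obtain all four columns.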