# Accelerate Evaluation Inference with vLLM or LMDeploy

## Background

During the OpenCompass evaluation process, the Huggingface transformers library is used for inference by default. While this is a very general solution, there are scenarios where more efficient inference methods are needed to speed up the process, such as leveraging vLLM or LMDeploy.

- [LMDeploy](https://github.com/InternLM/lmdeploy) is a toolkit designed for compressing, deploying, and serving large language models (LLMs), developed by the [MMRazor](https://github.com/open-mmlab/mmrazor) and [MMDeploy](https://github.com/open-mmlab/mmdeploy) teams.
- [vLLM](https://github.com/vllm-project/vllm) is a fast and user-friendly library for LLM inference and serving, featuring high serving throughput, efficient PagedAttention memory management, continuous batching of requests, fast model execution via CUDA/HIP graphs, quantization techniques (e.g., GPTQ, AWQ, SqueezeLLM, FP8 KV cache), and optimized CUDA kernels.

## Preparation for Acceleration

First, check whether the model you want to evaluate supports inference acceleration with vLLM or LMDeploy, and make sure you have installed vLLM or LMDeploy following their official documentation. The installation commands are reproduced below for reference:

### LMDeploy Installation Method

Install LMDeploy using pip (Python 3.8+) or from [source](https://github.com/InternLM/lmdeploy/blob/main/docs/en/build.md):

```bash
pip install lmdeploy
```
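To quickly confirm that the installation succeeded, you can print the installed version (an optional check, not part of the official installation instructions):

```bash
python -c "import lmdeploy; print(lmdeploy.__version__)"
```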
### vLLM Installation Method

Install vLLM using pip or from [source](https://vllm.readthedocs.io/en/latest/getting_started/installation.html#build-from-source):

```bash
pip install vllm
```
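As above, you can optionally confirm the installation by printing the installed version:

```bash
python -c "import vllm; print(vllm.__version__)"
```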
## Accelerated Evaluation Using vLLM or LMDeploy

### Method 1: Using Command Line Parameters to Change the Inference Backend

OpenCompass offers one-click evaluation acceleration: during evaluation, it can automatically convert Huggingface transformers models into vLLM or LMDeploy models. Below is an example configuration for evaluating the GSM8k dataset with the default Huggingface version of the llama3-8b-instruct model:

```python
# eval_gsm8k.py
from mmengine.config import read_base

with read_base():
    # Select a dataset list
    from .datasets.gsm8k.gsm8k_0shot_gen_a58960 import gsm8k_datasets as datasets
    # Select a model of interest
    from ..models.hf_llama.hf_llama3_8b_instruct import models
```
Here, `hf_llama3_8b_instruct` specifies the original Huggingface model configuration, as shown below:

```python
from opencompass.models import HuggingFacewithChatTemplate

models = [
    dict(
        type=HuggingFacewithChatTemplate,
        abbr='llama-3-8b-instruct-hf',
        path='meta-llama/Meta-Llama-3-8B-Instruct',
        max_out_len=1024,
        batch_size=8,
        run_cfg=dict(num_gpus=1),
        stop_words=['<|end_of_text|>', '<|eot_id|>'],
    )
]
```
To evaluate the GSM8k dataset with the default Huggingface version of the llama3-8b-instruct model, run:

```bash
python run.py config/eval_gsm8k.py
```

To accelerate the evaluation with vLLM or LMDeploy instead, pass the corresponding backend to the `-a` flag:

```bash
python run.py config/eval_gsm8k.py -a vllm
```

or

```bash
python run.py config/eval_gsm8k.py -a lmdeploy
```
### Method 2: Accelerating Evaluation via a Deployed Inference Acceleration Service API

OpenCompass also supports accelerated evaluation against a deployed vLLM or LMDeploy inference service API. Follow these steps:

1. Install the openai package:

```bash
pip install openai
```

2. Deploy the inference acceleration service API for vLLM or LMDeploy. Below is an example for LMDeploy:

```bash
lmdeploy serve api_server meta-llama/Meta-Llama-3-8B-Instruct --model-name Meta-Llama-3-8B-Instruct --server-port 23333
```

The available `api_server` startup parameters can be listed with `lmdeploy serve api_server -h`, for example `--tp` for tensor parallelism, `--session-len` for the maximum context window length, and `--cache-max-entry-count` for the fraction of GPU memory used by the k/v cache.
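As an illustration, the sketch below combines these parameters into one launch command; the specific values are placeholders to adapt to your hardware, not recommendations from this guide:

```bash
# --tp: number of GPUs for tensor parallelism
# --session-len: maximum context window length in tokens
# --cache-max-entry-count: fraction of GPU memory reserved for the k/v cache
lmdeploy serve api_server meta-llama/Meta-Llama-3-8B-Instruct \
    --model-name Meta-Llama-3-8B-Instruct \
    --server-port 23333 \
    --tp 2 \
    --session-len 8192 \
    --cache-max-entry-count 0.8
```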
3. Once the service is successfully deployed, modify the evaluation script, changing the model configuration path to the service address, as shown below:

```python
from opencompass.models import OpenAISDK

api_meta_template = dict(
    round=[
        dict(role='HUMAN', api_role='HUMAN'),
        dict(role='BOT', api_role='BOT', generate=True),
    ],
    reserved_roles=[dict(role='SYSTEM', api_role='SYSTEM')],
)

models = [
    dict(
        abbr='Meta-Llama-3-8B-Instruct-LMDeploy-API',
        type=OpenAISDK,
        key='EMPTY',  # API key
        openai_api_base='http://0.0.0.0:23333/v1',  # Service address
        path='Meta-Llama-3-8B-Instruct',  # Model name used in service requests
        tokenizer_path='meta-llama/Meta-Llama-3-8B-Instruct',  # Tokenizer name or path; if set to `None`, the default `gpt-4` tokenizer is used
        rpm_verbose=True,  # Whether to print the request rate
        meta_template=api_meta_template,  # Service request template
        query_per_second=1,  # Service request rate
        max_out_len=1024,  # Maximum output length
        max_seq_len=4096,  # Maximum input length
        temperature=0.01,  # Generation temperature
        batch_size=8,  # Batch size
        retry=3,  # Number of retries
    )
]
```
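Before launching the evaluation, it can be useful to verify that the service is reachable. Both LMDeploy and vLLM expose an OpenAI-compatible API, so a request to the `/v1/models` endpoint (assuming the service address used above) should list the served model:

```bash
curl http://0.0.0.0:23333/v1/models
```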
## Acceleration Effect and Performance Comparison

Below is a comparison of accuracy and inference time when evaluating the Llama-3-8B-Instruct model on the GSM8k dataset with each inference backend on a single A800 GPU:

| Inference Backend | Accuracy | Inference Time (minutes:seconds) | Speedup (relative to Huggingface) |
| ----------------- | -------- | -------------------------------- | --------------------------------- |
| Huggingface       | 74.22    | 24:26                            | 1.0                               |
| LMDeploy          | 73.69    | 11:15                            | 2.2                               |
| vLLM              | 72.63    | 07:52                            | 3.1                               |