[Docs] Remove --no-batch-padding and Use --hf-num-gpus (#1205)

* [Docs] Remove --no-batch-padding and Use --hf-num-gpus

* update
This commit is contained in:
Fengzhe Zhou 2024-05-29 16:30:10 +08:00 committed by GitHub
parent 808582d952
commit d656e818f8
8 changed files with 25 additions and 54 deletions

View File

@ -37,9 +37,9 @@ This is a complex issue that needs to be explained from both the supply and dema
The supply side refers to how many tasks are being run. A task is a combination of a model and a dataset, and it primarily depends on how many models and datasets need to be tested. Additionally, since OpenCompass splits a larger task into multiple smaller tasks, the number of data entries per sub-task (`--max-partition-size`) also affects the number of tasks. (The `--max-partition-size` is proportional to the actual number of data entries, but the relationship is not 1:1).
The demand side refers to how many workers are running. Since OpenCompass instantiates multiple models for inference simultaneously, we use `--num-gpus` to specify how many GPUs each instance uses. Note that `--num-gpus` is a parameter specific to HuggingFace models and setting this parameter for non-HuggingFace models will not have any effect. We also use `--max-num-workers` to indicate the maximum number of instances running at the same time. Lastly, due to issues like GPU memory and insufficient load, OpenCompass also supports running multiple instances on the same GPU, which is managed by the parameter `--max-num-workers-per-gpu`. Therefore, it can be generally assumed that we will use a total of `--num-gpus` * `--max-num-workers` / `--max-num-workers-per-gpu` GPUs.
The demand side refers to how many workers are running. Since OpenCompass instantiates multiple models for inference simultaneously, we use `--hf-num-gpus` to specify how many GPUs each instance uses. Note that `--hf-num-gpus` is a parameter specific to HuggingFace models and setting this parameter for non-HuggingFace models will not have any effect. We also use `--max-num-workers` to indicate the maximum number of instances running at the same time. Lastly, due to issues like GPU memory and insufficient load, OpenCompass also supports running multiple instances on the same GPU, which is managed by the parameter `--max-num-workers-per-gpu`. Therefore, it can be generally assumed that we will use a total of `--hf-num-gpus` * `--max-num-workers` / `--max-num-workers-per-gpu` GPUs.
In summary, when tasks run slowly or the GPU load is low, we first need to check if the supply is sufficient. If not, consider reducing `--max-partition-size` to split the tasks into finer parts. Next, we need to check if the demand is sufficient. If not, consider increasing `--max-num-workers` and `--max-num-workers-per-gpu`. Generally, **we set `--num-gpus` to the minimum value that meets the demand and do not adjust it further.**
In summary, when tasks run slowly or the GPU load is low, we first need to check if the supply is sufficient. If not, consider reducing `--max-partition-size` to split the tasks into finer parts. Next, we need to check if the demand is sufficient. If not, consider increasing `--max-num-workers` and `--max-num-workers-per-gpu`. Generally, **we set `--hf-num-gpus` to the minimum value that meets the demand and do not adjust it further.**
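As a concrete illustration of the sizing formula above (a sketch only; the model path and the numbers are placeholders, and the flags are the ones discussed in this answer), the following run would occupy roughly `2 * 8 / 1 = 16` GPUs:
```bash
# Each HuggingFace instance needs 2 GPUs, at most 8 instances run concurrently,
# and each GPU hosts at most 1 worker, so about 2 * 8 / 1 = 16 GPUs are used.
python run.py --datasets siqa_gen winograd_ppl \
    --hf-type base --hf-path /path/to/model \
    --hf-num-gpus 2 \
    --max-num-workers 8 \
    --max-num-workers-per-gpu 1
```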
### How do I control the number of GPUs that OpenCompass occupies?
@ -114,17 +114,8 @@ Hence, if users find that the number of tasks greatly exceeds the available GPUs
### How to use the downloaded huggingface models?
If you have already downloaded the checkpoints of the model, you can specify the local path of the model and tokenizer, and add `trust_remote_code=True` to `--model-kwargs` and `--tokenizer-kwargs`. For example
If you have already downloaded the checkpoints of the model, you can specify the local path of the model. For example
```bash
python run.py --datasets siqa_gen winograd_ppl \
--hf-path /path/to/model \ # HuggingFace model path
--tokenizer-path /path/to/model \ # HuggingFace tokenizer path
--model-kwargs device_map='auto' trust_remote_code=True \ # Arguments for constructing the model
--tokenizer-kwargs padding_side='left' truncation='left' use_fast=False trust_remote_code=True \ # Arguments for constructing the tokenizer
--max-out-len 100 \ # Maximum number of tokens to generate
--max-seq-len 2048 \ # Maximum sequence length the model can accept
--batch-size 8 \ # Batch size
--no-batch-padding \ # Disable batch padding and infer through a for loop to avoid accuracy loss
--num-gpus 1 # Number of GPUs required
python run.py --datasets siqa_gen winograd_ppl --hf-type base --hf-path /path/to/model
```
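If the local checkpoint is a chat (instruction-tuned) model rather than a base model, the same short form should work with a different model type. This is a sketch that assumes the CLI also accepts `--hf-type chat`, which is not shown in this diff:
```bash
# Hypothetical: point --hf-path at the local directory of a chat model
python run.py --datasets siqa_gen winograd_ppl --hf-type chat --hf-path /path/to/chat/model
```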

View File

@ -87,7 +87,7 @@ python run.py --datasets siqa_gen winograd_ppl \
Note that with this approach, OpenCompass evaluates only one model at a time, while the other approaches can evaluate multiple models at once.
```{caution}
`--num-gpus` does not stand for the actual number of GPUs to use in evaluation, but the minimum required number of GPUs for this model. [More](faq.md#how-does-opencompass-allocate-gpus)
`--hf-num-gpus` does not stand for the actual number of GPUs to use in evaluation, but the minimum required number of GPUs for this model. [More](faq.md#how-does-opencompass-allocate-gpus)
```
:::{dropdown} More detailed example
@ -103,7 +103,7 @@ python run.py --datasets siqa_gen winograd_ppl \
--max-out-len 100 \ # Maximum number of tokens to generate
--min-out-len 100 \ # Minimum number of tokens to generate
--batch-size 64 \ # Batch size
--num-gpus 1 # Number of GPUs required to run the model
--hf-num-gpus 1 # Number of GPUs required to run the model
```
```{seealso}
For all HuggingFace related parameters supported by `run.py`, please read [Launching Evaluation Task](../user_guides/experimentation.md#launching-an-evaluation-task).

View File

@ -25,15 +25,7 @@ Task Configuration (`$EXP`):
- For HuggingFace-related models, users can also quickly define a model on the command line via HuggingFace parameters and then specify datasets using `--datasets DATASET1 DATASET2 ...`.
```bash
python run.py --datasets siqa_gen winograd_ppl \
--hf-path huggyllama/llama-7b \ # HuggingFace model path
--model-kwargs device_map='auto' \ # Parameters for constructing the model
--tokenizer-kwargs padding_side='left' truncation='left' use_fast=False \ # Parameters for constructing the tokenizer
--max-out-len 100 \ # Maximum number of tokens to generate
--max-seq-len 2048 \ # Maximum sequence length the model can accept
--batch-size 8 \ # Batch size
--no-batch-padding \ # Disable batch padding and infer through a for loop to avoid accuracy loss
--num-gpus 1 # Minimum number of GPUs required for this model
python run.py --datasets siqa_gen winograd_ppl --hf-type base --hf-path huggyllama/llama-7b
```
Complete HuggingFace parameter descriptions (a combined usage sketch follows this list):
@ -45,9 +37,8 @@ Task Configuration (`$EXP`):
- `--tokenizer-kwargs`: Parameters for constructing the tokenizer
- `--max-out-len`: Maximum generated token count
- `--max-seq-len`: Maximum sequence length the model can accept
- `--no-batch-padding`: Disable batch padding and infer through a for loop to avoid accuracy loss
- `--batch-size`: Batch size
- `--num-gpus`: Number of GPUs required to run the model. Please note that this parameter is only used to determine the number of GPUs required to run the model, and does not affect the actual number of GPUs used for the task. Refer to [Efficient Evaluation](./evaluation.md) for more details.
- `--hf-num-gpus`: Number of GPUs required to run the model. Please note that this parameter is only used to determine the number of GPUs required to run the model, and does not affect the actual number of GPUs used for the task. Refer to [Efficient Evaluation](./evaluation.md) for more details.
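Putting several of the parameters above together, an invocation might look like the sketch below. The values are placeholders, and flags such as `--max-seq-len` or `--tokenizer-kwargs` can be omitted when the model's defaults are adequate:
```bash
python run.py --datasets siqa_gen winograd_ppl \
    --hf-type base \
    --hf-path huggyllama/llama-7b \
    --tokenizer-kwargs padding_side='left' truncation='left' \
    --max-out-len 100 \
    --max-seq-len 2048 \
    --batch-size 8 \
    --hf-num-gpus 1
```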
Starting Methods:

View File

@ -37,9 +37,9 @@ OpenCompass handles evaluation requests in units called tasks. Each task
The supply side refers to how many tasks are being run. A task is a combination of a model and a dataset, and it primarily depends on how many models and datasets need to be tested. In addition, since OpenCompass splits a larger task into multiple smaller tasks, the number of data entries per sub-task (`--max-partition-size`) also affects the number of tasks. (`--max-partition-size` is proportional to the actual number of data entries, but the relationship is not 1:1.)
The demand side refers to how many workers are running. Since OpenCompass instantiates multiple models for inference at the same time, we use `--num-gpus` to specify how many GPUs each instance uses. Note that `--num-gpus` is a parameter specific to HuggingFace models, and setting it for non-HuggingFace models has no effect. We also use `--max-num-workers` to indicate the maximum number of instances running at the same time. Finally, due to issues such as GPU memory and insufficient load, OpenCompass also supports running multiple instances on the same GPU, controlled by `--max-num-workers-per-gpu`. It can therefore be roughly assumed that we use a total of `--num-gpus` * `--max-num-workers` / `--max-num-workers-per-gpu` GPUs.
The demand side refers to how many workers are running. Since OpenCompass instantiates multiple models for inference at the same time, we use `--hf-num-gpus` to specify how many GPUs each instance uses. Note that `--hf-num-gpus` is a parameter specific to HuggingFace models, and setting it for non-HuggingFace models has no effect. We also use `--max-num-workers` to indicate the maximum number of instances running at the same time. Finally, due to issues such as GPU memory and insufficient load, OpenCompass also supports running multiple instances on the same GPU, controlled by `--max-num-workers-per-gpu`. It can therefore be roughly assumed that we use a total of `--hf-num-gpus` * `--max-num-workers` / `--max-num-workers-per-gpu` GPUs.
In summary, when tasks run slowly or the GPU load is low, we first need to check whether the supply is sufficient; if not, consider reducing `--max-partition-size` to split the tasks more finely. Next, we need to check whether the demand is sufficient; if not, consider increasing `--max-num-workers` and `--max-num-workers-per-gpu`. Generally, **we set `--num-gpus` to the minimum value that meets the demand and do not adjust it further**.
In summary, when tasks run slowly or the GPU load is low, we first need to check whether the supply is sufficient; if not, consider reducing `--max-partition-size` to split the tasks more finely. Next, we need to check whether the demand is sufficient; if not, consider increasing `--max-num-workers` and `--max-num-workers-per-gpu`. Generally, **we set `--hf-num-gpus` to the minimum value that meets the demand and do not adjust it further**.
### How do I control the number of GPUs that OpenCompass occupies?
@ -114,17 +114,8 @@ Each task in OpenCompass represents a specific model and dataset portion waiting to be evaluated
### How to use locally downloaded Huggingface models?
If you have already downloaded the Huggingface model files in advance, please specify the model path manually and add `trust_remote_code=True` to `--model-kwargs` and `--tokenizer-kwargs`. For example:
If you have already downloaded the Huggingface model files in advance, please specify the model path manually. For example:
```bash
python run.py --datasets siqa_gen winograd_ppl \
--hf-path /path/to/model \ # HuggingFace model path
--tokenizer-path /path/to/model \ # HuggingFace tokenizer path
--model-kwargs device_map='auto' trust_remote_code=True \ # Arguments for constructing the model
--tokenizer-kwargs padding_side='left' truncation='left' use_fast=False trust_remote_code=True \ # Arguments for constructing the tokenizer
--max-out-len 100 \ # Maximum number of tokens to generate
--max-seq-len 2048 \ # Maximum sequence length the model can accept
--batch-size 8 \ # Batch size
--no-batch-padding \ # Disable batch padding and infer through a for loop to avoid accuracy loss
--num-gpus 1 # Number of GPUs required
python run.py --datasets siqa_gen winograd_ppl --hf-type base --hf-path /path/to/model
```

View File

@ -86,7 +86,7 @@ python run.py --datasets siqa_gen winograd_ppl \
Please note that with this approach, OpenCompass evaluates only one model at a time, while the other approaches can evaluate multiple models at once.
```{caution}
`--num-gpus` does not stand for the actual number of GPUs used in evaluation, but the minimum number of GPUs required for this model. [More](faq.md#opencompass-如何分配-gpu)
`--hf-num-gpus` does not stand for the actual number of GPUs used in evaluation, but the minimum number of GPUs required for this model. [More](faq.md#opencompass-如何分配-gpu)
```
@ -104,7 +104,7 @@ python run.py --datasets siqa_gen winograd_ppl \
--max-out-len 100 \ # Maximum number of tokens to generate
--min-out-len 100 \ # Minimum number of tokens to generate
--batch-size 64 \ # Batch size
--num-gpus 1 # Number of GPUs required to run the model
--hf-num-gpus 1 # Number of GPUs required to run the model
```
```{seealso}
For all HuggingFace-related parameters supported by `run.py`, please read [Launching Evaluation Task](../user_guides/experimentation.md#评测任务发起).

View File

@ -25,15 +25,7 @@ python run.py $EXP {--slurm | --dlc | None} [-p PARTITION] [-q QUOTATYPE] [--deb
- For HuggingFace-related models, users can also quickly define a model on the command line via HuggingFace parameters and then specify datasets using `--datasets DATASET1 DATASET2 ...`.
```bash
python run.py --datasets siqa_gen winograd_ppl \
--hf-path huggyllama/llama-7b \ # HuggingFace model path
--model-kwargs device_map='auto' \ # Arguments for constructing the model
--tokenizer-kwargs padding_side='left' truncation='left' use_fast=False \ # Arguments for constructing the tokenizer
--max-out-len 100 \ # Maximum number of tokens to generate
--max-seq-len 2048 \ # Maximum sequence length the model can accept
--batch-size 8 \ # Batch size
--no-batch-padding \ # Disable batch padding and infer through a for loop to avoid accuracy loss
--num-gpus 1 # Number of GPUs required
python run.py --datasets siqa_gen winograd_ppl --hf-type base --hf-path huggyllama/llama-7b
```
Complete HuggingFace parameter descriptions:
@ -45,9 +37,8 @@ python run.py $EXP {--slurm | --dlc | None} [-p PARTITION] [-q QUOTATYPE] [--deb
- `--tokenizer-kwargs`: Arguments for constructing the tokenizer
- `--max-out-len`: Maximum number of tokens to generate
- `--max-seq-len`: Maximum sequence length the model can accept
- `--no-batch-padding`: Disable batch padding and infer through a for loop to avoid accuracy loss
- `--batch-size`: Batch size
- `--num-gpus`: Number of GPUs required to run the model
- `--hf-num-gpus`: Number of GPUs required to run the model
Starting Methods:

View File

@ -186,7 +186,8 @@ def parse_hf_args(hf_parser):
    hf_parser.add_argument('--max-out-len', type=int, default=256, help='The max output length for the HuggingFace model')
    hf_parser.add_argument('--min-out-len', type=int, default=1, help='The min output length for the HuggingFace model')
    hf_parser.add_argument('--batch-size', type=int, default=8, help='The batch size for the HuggingFace model')
    hf_parser.add_argument('--num-gpus', type=int, default=1, help='The number of GPUs for **the HuggingFace model passed via cli**')
    hf_parser.add_argument('--num-gpus', type=int, default=None, help='Deprecated, please use --hf-num-gpus instead')
    hf_parser.add_argument('--hf-num-gpus', type=int, default=1, help='The number of GPUs for the HuggingFace model passed via cli')
    hf_parser.add_argument('--pad-token-id', type=int, help='The pad token id for the HuggingFace model')
    hf_parser.add_argument('--stop-words', nargs='+', default=[], help='The stop words for the HuggingFace model')
@ -205,6 +206,12 @@ def parse_custom_dataset_args(custom_dataset_parser):
def main():
    args = parse_args()
    if args.num_gpus is not None:
        raise ValueError('The `--num-gpus` argument is deprecated, please use '
                         '`--hf-num-gpus` to describe number of gpus used for '
                         'the HuggingFace model instead.')
    if args.dry_run:
        args.debug = True
    # initialize logger
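On the command line, the effect of this change is that the old flag now fails fast while the renamed flag is accepted; the dataset and model path below are placeholders:
```bash
# Old flag: rejected with "The `--num-gpus` argument is deprecated, please use `--hf-num-gpus` ..."
python run.py --datasets siqa_gen --hf-type base --hf-path /path/to/model --num-gpus 1

# Renamed flag: accepted
python run.py --datasets siqa_gen --hf-type base --hf-path /path/to/model --hf-num-gpus 1
```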

View File

@ -151,7 +151,7 @@ def get_config_from_arg(args) -> Config:
batch_size=args.batch_size,
pad_token_id=args.pad_token_id,
stop_words=args.stop_words,
run_cfg=dict(num_gpus=args.num_gpus))
run_cfg=dict(num_gpus=args.hf_num_gpus))
logger.debug(f'Using model: {model}')
models.append(model)
# set infer accelerator if needed