- `run.py` accepts a `.py` configuration file as the task-related parameter; the file must contain the `datasets` and `models` fields.
```bash
python run.py configs/eval_demo.py
```
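A minimal sketch of such a configuration file; the imported module paths are illustrative and must match dataset and model configs that exist under `configs/`:
```python
# A sketch of a run.py configuration; the two required fields are `datasets` and `models`
from mmengine.config import read_base

with read_base():
    # Pull in predefined dataset and model configs shipped with the repo (paths illustrative)
    from .datasets.siqa.siqa_gen import siqa_datasets
    from .models.hf_llama.hf_llama_7b import models as hf_llama_7b

datasets = [*siqa_datasets]
models = [*hf_llama_7b]
```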
- If no configuration file is provided, users can specify models and datasets directly on the command line via `--models MODEL1 MODEL2 ...` and `--datasets DATASET1 DATASET2 ...`, for example:
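A sketch of such an invocation; `hf_opt_350m` stands in for any model config name available under `configs/models`:
```bash
python run.py --models hf_opt_350m --datasets siqa_gen winograd_ppl
```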
- For HuggingFace models, users can also define a model quickly on the command line via HuggingFace-related parameters and then specify datasets using `--datasets DATASET1 DATASET2 ...` (each parameter is described in the list after the example):
```bash
python run.py --datasets siqa_gen winograd_ppl \
    --hf-path huggyllama/llama-7b \
    --model-kwargs device_map='auto' \
    --tokenizer-kwargs padding_side='left' truncation='left' use_fast=False \
    --max-out-len 100 \
    --max-seq-len 2048 \
    --batch-size 8 \
    --no-batch-padding \
    --num-gpus 1
```
Complete HuggingFace parameter descriptions:
- `--hf-path`: HuggingFace model path
- `--peft-path`: PEFT model path
- `--tokenizer-path`: HuggingFace tokenizer path (can be omitted if it is the same as the model path)
- `--model-kwargs`: Parameters for constructing the model
- `--tokenizer-kwargs`: Parameters for constructing the tokenizer
- `--max-out-len`: Maximum number of tokens to generate
- `--max-seq-len`: Maximum sequence length the model can accept
- `--no-batch-padding`: Disable batch padding and infer through a for loop to avoid accuracy loss
- `--batch-size`: Batch size
- `--num-gpus`: Number of GPUs required to run the model
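As an illustration, a PEFT adapter can be evaluated on top of a base model; the adapter path below is a placeholder:
```bash
# Evaluate a base model together with a PEFT (e.g. LoRA) adapter
python run.py --datasets siqa_gen \
    --hf-path huggyllama/llama-7b \
    --peft-path /path/to/your/peft/adapter \
    --batch-size 8 \
    --num-gpus 1
```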
Starting methods:
- Running on the local machine: `run.py $EXP`.
- Running with Slurm: `run.py $EXP --slurm -p $PARTITION_NAME`.
- Running with DLC: `run.py $EXP --dlc --aliyun-cfg $AliYun_Cfg`.
- Customized starting: `run.py $EXP`. Here, `$EXP` is a configuration file that includes the `eval` and `infer` fields. For detailed configurations, please refer to [Efficient Evaluation](./evaluation.md).
Other arguments:
- `-q`: Specify the Slurm quotatype (default is None); optional values are `reserved`, `auto`, and `spot`. This parameter may only be available in some Slurm variants.
- `--debug`: When enabled, inference and evaluation tasks run in single-process mode, and output is echoed in real time for debugging.
- `-m`: Running mode, default is `all`. Specify `infer` to only run inference and obtain output results; if model outputs already exist in `{WORKDIR}`, specify `eval` to only run evaluation and obtain evaluation results; if the evaluation results are ready, specify `viz` to only run visualization, which summarizes the results in tables; `all` performs a full run including inference, evaluation, and visualization.
- `-r`: Reuse existing inference results and skip finished tasks. If followed by a timestamp, the results under that timestamp in the work directory will be reused; otherwise, the latest results in the specified work directory will be reused.
- `-w`: Specify the work directory; the default is `./outputs/default`.
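For instance, the sketch below combines several of these flags in a Slurm launch; the partition name and work directory are placeholders:
```bash
# Launch via Slurm, run all stages, and reuse the latest results under a custom work dir
python run.py configs/eval_demo.py --slurm -p $PARTITION_NAME -q auto -m all -r -w ./outputs/llama
```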
Execution flow:
1. The evaluation task mainly includes three stages: inference (`infer`), evaluation (`eval`), and visualization (`viz`). After being divided by the Partitioner, tasks are handed over to the Runner for parallel execution; individual inference and evaluation tasks are abstracted into `OpenICLInferTask` and `OpenICLEvalTask` respectively.
2. After the inference and evaluation stages finish, the visualization stage reads the evaluation results in `results/` to generate a summary table.
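For example, inference and evaluation can be run as two separate passes; the work directory below is illustrative:
```bash
# First pass: inference only
python run.py configs/eval_demo.py -m infer -w ./outputs/demo
# Second pass: evaluate the finished outputs, reusing the results of the first pass
python run.py configs/eval_demo.py -m eval -r -w ./outputs/demo
```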
Users can enable real-time monitoring of task status by setting up a Lark bot; please refer to [this document](https://open.feishu.cn/document/ukTMukTMukTM/ucTM5YjL3ETO24yNxkjN?lang=zh-CN#7a28964d) for how to set one up.
Configuration method:
1. Open the `configs/lark.py` file and add the following line:
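   A sketch of the expected content; replace the placeholder with the webhook URL of your bot:

   ```python
   # Webhook URL of the Lark (Feishu) bot; replace the placeholder with your own
   lark_bot_url = 'YOUR_WEBHOOK_URL'
   ```

2. Inherit this file in the evaluation configuration to be run; a sketch, assuming the configuration sits in the same directory as `lark.py`:

   ```python
   from mmengine.config import read_base

   # Pull in the webhook URL defined in step 1
   with read_base():
       from .lark import lark_bot_url
   ```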
3. To avoid frequent messages from the bot becoming a nuisance, task status is not reported automatically by default. When needed, you can enable status reporting with `-l` or `--lark`:
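   For example:

   ```bash
   python run.py configs/eval_demo.py -l
   ```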