---
# Model definition for a vLLM-backed model (LocalAI-style layout: a top-level
# `name` plus the actual model configuration embedded as a literal block
# scalar under `config_file`).
#
# NOTE(review): this file had been corrupted by interleaved git-blame
# timestamps and gutter `|` lines; structure below is reconstructed from the
# surviving key order. Indentation inside `config_file` (in particular which
# keys nest under `function.grammar`) is inferred — confirm against the
# consumer's schema.
name: "vllm"
config_file: |
  context_size: 8192
  parameters:
    max_tokens: 8192
  backend: vllm
  function:
    disable_no_action: true
    # keys below follow `grammar:` in the original; assumed to be its
    # children — TODO confirm
    grammar:
      disable: true
      parallel_calls: true
      expect_strings_after_json: true
  template:
    use_tokenizer_template: true
  # Uncomment to specify a quantization method (optional)
  # quantization: "awq"
  # Uncomment to set dtype, choices are: "auto", "half", "float16", "bfloat16", "float", "float32". awq on vLLM does not support bfloat16
  # dtype: "float16"
  # Uncomment to limit the GPU memory utilization (vLLM default is 0.9 for 90%)
  # gpu_memory_utilization: 0.5
  # Uncomment to trust remote code from huggingface
  # trust_remote_code: true
  # Uncomment to enable eager execution
  # enforce_eager: true
  # Uncomment to specify the size of the CPU swap space per GPU (in GiB)
  # swap_space: 2
  # Uncomment to specify the maximum length of a sequence (including prompt and output)
  # max_model_len: 32768
  # Uncomment and specify the number of Tensor divisions.
  # Allows you to partition and run large models. Performance gains are limited.
  # https://github.com/vllm-project/vllm/issues/1435
  # tensor_parallel_size: 2
  # Uncomment to disable log stats
  # disable_log_stats: true
  # Uncomment to specify Multi-Model limits per prompt, defaults to 1 per modality if not specified
  # limit_mm_per_prompt:
  #   image: 2
  #   video: 2
  #   audio: 2