Commit Graph

105 Commits

Author SHA1 Message Date
f7aabf1b50 fix: bring everything onto the same GRPC version to fix tests (#2199)
fix: more places where we are installing grpc that need a version specified
fix: attempt to fix metal tests
fix: metal/brew is forcing an update, they don't have 1.58 available anymore

Signed-off-by: Chris Jowett <421501+cryptk@users.noreply.github.com>
2024-04-30 19:12:15 +00:00
e38610e521 feat: OpenVINO acceleration for embeddings in transformer backend (#2190)
OpenVINO acceleration for embeddings

New argument type: OVModelForFeatureExtraction
2024-04-30 10:13:04 +02:00
c4f958e11b refactor(application): introduce application global state (#2072)
* start breaking up the giant channel refactor now that it's better understood - easier to merge bites

Signed-off-by: Dave Lee <dave@gray101.com>

* add concurrency and base64 back in, along with new base64 tests.

Signed-off-by: Dave Lee <dave@gray101.com>

* Automatic rename of whisper.go's Result to TranscriptResult

Signed-off-by: Dave Lee <dave@gray101.com>

* remove pkg/concurrency - significant changes coming in split 2

Signed-off-by: Dave Lee <dave@gray101.com>

* fix comments

Signed-off-by: Dave Lee <dave@gray101.com>

* add list_model service as another low-risk service to get it out of the way

Signed-off-by: Dave Lee <dave@gray101.com>

* split backend config loader into seperate file from the actual config struct. No changes yet, just reduce cognative load with smaller files of logical blocks

Signed-off-by: Dave Lee <dave@gray101.com>

* rename state.go ==> application.go

Signed-off-by: Dave Lee <dave@gray101.com>

* fix lost import?

Signed-off-by: Dave Lee <dave@gray101.com>

---------

Signed-off-by: Dave Lee <dave@gray101.com>
2024-04-29 17:42:37 +00:00
b7ea9602f5 fix: undefined symbol: iJIT_NotifyEvent in import torch ##2153 (#2179)
* add  extra index to Intel repository

* Update install.sh
2024-04-29 15:11:09 +02:00
c9451cb604 Bump oneapi-basekit, optimum and openvino (#2139)
* Bump oneapi-basekit, optimum and openvino

* Changed PERFORMANCE HINT to CUMULATIVE_THROUGHPUT

Minor latency change for first token but about 10-15% speedup on token generation.
2024-04-26 16:20:43 +02:00
44bc540bb5 fix: security scanner dislikes runCommand function arguments (#2140)
runCommand ==> ffmpegCommand. No functional changes, but makes it clear to the security scanner and future developers that this function cannot run arbitrary commands

Signed-off-by: Dave Lee <dave@gray101.com>
2024-04-26 10:33:12 +02:00
b664edde29 feat(rerankers): Add new backend, support jina rerankers API (#2121)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2024-04-25 00:19:02 +02:00
2fb34b00b5 Incl ocv pkg for diffsusers utils (#2115)
* Update diffusers.yml

Signed-off-by: jtwolfe <jamie.t.wolfe@gmail.com>

* Update diffusers-rocm.yml

Signed-off-by: jtwolfe <jamie.t.wolfe@gmail.com>

---------

Signed-off-by: jtwolfe <jamie.t.wolfe@gmail.com>
2024-04-24 09:17:49 +02:00
f718a391c0 fix missing TrustRemoteCode in OpenVINO model load (#2114) 2024-04-24 00:45:37 +00:00
8e36fe9b6f Transformers Backend: max_tokens adherence to OpenAI API (#2108)
max token adherence to OpenAI API

improve adherence to OpenAI API when max tokens is omitted or equal to 0 in the request
2024-04-23 18:42:17 +02:00
66b002458d Transformer Backend: Implementing use_tokenizer_template and stop_prompts options (#2090)
* fix regression #1971

fixes regression #1971 introduced by intel_extension_for_transformers==1.4

* UseTokenizerTemplate and StopPrompt

Implementation of use_tokenizer_template and stopwords options
2024-04-21 16:20:25 +00:00
03adc1f60d Add tensor_parallel_size setting to vllm setting items (#2085)
Signed-off-by: Taikono-Himazin <kazu@po.harenet.ne.jp>
2024-04-20 14:37:02 +00:00
af9e5a2d05 Revert #1963 (#2056)
* Revert "fix(fncall): fix regression introduced in #1963 (#2048)"

This reverts commit 6b06d4e0af.

* Revert "fix: action-tmate back to upstream, dead code removal (#2038)"

This reverts commit fdec8a9d00.

* Revert "feat(grpc): return consumed token count and update response accordingly (#2035)"

This reverts commit e843d7df0e.

* Revert "refactor: backend/service split, channel-based llm flow (#1963)"

This reverts commit eed5706994.

* feat(grpc): return consumed token count and update response accordingly

Fixes: #1920

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2024-04-17 23:33:49 +02:00
e843d7df0e feat(grpc): return consumed token count and update response accordingly (#2035)
Fixes: #1920
2024-04-15 19:47:11 +02:00
0fdff26924 feat(parler-tts): Add new backend (#2027)
* feat(parler-tts): Add new backend

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(parler-tts): try downgrade protobuf

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(parler-tts): add parler conda env

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Revert "feat(parler-tts): try downgrade protobuf"

This reverts commit bd5941d5cfc00676b45a99f71debf3c34249cf3c.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* deps: add grpc

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix: try to gen proto with same environment

* workaround

* Revert "fix: try to gen proto with same environment"

This reverts commit 998c745e2f.

* Workaround fixup

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Dave <dave@gray101.com>
2024-04-13 18:59:21 +02:00
eed5706994 refactor: backend/service split, channel-based llm flow (#1963)
Refactor: channel based llm flow and services split

---------

Signed-off-by: Dave Lee <dave@gray101.com>
2024-04-13 09:45:34 +02:00
1981154f49 fix: dont commit generated files to git (#1993)
* fix: initial work towards not committing generated files to the repository

Signed-off-by: Chris Jowett <421501+cryptk@users.noreply.github.com>

* feat: improve build docs

Signed-off-by: Chris Jowett <421501+cryptk@users.noreply.github.com>

* fix: remove unused folder from .dockerignore and .gitignore

Signed-off-by: Chris Jowett <421501+cryptk@users.noreply.github.com>

* fix: attempt to fix extra backend tests

Signed-off-by: Chris Jowett <421501+cryptk@users.noreply.github.com>

* fix: attempt to fix other tests

Signed-off-by: Chris Jowett <421501+cryptk@users.noreply.github.com>

* fix: more test fixes

Signed-off-by: Chris Jowett <421501+cryptk@users.noreply.github.com>

* fix: fix apple tests

Signed-off-by: Chris Jowett <421501+cryptk@users.noreply.github.com>

* fix: more extras tests fixes

Signed-off-by: Chris Jowett <421501+cryptk@users.noreply.github.com>

* fix: add GOBIN to PATH in docker build

Signed-off-by: Chris Jowett <421501+cryptk@users.noreply.github.com>

* fix: extra tests and Dockerfile corrections

Signed-off-by: Chris Jowett <421501+cryptk@users.noreply.github.com>

* fix: remove build dependency checks

Signed-off-by: Chris Jowett <421501+cryptk@users.noreply.github.com>

* fix: add golang protobuf compilers to tests-linux action

Signed-off-by: Chris Jowett <421501+cryptk@users.noreply.github.com>

* fix: ensure protogen is run for extra backend installs

Signed-off-by: Chris Jowett <421501+cryptk@users.noreply.github.com>

* fix: use newer protobuf

Signed-off-by: Chris Jowett <421501+cryptk@users.noreply.github.com>

* fix: more missing protoc binaries

Signed-off-by: Chris Jowett <421501+cryptk@users.noreply.github.com>

* fix: missing dependencies during docker build

Signed-off-by: Chris Jowett <421501+cryptk@users.noreply.github.com>

* fix: don't install grpc compilers in the final stage if they aren't needed

Signed-off-by: Chris Jowett <421501+cryptk@users.noreply.github.com>

* fix: python-grpc-tools in 22.04 repos is too old

Signed-off-by: Chris Jowett <421501+cryptk@users.noreply.github.com>

* fix: add a couple of extra build dependencies to Makefile

Signed-off-by: Chris Jowett <421501+cryptk@users.noreply.github.com>

* fix: unbreak container rebuild functionality

Signed-off-by: Chris Jowett <421501+cryptk@users.noreply.github.com>

---------

Signed-off-by: Chris Jowett <421501+cryptk@users.noreply.github.com>
2024-04-13 09:37:32 +02:00
a8ebf6f575 fix: respect concurrency from parent build parameters when building GRPC (#2023)
Signed-off-by: Chris Jowett <421501+cryptk@users.noreply.github.com>
2024-04-13 09:14:32 +02:00
12c0d9443e feat: use tokenizer.apply_chat_template() in vLLM (#1990)
Use tokenizer.apply_chat_template() in vLLM

Signed-off-by: Ludovic LEROUX <ludovic@inpher.io>
2024-04-11 19:20:22 +02:00
b4548ad72d feat: add flash-attn in nvidia and rocm envs (#1995)
Signed-off-by: Ludovic LEROUX <ludovic@inpher.io>
2024-04-11 09:44:39 +02:00
36da11a0ee deps: Update version of vLLM to add support of Cohere Command_R model in vLLM inference (#1975)
* Update vLLM version to add support of Command_R

Signed-off-by: Koen Farell <hellios.dt@gmail.com>

* fix: Fixed vllm version from requirements

Signed-off-by: Koen Farell <hellios.dt@gmail.com>

* chore: Update transformers-rocm.yml

Signed-off-by: Koen Farell <hellios.dt@gmail.com>

* chore: Update transformers.yml version of vllm

Signed-off-by: Koen Farell <hellios.dt@gmail.com>

---------

Signed-off-by: Koen Farell <hellios.dt@gmail.com>
2024-04-10 11:25:26 +00:00
d23e73b118 fix(autogptq): do not use_triton with qwen-vl (#1985)
* Enhance autogptq backend to support VL models

* update dependencies for autogptq

* remove redundant auto-gptq dependency

* Convert base64 to image_url for Qwen-VL model

* implemented model inference for qwen-vl

* remove user prompt from generated answer

* fixed write image error

* fixed use_triton issue when loading Qwen-VL model

---------

Co-authored-by: Binghua Wu <bingwu@estee.com>
2024-04-10 10:36:10 +00:00
a38618db02 fix regression #1971 (#1972)
fixes regression #1971 introduced by intel_extension_for_transformers==1.4
2024-04-08 22:33:51 +02:00
8210ffcb6c feat: Token Stream support for Transformer, fix: missing package for OpenVINO (#1908)
* Streaming working

* Small fix for regression on CUDA and XPU

* use pip version of optimum[openvino]

* Update backend/python/transformers/transformers_server.py

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>

* Token streaming support

fix optimum[openvino] package in install.sh

* Token Streaming support

---------

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2024-03-27 17:50:35 +01:00
e7cbe32601 feat: Openvino runtime for transformer backend and streaming support for Openvino and CUDA (#1892)
* fixes #1775 and #1774

Add BitsAndBytes Quantization and fixes embedding on CUDA devices

* Manage 4bit and 8 bit quantization

Manage different BitsAndBytes options with the quantization: parameter in yaml

* fix compilation errors on non CUDA environment

* OpenVINO draft

First draft of OpenVINO integration in transformer backend

* first working implementation

* Streaming working

* Small fix for regression on CUDA and XPU

* use pip version of optimum[openvino]

* Update backend/python/transformers/transformers_server.py

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>

---------

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2024-03-26 23:31:43 +00:00
607586e0b7 fix: downgrade torch (#1902)
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2024-03-26 22:56:02 +01:00
b7ffe66219 Enhance autogptq backend to support VL models (#1860)
* Enhance autogptq backend to support VL models

* update dependencies for autogptq

* remove redundant auto-gptq dependency

* Convert base64 to image_url for Qwen-VL model

* implemented model inference for qwen-vl

* remove user prompt from generated answer

* fixed write image error

---------

Co-authored-by: Binghua Wu <bingwu@estee.com>
2024-03-26 18:48:14 +01:00
643d85d2cc feat(stores): Vector store backend (#1795)
Add simple vector store backend

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2024-03-22 21:14:04 +01:00
ed5734ae25 test/fix: OSX Test Repair (#1843)
* test with gguf instead of ggml. Updates testPrompt to match? Adds debugging line to Dockerfile that I've found helpful recently.

* fix testPrompt slightly

* Sad Experiment: Test GH runner without metal?

* break apart CGO_LDFLAGS

* switch runner

* upstream llama.cpp disables Metal on Github CI!

* missed a dir from clean-tests

* CGO_LDFLAGS

* tmate failure + NO_ACCELERATE

* whisper.cpp has a metal fix

* do the exact opposite of the name of this branch, but keep it around for unrelated fixes?

* add back newlines

* add tmate to linux for testing

* update fixtures

* timeout for tmate
2024-03-18 19:19:43 +01:00
fa9e330fc6 fix(llama.cpp): fix eos without cache (#1852) 2024-03-18 18:59:24 +01:00
020ce29cd8 fix(make): allow to parallelize jobs (#1845)
* fix: clean up Makefile dependencies to allow for parallel builds

* refactor: remove old unused backend from Makefile

* fix: finish removing legacy backend, update piper

* fix: I broke llama... I fixed llama

* feat: give the tests and builds a few threads

* fix: ensure libraries are replaced before build, add dropreplace target

* Fix image build workflows
2024-03-17 15:39:20 +01:00
db199f61da fix: osx build default.metallib (#1837)
fix: osx build default.metallib (#1837)
* port osx fix from refactor pr to slim pr
* manually bump llama.cpp version to unstick CI?
2024-03-15 08:18:58 +00:00
20136ca8b7 feat(tts): add Elevenlabs and OpenAI TTS compatibility layer (#1834)
* feat(elevenlabs): map elevenlabs API support to TTS

This allows elevenlabs Clients to work automatically with LocalAI by
supporting the elevenlabs API.

The elevenlabs server endpoint is implemented such as it is wired to the
TTS endpoints.

Fixes: https://github.com/mudler/LocalAI/issues/1809

* feat(openai/tts): compat layer with openai tts

Fixes: #1276

* fix: adapt tts CLI
2024-03-14 23:08:34 +01:00
45d520f913 fix: OSX Build Files for llama.cpp (#1836)
bot ate my changes, seperate branch
2024-03-14 23:07:47 +01:00
3882130911 feat: Add Bitsandbytes quantization for transformer backend enhancement #1775 and fix: Transformer backend error on CUDA #1774 (#1823)
* fixes #1775 and #1774

Add BitsAndBytes Quantization and fixes embedding on CUDA devices

* Manage 4bit and 8 bit quantization

Manage different BitsAndBytes options with the quantization: parameter in yaml

* fix compilation errors on non CUDA environment
2024-03-14 23:06:30 +01:00
b423af001d fix: the correct BUILD_TYPE for OpenCL is clblas (with no t) (#1828) 2024-03-14 08:39:21 +01:00
5d1018495f feat(intel): add diffusers/transformers support (#1746)
* feat(intel): add diffusers support

* try to consume upstream container image

* Debug

* Manually install deps

* Map transformers/hf cache dir to modelpath if not specified

* fix(compel): update initialization, pass by all gRPC options

* fix: add dependencies, implement transformers for xpu

* base it from the oneapi image

* Add pillow

* set threads if specified when launching the API

* Skip conda install if intel

* defaults to non-intel

* ci: add to pipelines

* prepare compel only if enabled

* Skip conda install if intel

* fix cleanup

* Disable compel by default

* Install torch 2.1.0 with Intel

* Skip conda on some setups

* Detect python

* Quiet output

* Do not override system python with conda

* Prefer python3

* Fixups

* exllama2: do not install without conda (overrides pytorch version)

* exllama/exllama2: do not install if not using cuda

* Add missing dataset dependency

* Small fixups, symlink to python, add requirements

* Add neural_speed to the deps

* correctly handle model offloading

* fix: device_map == xpu

* go back at calling python, fixed at dockerfile level

* Exllama2 restricted to only nvidia gpus

* Tokenizer to xpu
2024-03-07 14:37:45 +01:00
5c69dd155f feat(autogpt/transformers): consume trust_remote_code (#1799)
trusting remote code by default is a danger to our users
2024-03-05 19:47:15 +01:00
504f2e8bf4 Update Backend Dependancies (#1797)
* Update transformers.yml

Signed-off-by: TwinFin <57421631+TwinFinz@users.noreply.github.com>

* Update transformers-rocm.yml

Signed-off-by: TwinFin <57421631+TwinFinz@users.noreply.github.com>

* Update transformers-nvidia.yml

Signed-off-by: TwinFin <57421631+TwinFinz@users.noreply.github.com>

---------

Signed-off-by: TwinFin <57421631+TwinFinz@users.noreply.github.com>
2024-03-05 10:10:00 +00:00
939411300a Bump vLLM version + more options when loading models in vLLM (#1782)
* Bump vLLM version to 0.3.2

* Add vLLM model loading options

* Remove transformers-exllama

* Fix install exllama
2024-03-01 22:48:53 +01:00
31a4c9c9d3 Fix Command Injection Vulnerability (#1778)
* Added fix for command injection

* changed function name from sh to runCommand
2024-02-29 18:32:29 +00:00
bc5f5aa538 deps(llama.cpp): update (#1759)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2024-02-26 13:18:44 +01:00
0135e1e3b9 fix: vllm - use AsyncLLMEngine to allow true streaming mode (#1749)
* fix: use vllm AsyncLLMEngine to bring true stream

Current vLLM implementation uses the LLMEngine, which was designed for offline batch inference, which results in the streaming mode outputing all blobs at once at the end of the inference.

This PR reworks the gRPC server to use asyncio and gRPC.aio, in combination with vLLM's AsyncLLMEngine to bring true stream mode.

This PR also passes more parameters to vLLM during inference (presence_penalty, frequency_penalty, stop, ignore_eos, seed, ...).

* Remove unused import
2024-02-24 11:48:45 +01:00
8292781045 deps(llama.cpp): update, support Gemma models (#1734)
deps(llama.cpp): update

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2024-02-21 17:23:38 +01:00
54ec6348fa deps(llama.cpp): update (#1714)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2024-02-21 11:35:44 +01:00
255748bcba MQTT Startup Refactoring Part 1: core/ packages part 1 (#1728)
This PR specifically introduces a `core` folder and moves the following packages over, without any other changes:

- `api/backend`
- `api/config`
- `api/options`
- `api/schema`

Once this is merged and we confirm there's no regressions, I can migrate over the remaining changes piece by piece to split up application startup, backend services, http, and mqtt as was the goal of the earlier PRs!
2024-02-21 01:21:19 +00:00
594eb468df Add TTS dependency for cuda based builds fixes #1727 (#1730)
Signed-off-by: Chakib Benziane <contact@blob42.xyz>
2024-02-20 21:59:43 +01:00
fb0a4c5d9a Build docker container for ROCm (#1595)
* Dockerfile changes to build for ROCm

* Adjust linker flags for ROCm

* Update conda env for diffusers and transformers to use ROCm pytorch

* Update transformers conda env for ROCm

* ci: build hipblas images

* fixup rebase

* use self-hosted

Signed-off-by: mudler <mudler@localai.io>

* specify LD_LIBRARY_PATH only when BUILD_TYPE=hipblas

---------

Signed-off-by: mudler <mudler@localai.io>
Co-authored-by: mudler <mudler@localai.io>
2024-02-16 15:08:50 +01:00
5e155fb081 fix(python): pin exllama2 (#1711)
fix(python): pin python deps

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2024-02-14 21:44:12 +01:00
c56b6ddb1c fix(llama.cpp): disable infinite context shifting (#1704)
Infinite context loop might as well trigger an infinite loop of context
shifting if the model hallucinates and does not stop answering.
This has the unpleasant effect that the predicion never terminates,
which is the case especially on small models which tends to hallucinate.

Workarounds https://github.com/mudler/LocalAI/issues/1333 by removing
context-shifting.

See also upstream issue: https://github.com/ggerganov/llama.cpp/issues/3969
2024-02-13 21:17:21 +01:00