LocalAI/backend
Ludovic Leroux 0135e1e3b9
fix: vllm - use AsyncLLMEngine to allow true streaming mode (#1749)
* fix: use vllm AsyncLLMEngine to bring true stream

The current vLLM implementation uses LLMEngine, which was designed for offline batch inference; as a result, streaming mode outputs all blobs at once at the end of inference.

This PR reworks the gRPC server to use asyncio and grpc.aio, in combination with vLLM's AsyncLLMEngine, to provide true streaming mode.
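
A minimal sketch of the approach described above: a grpc.aio server whose streaming handler iterates over AsyncLLMEngine.generate and yields each newly produced chunk as it arrives. This is not the repository's actual backend code; the `backend_pb2`/`backend_pb2_grpc` stubs and the PredictOptions/Reply field names are assumptions standing in for the stubs generated from backend.proto.

```python
import asyncio
import uuid

import grpc
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

import backend_pb2        # assumed: stubs generated from backend.proto
import backend_pb2_grpc   # assumed: stubs generated from backend.proto


class BackendServicer(backend_pb2_grpc.BackendServicer):
    def __init__(self, model: str):
        # AsyncLLMEngine wraps LLMEngine in a background loop suited to
        # concurrent online serving rather than offline batch inference.
        self.engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(model=model))

    async def PredictStream(self, request, context):
        # Field names on `request` are illustrative assumptions.
        sampling = SamplingParams(max_tokens=request.Tokens or 512)
        request_id = uuid.uuid4().hex
        sent = 0
        # engine.generate is an async generator of RequestOutput objects; each
        # carries the cumulative text, so only the new suffix is forwarded.
        async for output in self.engine.generate(request.Prompt, sampling, request_id):
            text = output.outputs[0].text
            delta, sent = text[sent:], len(text)
            if delta:
                yield backend_pb2.Reply(message=delta.encode("utf-8"))


async def serve(address: str = "localhost:50051"):
    server = grpc.aio.server()
    backend_pb2_grpc.add_BackendServicer_to_server(
        BackendServicer("facebook/opt-125m"), server
    )
    server.add_insecure_port(address)
    await server.start()
    await server.wait_for_termination()


if __name__ == "__main__":
    asyncio.run(serve())
```

Because the handler yields inside the async generator loop, grpc.aio flushes each chunk to the client immediately instead of waiting for the whole completion.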

This PR also passes more parameters to vLLM during inference (presence_penalty, frequency_penalty, stop, ignore_eos, seed, ...).
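
As a sketch of that parameter pass-through, the request fields below are assumed names for illustration, while the keyword arguments are real vLLM SamplingParams options:

```python
from vllm import SamplingParams


def to_sampling_params(request) -> SamplingParams:
    # Map gRPC request fields (assumed names) onto vLLM sampling options.
    return SamplingParams(
        temperature=request.Temperature,
        top_p=request.TopP,
        presence_penalty=request.PresencePenalty,
        frequency_penalty=request.FrequencyPenalty,
        stop=list(request.StopPrompts),   # stop sequences
        ignore_eos=request.IgnoreEOS,
        seed=request.Seed,
        max_tokens=request.Tokens or 512,
    )
```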

* Remove unused import
2024-02-24 11:48:45 +01:00
cpp deps(llama.cpp): update, support Gemma models (#1734) 2024-02-21 17:23:38 +01:00
go MQTT Startup Refactoring Part 1: core/ packages part 1 (#1728) 2024-02-21 01:21:19 +00:00
python fix: vllm - use AsyncLLMEngine to allow true streaming mode (#1749) 2024-02-24 11:48:45 +01:00
backend_grpc.pb.go transformers: correctly load automodels (#1643) 2024-01-26 00:13:21 +01:00
backend.proto transformers: correctly load automodels (#1643) 2024-01-26 00:13:21 +01:00