Releases: abgulati/LARS
v2.0-beta6: Major HF-Waitress LLM Server Update
- HF-Waitress: /completions_stream now implements a custom TextStreamer that redirects only its output to the stream buffer, while STDOUT remains unmodified. This allows other non-blocked routes and methods to execute and print to STDOUT in parallel without interfering with the stream
- CSS separated into a dedicated file
- Minor QoL changes
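The core idea behind the custom streamer above is to push finalized text into a per-request buffer instead of printing it. Here is a minimal, dependency-free sketch of that pattern; the class and method names are illustrative (modelled on transformers' `TextStreamer` callback interface), not HF-Waitress's actual code:

```python
from queue import Queue

class BufferedStreamer:
    """Sketch of a streamer that redirects generated text into a
    per-request queue rather than printing to STDOUT, leaving STDOUT
    free for other routes running in parallel. Names are illustrative,
    not the project's actual implementation."""

    def __init__(self):
        self.buffer = Queue()  # per-request stream buffer

    def on_finalized_text(self, text: str, stream_end: bool = False):
        # Redirect output into the buffer instead of printing it
        self.buffer.put(text)
        if stream_end:
            self.buffer.put(None)  # sentinel: generation finished

    def drain(self):
        # Generator a streaming route can iterate over and yield to the client
        while True:
            chunk = self.buffer.get()
            if chunk is None:
                break
            yield chunk
```

A streaming route can then `yield from streamer.drain()` while the generation thread calls `on_finalized_text`, so only the stream's own output reaches the response body.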
Full Changelog: v2.0-beta5...v2.0-beta6
v2.0-beta5: UI Enhancements
- New font-family, glassmorphism and title bar
Full Changelog: v2.0-beta4...v2.0-beta5
v2.0-beta4: HQQ Fix and Minor Refinements
- BUG FIX: HQQ quantization would error out if torch.dtype (dataType) was set to auto; it now force-sets torch.bfloat16
- BUG FIX: The Add-new-LLM button now re-displays when the HF-Waitress LLM list is closed and re-opened
- Minor response-formatting adjustment
Full Changelog: v2.0-beta3...v2.0-beta4
v2.0-beta3
- Fixed HF-Waitress streaming-response formatting!
- Improved app load times via tuned server health-check intervals
- Minor performance improvement to HF-Waitress streaming output
- Minor refinements to HF-Waitress server status outputs
Full Changelog: v2.0-beta2...v2.0-beta3
v2.0-beta2: Enhanced HF-Waitress LLM Management Features, Error-Reporting Refinements and Bug Fixes
- Enhanced HF-Waitress LLM Management: Add new model_ids, search-filter & sort the list of LLMs as well as delete LLM IDs from the HF-Waitress LLM dropdown list
- HF-Waitress server health-check reporting improvements
- Various bug fixes: reference to `index_dir` removed, document_records SQL-DB now correctly created on the very first run, and troublesome test-prints removed from the document-chunking operation
Full Changelog: v2.0-beta1...v2.0-beta2
v2.0-beta1: New LLM Server -- HF-Waitress!
HF-Waitress is a powerful and flexible server application for deploying and interacting with HuggingFace Transformer models. It simplifies the process of running open-source Large Language Models (LLMs) locally on-device, addressing common pain points in model deployment and usage.
This server enables loading HF-Transformer and AWQ-quantized models directly off the hub, while providing on-the-fly quantization via BitsAndBytes, HQQ and Quanto for the former. It negates the need to manually download any model yourself, working simply off the model's name instead. It requires no setup, and provides concurrency and streaming responses, all from within a single, easily-portable, platform-agnostic Python script.
For a full list of features see: https://github.com/abgulati/hf-waitress
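Loading a model by name with on-the-fly quantization is the pattern HF-Waitress automates. The fragment below is a minimal configuration sketch of that approach using the transformers and bitsandbytes libraries (both assumed installed); the model id is just an example, not a project default:

```python
# Sketch: load an HF-Transformers model straight off the hub by name,
# quantizing on the fly -- no manual weight download required.
# Assumes `torch`, `transformers` and `bitsandbytes` are installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "microsoft/Phi-3-mini-4k-instruct"  # example hub id

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4-bit at load time
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute dtype for matmuls
)

# from_pretrained fetches and caches the weights by name
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quant_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```

HF-Waitress wraps this kind of loading behind its API, adding HQQ and Quanto as alternative quantizers; see its repository for the supported options.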
LARS is now far easier to deploy and get working on the very first run, as users no longer need to manually download and place their LLMs.
Check out the updated Dependencies, Installation and Usage Instructions in the README
Note: containers are not yet updated; they will most likely be updated in the following week.
Full Changelog: v1.9.1...v2.0-beta1
v1.9.1 - Re-ranker Robustness & Minor UI Tweak
- BUG FIX: Re-ranking is now bypassed when do_rag=False, so the empty document list no longer produces an error!
- Minor UI change: Adjusted max-width of the Settings modal to 75% for better use of available screen space
Full Changelog: v1.9...v1.9.1
v1.9 - Vector Re-Ranking & No More Whoosh
- Custom document chunker appends page-number data as metadata to chunks stored in the vectorDB
- LLM can now supply specific document names and page numbers within the response itself!
- Re-ranking and filtering applied via SentenceTransformer('all-MiniLM-L6-v2') to the vectorDB similarity search results for better contextual accuracy
- Whoosh indexing no longer necessary - far simplified book-keeping and no overhead for page-number searches at inference time
- Page number accuracy significantly increased as a result of all the above
- Default system-prompt template now instructs the LLM to include document names and page numbers whenever additional context is provided, actual output dependent on ability of the specific LLM used
- BUG FIX: PDF tabs in the document-viewer in the response window did not open properly for consecutive questions and on chat-history load. FIXED.
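The re-ranking step above re-scores vectorDB hits against the query before they reach the LLM. The sketch below shows the idea with plain cosine similarity over toy vectors; in LARS the embeddings come from SentenceTransformer('all-MiniLM-L6-v2'), which is omitted here to keep the example dependency-free:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def rerank(query_vec, candidates, threshold=0.0):
    """Re-score retrieved chunks against the query and sort best-first,
    dropping anything below the threshold. `candidates` is a list of
    (chunk_text, embedding) pairs; the embeddings here are toy vectors
    standing in for real sentence embeddings."""
    scored = [(cosine(query_vec, emb), text) for text, emb in candidates]
    return [(s, t) for s, t in sorted(scored, reverse=True) if s >= threshold]
```

Filtering on a similarity threshold after re-ranking is what trims weakly-related chunks from the context handed to the LLM.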
Full Changelog: v1.8...v1.9
v1.8
MAJOR UPDATE:
- Google Drive Integration complete! Downloads files and folders recursively. Filtering, sorting and queued-loading of Google Drive docs is now available via the UI
- Improved highlighting: Implemented fuzzy-search logic, replacing exact matching, resulting in expanded highlighting on pages
- Improved RAG: Increased the cosine-similarity search threshold to 80% for more stringent and accurate matching, and sources data is now passed to the LLM for improved response quality
- Improved handling of images for citations: image extraction is now skipped for scanned docs
- Clearer document naming in citations: the unique ID of the highlighted document is no longer attached to the document name in the 'Refer to the following documents' citations block
- BUG FIX: When using the free-tier of the AzureCV OCR service, it will handle UsageLimitExceeded errors even when submitting multiple documents back-to-back, auto-waiting and resuming correctly
- BUG FIX: handle_api_error events will now actually return to the front-end!
- Refactored process_new_file method into smaller blocks that are now shared with the GoogleDrive loader and can be used by other integrations in the future too
- Increased chunk size to 500 and removed '250' from the name of the SBERT VectorDB created
- Cleaned up print and newline statements
- Improvements to accuracy and relevance of page numbers and doc names cited in response, further refinements on-going
- Replaced the Whoosh indexing search operator, from the default AND to OR
- HF-Waitress local-LLM server integration begins!
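The fuzzy-search highlighting mentioned above handles the common case where PDF text extraction drifts slightly from the stored chunk text, so exact substring matching fails. Here is one way to sketch the idea using only the stdlib's difflib; this is an illustration of the technique, not necessarily the matcher LARS uses:

```python
from difflib import SequenceMatcher

def best_fuzzy_span(page_text, snippet, min_ratio=0.6):
    """Locate the region of `page_text` best matching `snippet`, even
    when extraction artifacts prevent an exact match. Returns a
    (start, end) index pair, or None if nothing similar enough exists.
    Function name and approach are illustrative."""
    m = SequenceMatcher(None, page_text, snippet, autojunk=False)
    match = m.find_longest_match(0, len(page_text), 0, len(snippet))
    if match.size == 0:
        return None
    # Expand around the anchor match to roughly the snippet's length
    start = max(0, match.a - match.b)
    end = min(len(page_text), start + len(snippet))
    candidate = page_text[start:end]
    ratio = SequenceMatcher(None, candidate, snippet).ratio()
    return (start, end) if ratio >= min_ratio else None
```

Because the match is scored by similarity ratio rather than equality, a highlight can still land on the right span even when a character or two differs between the extracted page text and the chunk.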
Full Changelog: v1.7...v1.8
v1.7
- New models supported: Google Gemma2, DeepSeek V2, Llama-3.1
- Revamped Docker builds: new dockerfiles
- Pre-built images shared
- Various bug-fixes and enhancements
Full Changelog: v1.6...v1.7