Important
Disclaimer: The code has been tested on:
- Ubuntu 22.04.2 LTS running on a Lenovo Legion 5 Pro with a 12th Gen Intel® Core™ i7-12700H (twenty threads) and an NVIDIA GeForce RTX 3060.
- macOS Sonoma 14.3.1 running on a MacBook Pro M1 (2020).
If you are using another operating system or different hardware and you can't load the models, please take a look at the official llama-cpp-python GitHub issues.
Warning
- llama-cpp-python doesn't use the GPU on M1 if you are running an x86 version of Python. More info here.
- It's important to note that the large language model sometimes generates hallucinations or false information.
- Introduction
- Prerequisites
- Bootstrap Environment
- Using the Open-Source Models Locally
- Supported Response Synthesis strategies
- Example Data
- Build the memory index
- Run the Chatbot
- Run the RAG Chatbot
- How to debug the Streamlit app on PyCharm
- References
This project combines the power of llama.cpp, Chroma and Streamlit to build:
- a Conversation-aware Chatbot (ChatGPT-like experience).
- a RAG (Retrieval-augmented generation) ChatBot.
The RAG Chatbot works by taking a collection of Markdown files as input and, when asked a question, provides the corresponding answer based on the context provided by those files.
Note
We decided to grab and refactor the `RecursiveCharacterTextSplitter` class from LangChain to effectively chunk Markdown files without adding LangChain as a dependency.
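The idea behind recursive character splitting can be sketched in a few lines. This is a simplified illustration, not the project's refactored class: the function name `recursive_split`, the separator order, and the greedy merge are assumptions, and chunk overlap is left out for brevity.

```python
def recursive_split(
    text: str,
    chunk_size: int,
    separators: tuple[str, ...] = ("\n\n", "\n", " ", ""),
) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    The coarsest separator is tried first; finer separators are used
    only for pieces that are still too long.
    """
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # Greedily merge small pieces back together.
            current = candidate
            continue
        if current.strip():
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: split it with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current.strip():
        chunks.append(current)
    return chunks


doc = "# Title\n\nFirst paragraph here.\n\nSecond paragraph is a little longer than the first."
for chunk in recursive_split(doc, chunk_size=40):
    print(repr(chunk))
```

Paragraph breaks are preferred as cut points, so short paragraphs stay together while a long one falls back to line and then word boundaries.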
The Memory Builder component of the project loads Markdown pages from the `docs` folder.
It then divides these pages into smaller sections, calculates the embeddings (a numerical representation) of these sections with the `all-MiniLM-L6-v2` sentence-transformer, and saves them in an embedding database called Chroma for later use.
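The embed, store, and query flow described above can be illustrated without any external dependency. The sketch below is not the project's Memory Builder: `toy_embed` is a hypothetical stand-in for `all-MiniLM-L6-v2` that hashes words into a small vector, `ToyIndex` is a hypothetical in-memory stand-in for Chroma, and the sample sentences are made up.

```python
import hashlib
import math
import re


def toy_embed(text: str, dim: int = 256) -> list[float]:
    """Hypothetical stand-in for a sentence-transformer: hashed bag of words, L2-normalized."""
    vec = [0.0] * dim
    for word in re.findall(r"\w+", text.lower()):
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class ToyIndex:
    """Hypothetical in-memory stand-in for an embedding database."""

    def __init__(self) -> None:
        self._items: list[tuple[str, list[float]]] = []

    def add(self, chunks: list[str]) -> None:
        # Store each chunk next to its embedding.
        for chunk in chunks:
            self._items.append((chunk, toy_embed(chunk)))

    def query(self, question: str, k: int = 2) -> list[str]:
        q = toy_embed(question)
        # Vectors are unit length, so the dot product equals the cosine similarity.
        ranked = sorted(
            self._items,
            key=lambda item: sum(a * b for a, b in zip(q, item[1])),
            reverse=True,
        )
        return [chunk for chunk, _ in ranked[:k]]


index = ToyIndex()
index.add([
    "Employees get 25 vacation days per year.",  # made-up sample text
    "The office closes at six in the evening.",
    "Expenses are reimbursed monthly.",
])
print(index.query("how many vacation days per year", k=1))
```

A real sentence-transformer captures meaning rather than exact word overlap, which is why the project uses one instead of a word-hashing scheme like this.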
When a user asks a question, the RAG ChatBot retrieves the most relevant sections from the embedding database. Because the original question is not always the best query for retrieval, we first prompt an LLM to rewrite the question, then conduct retrieval-augmented reading. The most relevant sections are then used as context to generate the final answer using a local large language model (LLM). Additionally, the chatbot is designed to remember previous interactions. It saves the chat history and considers the relevant context from previous conversations to provide more accurate answers.
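The rewrite, retrieve, and read steps described above can be sketched as follows. This is an illustration under stated assumptions, not the project's code: `answer_with_rag`, the prompt wording, and the `llm` and `retrieve` callables are hypothetical, and the stubs at the bottom return canned, made-up text so the example runs without a model.

```python
from typing import Callable


def answer_with_rag(
    question: str,
    history: list[tuple[str, str]],
    llm: Callable[[str], str],
    retrieve: Callable[[str, int], list[str]],
    k: int = 2,
) -> str:
    """Rewrite the question, retrieve k sections, then answer using them as context."""
    transcript = "\n".join(f"User: {q}\nAssistant: {a}" for q, a in history)
    # Step 1 (rewrite): turn the follow-up into a standalone query using the chat history.
    query = llm(f"Chat history:\n{transcript}\n\nRewrite as a standalone question: {question}")
    # Step 2 (retrieve): fetch the most relevant sections for the rewritten query.
    sections = retrieve(query, k)
    # Step 3 (read): answer the rewritten query with the sections as context.
    context = "\n---\n".join(sections)
    answer = llm(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")
    # Remember the exchange so later questions can refer back to it.
    history.append((question, answer))
    return answer


def fake_llm(prompt: str) -> str:
    # Hypothetical stub: a real local model call would go here.
    if "Rewrite" in prompt:
        return "How many vacation days do employees get?"
    return "Answer based on: " + prompt.split("Context:\n")[1].split("\n\n")[0]


def fake_retrieve(query: str, k: int) -> list[str]:
    # Hypothetical stub: a real vector-database query would go here.
    return ["Employees get 25 vacation days."][:k]


history: list[tuple[str, str]] = []
print(answer_with_rag("And vacation?", history, fake_llm, fake_retrieve))
```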
To deal with context overflows, we implemented three approaches:
- Create And Refine the Context: synthesize a response sequentially through all retrieved contents.
- Hierarchical Summarization of Context: generate an answer for each relevant section independently, and then hierarchically combine the answers.
- Async Hierarchical Summarization of Context: parallelized version of the Hierarchical Summarization of Context, which leads to big speedups in response synthesis.
- Python 3.10+
- GPU supporting CUDA 12.1+
- Poetry 1.7.0
Install Poetry with the official installer by following this link.
You must use the currently adopted version of Poetry defined here.
If you already have Poetry installed but it is not the right version, you can downgrade (or upgrade) it with:
poetry self update <version>
To easily install the dependencies, we created a Makefile.
Important
Run Setup as your init command (or after Clean).
- Check: `make check`
  - Use it to check that `which pip3` and `which python3` point to the right path.
- Setup:
  - Setup with NVIDIA CUDA acceleration: `make setup_cuda`
    - Creates an environment and installs all dependencies with NVIDIA CUDA acceleration.
  - Setup with Metal GPU acceleration: `make setup_metal`
    - Creates an environment and installs all dependencies with Metal GPU acceleration (macOS only).
- Update: `make update`
  - Updates the environment and installs all updated dependencies.
- Tidy up the code: `make tidy`
  - Runs Ruff check and format.
- Clean: `make clean`
  - Removes the environment and all cached files.
- Test: `make test`
  - Runs all tests using pytest.
We use the open-source library llama-cpp-python, a binding for llama.cpp, which allows us to use it within a Python environment.
llama.cpp serves as a C++ backend designed to work efficiently with transformer-based models.
Running these LLMs at full precision on a typical local PC is impractical because of their large number of parameters (~7 billion).
This library enables us to run them on either a CPU or a GPU.
Additionally, we use quantization with 4-bit precision to reduce the number of bits required to represent the weights.
The quantized models are stored in GGML/GGUF format.
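A quick back-of-the-envelope calculation shows why quantization matters for a ~7 billion parameter model. The helper name `weights_size_gb` is illustrative, and the figures cover the weights only: real 4-bit GGUF files are somewhat larger because scaling factors are stored alongside the weights, and inference needs additional memory for the context.

```python
def weights_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


for bits in (32, 16, 4):
    print(f"{bits:>2}-bit: {weights_size_gb(7e9, bits):.1f} GB")
# 32-bit: 28.0 GB
# 16-bit: 14.0 GB
#  4-bit: 3.5 GB
```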
🤖 Model | Supported | Model Size | Max Context Window | Notes and link to the model card |
---|---|---|---|---|
`llama-3.2` - Meta Llama 3.2 Instruct | ✅ | 1B | 128k | Optimized to run locally on a mobile or edge device - Card |
`llama-3.2` - Meta Llama 3.2 Instruct | ✅ | 3B | 128k | Optimized to run locally on a mobile or edge device - Card |
`llama-3.1` - Meta Llama 3.1 Instruct | ✅ | 8B | 128k | Recommended model - Card |
`openchat-3.6` - OpenChat 3.6 | ✅ | 8B | 8192 | Card |
`openchat-3.5` - OpenChat 3.5 | ✅ | 7B | 8192 | Card |
`starling` - Starling Beta | ✅ | 7B | 8192 | Trained from Openchat-3.5-0106. Recommended if you prefer more verbosity over OpenChat - Card |
`phi-3.5` - Phi-3.5 Mini Instruct | ✅ | 3.8B | 128k | Card |
`stablelm-zephyr` - StableLM Zephyr OpenOrca | ✅ | 3B | 4096 | Card |
✨ Response Synthesis strategy | Supported | Notes |
---|---|---|
`create-and-refine` - Create and Refine | ✅ | |
`tree-summarization` - Tree Summarization | ✅ | |
`async-tree-summarization` - Recommended - Async Tree Summarization | ✅ | |
You can download some Markdown pages from the Blendle Employee Handbook and put them under `docs`.
Run:
python chatbot/memory_builder.py --chunk-size 1000 --chunk-overlap 50
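To get a feel for what `--chunk-size 1000 --chunk-overlap 50` means, the sketch below computes plain character windows in which each chunk starts 950 characters after the previous one. It is a simplification: `window_bounds` is a hypothetical helper, and the project's splitter cuts at Markdown-friendly separators, so the real boundaries will differ.

```python
def window_bounds(n_chars: int, chunk_size: int = 1000, chunk_overlap: int = 50) -> list[tuple[int, int]]:
    """Return (start, end) character offsets of fixed-size chunks that overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each chunk starts this far after the previous one
    bounds: list[tuple[int, int]] = []
    start = 0
    while start < n_chars:
        end = min(start + chunk_size, n_chars)
        bounds.append((start, end))
        if end == n_chars:
            break
        start += step
    return bounds


print(window_bounds(2500))
# [(0, 1000), (950, 1950), (1900, 2500)]
```

The overlap means the last 50 characters of one chunk are repeated at the start of the next, so a sentence cut at a boundary still appears whole in at least one chunk.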
To interact with the Chatbot through a GUI, type:
streamlit run chatbot/chatbot_app.py -- --model llama-3.1 --max-new-tokens 1024
To interact with the RAG Chatbot through a GUI, type:
streamlit run chatbot/rag_chatbot_app.py -- --model llama-3.1 --k 2 --synthesis-strategy async-tree-summarization
- Large Language Models (LLMs):
- LLM Frameworks:
- llama.cpp:
- Ollama:
- Ollama
- Ollama Python Library
- On the architecture of ollama
- Analysis of Ollama Architecture and Conversation Processing Flow for AI LLM Tool
- How to Customize Ollama’s Storage Directory
- Use CodeGPT to access self-hosted models from Ollama for a code assistant in PyCharm. More info here.
- DeepEval - A framework for evaluating LLMs:
- LLM Datasets:
- Agents:
- Agent Frameworks:
- Embeddings:
  - To find the best embedding models for the retrieval task in your language, see the Massive Text Embedding Benchmark (MTEB) Leaderboard.
  - all-MiniLM-L6-v2
    - This is a sentence-transformers model: it maps sentences and paragraphs to a 384-dimensional dense vector space (max tokens: 512) and can be used for tasks like classification or semantic search.
- Vector Databases:
  - Indexing algorithms:
    - There are many algorithms for building indexes to optimize vector search. Most vector databases implement Hierarchical Navigable Small World (HNSW) and/or Inverted File Index (IVF). Here are some great articles explaining them, and the trade-off between speed, memory and quality:
      - Nearest Neighbor Indexes for Similarity Search
      - Hierarchical Navigable Small World (HNSW)
      - From NVIDIA - Accelerating Vector Search: Using GPU-Powered Indexes with RAPIDS RAFT
      - From NVIDIA - Accelerating Vector Search: Fine-Tuning GPU Index Algorithms
    - PS: Flat indexes (i.e. no optimisation) can be used to maintain 100% recall and precision, at the expense of speed.
  - Chroma
  - Qdrant:
- Retrieval Augmented Generation (RAG):
- Building A Generative AI Platform
- Rewrite-Retrieve-Read
  - Because the original query can not be always optimal to retrieve for the LLM, especially in the real world, we first prompt an LLM to rewrite the queries, then conduct retrieval-augmented reading.
- Rerank
- Building Response Synthesis from Scratch
- Conversational awareness
- RAG is Dead, Again?
- Chatbot UI:
- Text Processing and Cleaning:
- Inspirational Open Source Repositories: