When starting a new GPT project that needs its own Redis DB populated with data, follow these steps:
- Copy an existing gpt-data project locally, for example gpt-data-kvindekroppen.
- Rename it to the name of the new project you are creating, for example gpt-data-infare.
- Delete the data folder and its contents, then create an empty data folder in the root of the project.
- Delete the data-temp folder and its contents, then create an empty data-temp folder in the root of the project.
- Delete the venv folder (if there is one).
- In the project folder on your computer, enable showing hidden files and delete the .git folder. This removes the link between the project and the old GitHub repository.
- Go to Abtion's GitHub account and click Create New Repository.
- Add a name.
- Leave everything else as it is and click Create repository.
- Go to Redis Cloud, log in with Abtion's account, and create a new DB.
- Back in VS Code, open the new gpt-data project and update the .env file with the new Redis credentials.
- Follow the steps below to create embeddings and ingest them into the Redis DB.
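The local cleanup steps above can be sketched as a small shell script. The project names are the examples from this guide; adjust them to your own. To keep the sketch safe to try, it runs in a throwaway temp directory with a stand-in for the existing project:

```shell
set -eu
# Demo setup in a temp dir: stand in for an existing gpt-data project
WORK=$(mktemp -d)
cd "$WORK"
mkdir -p gpt-data-kvindekroppen/data gpt-data-kvindekroppen/data-temp gpt-data-kvindekroppen/.git
touch gpt-data-kvindekroppen/data/old.txt

# The actual steps: copy, rename, wipe old data/venv, unlink the old git repo
cp -r gpt-data-kvindekroppen gpt-data-infare
cd gpt-data-infare
rm -rf data data-temp venv .git
mkdir data data-temp
```

After this, the new project has empty data and data-temp folders and no .git folder, ready to be pushed to the new repository.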
Ensure your .env variables are up to date with at least:
- REDIS_HOST
- REDIS_PORT
- REDIS_PASSWORD
- OPENAI_API_KEY
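A minimal .env might look like this (the host, port, and key values below are placeholders, not real credentials):

```
REDIS_HOST=redis-12345.c123.eu-west-1.ec2.cloud.redislabs.com
REDIS_PORT=12345
REDIS_PASSWORD=your-redis-password
OPENAI_API_KEY=sk-...
```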
If there is no venv folder in the root directory of the project, create one with the command below. Note: if you get an error in one of the following steps, delete the venv folder and create a new one:
python -m venv venv
Activate the venv whenever you work on this project. Note: the command differs by operating system:
Windows:
. venv/Scripts/activate
Mac:
. venv/bin/activate
Your terminal will display the name of your venv when active: (venv).
Install the required packages:
pip install -r requirements.txt
If any packages are installed/updated while developing, remember to freeze the package list:
pip freeze > requirements.txt
If you move to another project, close the terminal to deactivate the current venv.
Install the gpt-data-core package:
pip install git+https://github.com/abtion/gpt-data-core.git
# token_division_list.py
from gpt_data_core import token_division_list

# Split the files in "data" so each chunk fits within the token limit
division_list = token_division_list.TokenDivisionList(
    model="gpt-3.5-turbo-16k",
    max_tokens=8191
)
division_list.process_files("data")
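To illustrate what the division step does, here is a rough stdlib-only sketch (not the gpt-data-core implementation) that splits a text into chunks under a budget, using whitespace-separated words as a crude stand-in for tokens:

```python
def split_by_budget(text: str, max_tokens: int) -> list[str]:
    """Greedily pack words into chunks of at most max_tokens words each."""
    chunks, current = [], []
    for word in text.split():
        current.append(word)
        if len(current) >= max_tokens:
            chunks.append(" ".join(current))
            current = []
    if current:  # flush the last, possibly short, chunk
        chunks.append(" ".join(current))
    return chunks

print(split_by_budget("a b c d e", 2))  # → ['a b', 'c d', 'e']
```

The real implementation counts model tokens rather than words, but the chunking idea is the same: no chunk may exceed the model's limit.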
# create_embeddings.py
from gpt_data_core import embedding_generator, config

openAIConfig = config.Config()

generator = embedding_generator.EmbeddingGenerator(
    openAIConfig.OPENAI_API_KEY,
    openAIConfig.DEFAULT_DATA_PATH,
    openAIConfig.DEFAULT_TEMP_PATH
)

# Embed a single file, or use process_all_files() for the whole data folder
generator.process_file("path/to/file")
# generator.process_all_files()
# ingest_embeddings.py
import os

from gpt_data_core import embedding_ingestor, config, base_schema, redis_client

openAIConfig = config.Config()


def process_file(pipe, embedding_path, data_path, ingestor: embedding_ingestor.EmbeddingIngestor):
    embedding = ingestor.read_json_embedding(embedding_path)
    with open(data_path, "r", encoding="utf8") as f:
        data = f.read()
    ingestor.insert_embedding(
        pipe,
        os.path.basename(data_path),
        data,
        embedding,
    )


redisClient = redis_client.RedisClient(
    openAIConfig.REDIS_HOST,
    openAIConfig.REDIS_PORT,
    openAIConfig.REDIS_PASSWORD
)

ingestor = embedding_ingestor.EmbeddingIngestor(
    redisClient,
    openAIConfig.VECTOR_DIMENSIONS,
    openAIConfig.INDEX_NAME,
    openAIConfig.DOC_PREFIX,
    openAIConfig.DEFAULT_DATA_PATH,
    openAIConfig.DEFAULT_TEMP_PATH
)

# Create the search index, then batch all inserts through one pipeline
schema = base_schema.create_base_schema(openAIConfig.VECTOR_DIMENSIONS)
ingestor.create_index(schema)

pipe = ingestor.redis_client.pipeline()
embeddings_and_data_list = ingestor.collect_embedding_and_data_paths()
for embedding_path, data_path in embeddings_and_data_list:
    process_file(pipe, embedding_path, data_path, ingestor)
pipe.execute()
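The ingestion loop relies on each embedding file in the temp folder matching a source file in the data folder. A hypothetical stdlib-only sketch of that pairing (the `<name>.json` naming convention used here is an assumption for illustration, not the documented gpt-data-core behaviour):

```python
import os

def collect_pairs(data_dir: str, temp_dir: str) -> list[tuple[str, str]]:
    """Pair each embedding JSON in temp_dir with its source file in data_dir.

    Assumes (hypothetically) that an embedding for "<name>" is stored as
    "<name>.json"; source files without an embedding are skipped.
    """
    pairs = []
    for name in sorted(os.listdir(data_dir)):
        embedding_path = os.path.join(temp_dir, name + ".json")
        if os.path.exists(embedding_path):
            pairs.append((embedding_path, os.path.join(data_dir, name)))
    return pairs
```

Whatever the exact convention, the result is a list of (embedding_path, data_path) tuples like the one the script above iterates over.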
Sometimes the database needs additional fields that our chat application depends on.
Configure the Redis schema with the correct field type, and populate the value when inserting each embedding:
# ingest_embeddings.py
from gpt_data_core import ..., base_schema
from redis.commands.search.field import TagField


def process_file(...):
    ...
    newFieldValue = "some-value"
    ingestor.insert_embedding(
        ...
        extraMapping={"newfield": newFieldValue}
    )


# Extend the base schema with the new field, and pass the extended
# schema (abtionschema, not schema) to ingestor.create_index(...)
schema = base_schema.create_base_schema(openAIConfig.VECTOR_DIMENSIONS)
abtionschema = schema + (TagField("newfield"),)
...