Abtion's Dataset Repository

Abtion's Dataset Repository

Welcome to Abtion's Dataset repository! This centralized hub is dedicated to experimenting with document changes as part of our exploration into embeddings-based search. Our goal is to enhance and optimize document embeddings by including them in this dataset and storing their corresponding embeddings in Redis.

Dataset Purpose

The purpose of this dataset is to facilitate experimentation and research on document modifications. We aim to explore different changes to documents and evaluate their impact on the effectiveness of embeddings-based search. By including a variety of documents in this dataset, generating their embeddings, and storing them in Redis, we can conduct thorough experiments and analyze the results.

Getting started

The following environment variables are required to run the scripts:

Variable	Description	Default
REDIS_HOST	The host of the Redis database	localhost
REDIS_PORT	The port of the Redis database	6379
REDIS_PASSWORD	The password of the Redis database	None
OPENAI_API_KEY	The API key for the OpenAI API	None

You can set these environment variables in a .env file in the root of the repository. For example:

REDIS_HOST=localhost
REDIS_PORT=6379
OPENAI_API_KEY=123456

Once you have set the environment variables, you can install the dependencies:

pip install -r requirements.txt

Working with the dataset

Setting up the project can be accomplished in two ways, depending on your specific needs:

Regenerate the Embeddings:

If you have modified the documents and you wish to update the embeddings to reflect these changes, this is the option for you. In this case, you would need to run the text-embedding generation script. This will create new embeddings based on the current state of the documents and the embedding model, replacing the existing JSON files with updated ones.
Populate Redis Database with Existing Embeddings:

If you are satisfied with the current state of the embeddings and do not wish to make any changes, you can skip the regeneration step. Instead, you directly run the Redis database population script. This script will load the existing embeddings (as they currently exist in the JSON files) into the Redis database.

Text-embedding generation script

In most cases, you will not need to run this script. The embeddings have already been generated and stored in the repository. However, if you have modified the documents and you wish to update the embeddings to reflect these changes, you can run the script to regenerate the embeddings.

Our embeddings are produced utilizing OpenAI's text-embedding-ada-v2 model. This model is powered to analyze and convert text data into meaningful embeddings, representing the nuanced semantic qualities of the content.

The script that generates the embeddings has already been executed, and as a result, our repository is populated with JSON files that contain the initial embeddings. These JSON files were created adjacent to the corresponding text or markdown files.

It's important to note that these embeddings are not static; they evolve as OpenAI trains the model or the documents change. Consequently, if the script is run again, it will regenerate the embeddings to reflect the changes brought about by further model training or document changes. This updated information will be stored in the same JSON files, ensuring that the embeddings remain current with the ongoing model development.

The script uses Python's os.walk function. It looks through all directories, starting from the root and going into subdirectories, searching for any files that end with .txt or .md. At the same time, it uses a tool called parse-gitignore. This tool ensures the script skips any directories or files that are listed in a .embedding_ignore file. This way, it only focuses on the files it's supposed to process, ignoring the rest.

Here are some examples of how to use the script:

1. Process a specific file:

If you have a specific file that you want to process, you can provide the path to that file as a command line argument to the script:

python generate_embeddings.py /path/to/your/file.txt

In this example, the script will read the contents of file.txt, create an embedding using the OpenAI API, and save that embedding to a JSON file with the same name (but with a .json extension).

2. Process all files in a specific directory:

If you have a directory of files that you want to process, you can provide the path to that directory as a command line argument:

python generate_embeddings.py /path/to/your/directory

In this example, the script will process all text and markdown files in the specified directory and its subdirectories, excluding files and directories specified in the .embedding_ignore file.

3. Process all files in the current directory:

If you want to process all files in the same directory as the script, you can run the script without any arguments:

python generate_embeddings.py

In this case, the script will process all text and markdown files in its current working directory and its subdirectories, excluding files and directories specified in the .embedding_ignore file.

Successful execution of the script will result in the creation of JSON files containing the embeddings. These files will be created adjacent to the corresponding text or markdown files.

Redis database population script

Running Without Arguments

By default, if you run the script without providing any arguments, it will process the current directory ('.'). The script will walk through all the files in this directory (and its subdirectories) and, for each '.json' file it finds, it will call the insert_embedding function to insert the file's data into the Redis database.

Running With a Specific Directory

If you run the script with a specific directory as an argument, the script will process that directory instead of the current one. It will go through all the files in the specified directory (and its subdirectories), just as it does when run without any arguments.

Running With a Specific File

The script can also be run with a specific file as an argument. When run in this mode, the script will only process that specific file. This feature is particularly useful when you have a large number of files and only need to update the Redis database with data from a single file.

To use this feature, the file provided must be a '.json' file. The script will call the insert_embedding function for this file and insert its data into the Redis database.

Example Usage

Here are some example usage scenarios for the enhanced function:

Without any arguments: Run the script in your terminal without providing any arguments.

python ingest_embeddings.py

With a directory: Provide the path to the directory as an argument when running the script.

python ingest_embeddings.py path/to/directory

With a file: Provide the path to the file as an argument when running the script.

python ingest_embeddings.py path/to/file.json

RedisInsight

RedisInsight is a GUI for Redis that allows you to interact with the Redis database. It can be used to view the contents of the database, run queries, and perform other operations. Download and install RedisInsight from here. It is free to use.

Contributions

We highly encourage contributions from everyone. However, to maintain the quality and coherence of the data, any modifications or additions to the dataset should be proposed via Pull Requests (PRs). The suggested process for making contributions is as follows:

Clone the repository to your local machine.
Create a new branch for your changes.
Make your changes and commit them with a clear and concise commit message.
Push your commits.
Make a Pull Request. Please ensure to provide a detailed description of the changes made and the reasons for them.

Guidelines

Please ensure that all text and markdown documents are structured and well-organized.
Ensure that your changes do not inadvertently introduce any sensitive or private information.

Questions?

Feel free to raise issues in case you have any questions or suggestions. We value your feedback and strive to create a positive and collaborative environment for all contributors.

Name		Name	Last commit message	Last commit date
Latest commit History 23 Commits
inside.abtion.com		inside.abtion.com
.embedding_ignore		.embedding_ignore
.gitignore		.gitignore
.tool-versions		.tool-versions
README.md		README.md
abtion.json		abtion.json
abtion.txt		abtion.txt
config.py		config.py
docker-compose.yml		docker-compose.yml
files.png		files.png
generate_embeddings.py		generate_embeddings.py
generate_yaml_with_abtion_links.py		generate_yaml_with_abtion_links.py
ignore_filter.py		ignore_filter.py
ingest_embeddings.py		ingest_embeddings.py
redis-insight.png		redis-insight.png
redis_client.py		redis_client.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Abtion's Dataset Repository

Dataset Purpose

Getting started

Working with the dataset

Text-embedding generation script

Redis database population script

Running Without Arguments

Running With a Specific Directory

Running With a Specific File

Example Usage

RedisInsight

Contributions

Guidelines

Questions?

About

Releases

Packages

Contributors 2

Languages

abtion/abtions-dataset

Folders and files

Latest commit

History

Repository files navigation

Abtion's Dataset Repository

Dataset Purpose

Getting started

Working with the dataset

Text-embedding generation script

Redis database population script

Running Without Arguments

Running With a Specific Directory

Running With a Specific File

Example Usage

RedisInsight

Contributions

Guidelines

Questions?

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages