Merge pull request #64 from topoteretes/adding_ollama
Local embeddings + ollama
Vasilije1990 authored Mar 29, 2024
2 parents d5f57be + fde01fe commit 6350e67
Showing 48 changed files with 1,449 additions and 2,917 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -165,5 +165,6 @@ cython_debug/
.vscode/
database/data/
cognee/data/
cognee/cache/

.DS_Store
# .DS_Store
6 changes: 6 additions & 0 deletions .test_data/062c22df-d99b-599f-90cd-2d325c8bcf69.txt
@@ -0,0 +1,6 @@
A quantum computer is a computer that takes advantage of quantum mechanical phenomena.
At small scales, physical matter exhibits properties of both particles and waves, and quantum computing leverages this behavior, specifically quantum superposition and entanglement, using specialized hardware that supports the preparation and manipulation of quantum states.
Classical physics cannot explain the operation of these quantum devices, and a scalable quantum computer could perform some calculations exponentially faster (with respect to input size scaling)[2] than any modern "classical" computer. In particular, a large-scale quantum computer could break widely used encryption schemes and aid physicists in performing physical simulations; however, the current state of the technology is largely experimental and impractical, with several obstacles to useful applications. Moreover, scalable quantum computers do not hold promise for many practical tasks, and for many important tasks quantum speedups are proven impossible.
The basic unit of information in quantum computing is the qubit, similar to the bit in traditional digital electronics. Unlike a classical bit, a qubit can exist in a superposition of its two "basis" states. When measuring a qubit, the result is a probabilistic output of a classical bit, therefore making quantum computers nondeterministic in general. If a quantum computer manipulates the qubit in a particular way, wave interference effects can amplify the desired measurement results. The design of quantum algorithms involves creating procedures that allow a quantum computer to perform calculations efficiently and quickly.
Physically engineering high-quality qubits has proven challenging. If a physical qubit is not sufficiently isolated from its environment, it suffers from quantum decoherence, introducing noise into calculations. Paradoxically, perfectly isolating qubits is also undesirable because quantum computations typically need to initialize qubits, perform controlled qubit interactions, and measure the resulting quantum states. Each of those operations introduces errors and suffers from noise, and such inaccuracies accumulate.
In principle, a non-quantum (classical) computer can solve the same computational problems as a quantum computer, given enough time. Quantum advantage comes in the form of time complexity rather than computability, and quantum complexity theory shows that some quantum algorithms for carefully selected tasks require exponentially fewer computational steps than the best known non-quantum algorithms. Such tasks can in theory be solved on a large-scale quantum computer whereas classical computers would not finish computations in any reasonable amount of time. However, quantum speedup is not universal or even typical across computational tasks, since basic tasks such as sorting are proven to not allow any asymptotic quantum speedup. Claims of quantum supremacy have drawn significant attention to the discipline, but are demonstrated on contrived tasks, while near-term practical use cases remain limited.
2 changes: 1 addition & 1 deletion README.md
@@ -105,7 +105,7 @@ pip install "cognee[weaviate]"
With poetry:

```bash
poetry add "cognee["weaviate"]"
poetry add "cognee[weaviate]"
```

## 💻 Usage
1 change: 1 addition & 0 deletions cognee/__init__.py
@@ -1,3 +1,4 @@
from .api.v1.config.config import config
from .api.v1.add.add import add
from .api.v1.cognify.cognify import cognify
from .api.v1.list_datasets.list_datasets import list_datasets
118 changes: 79 additions & 39 deletions cognee/api/v1/add/add.py
@@ -3,53 +3,47 @@
import asyncio
import dlt
import duckdb
from unstructured.cleaners.core import clean
from cognee.root_dir import get_absolute_path
import cognee.modules.ingestion as ingestion
from cognee.infrastructure import infrastructure_config
from cognee.infrastructure.files import get_file_metadata
from cognee.infrastructure.files.storage import LocalStorage

async def add(file_paths: Union[str, List[str]], dataset_name: str = None):
if isinstance(file_paths, str):
# Directory path provided, we need to extract the file paths and dataset name

def list_dir_files(root_dir_path: str, parent_dir: str = "root"):
datasets = {}

for file_or_dir in listdir(root_dir_path):
if path.isdir(path.join(root_dir_path, file_or_dir)):
dataset_name = file_or_dir if parent_dir == "root" else parent_dir + "." + file_or_dir
dataset_name = clean(dataset_name.replace(" ", "_"))

nested_datasets = list_dir_files(path.join(root_dir_path, file_or_dir), dataset_name)

for dataset in nested_datasets:
datasets[dataset] = nested_datasets[dataset]
else:
if parent_dir not in datasets:
datasets[parent_dir] = []

datasets[parent_dir].append(path.join(root_dir_path, file_or_dir))

return datasets

datasets = list_dir_files(file_paths)

results = []

for key in datasets:
if dataset_name is not None and not key.startswith(dataset_name):
continue
async def add(data_path: Union[str, List[str]], dataset_name: str = None):
if isinstance(data_path, str):
# data_path is a data directory path
if "data://" in data_path:
return await add_data_directory(data_path.replace("data://", ""), dataset_name)
# data_path is a file path
if "file://" in data_path:
return await add([data_path], dataset_name)
# data_path is a text
else:
return await add_text(data_path, dataset_name)

# data_path is a list of file paths
return await add_files(data_path, dataset_name)

async def add_files(file_paths: List[str], dataset_name: str):
data_directory_path = infrastructure_config.get_config()["data_path"]
db_path = get_absolute_path("./data/cognee")
db_location = f"{db_path}/cognee.duckdb"

results.append(add(datasets[key], dataset_name = key))
LocalStorage.ensure_directory_exists(db_path)

return await asyncio.gather(*results)
processed_file_paths = []

for file_path in file_paths:
file_path = file_path.replace("file://", "")

db_path = get_absolute_path("./data/cognee")
db_location = f"{db_path}/cognee.duckdb"
if data_directory_path not in file_path:
file_name = file_path.split("/")[-1]
dataset_file_path = data_directory_path + "/" + dataset_name.replace('.', "/") + "/" + file_name

LocalStorage.ensure_directory_exists(db_path)
LocalStorage.copy_file(file_path, dataset_file_path)
processed_file_paths.append(dataset_file_path)
else:
processed_file_paths.append(file_path)

db = duckdb.connect(db_location)

@@ -82,10 +76,56 @@ def data_resources(file_paths: str):
}

run_info = pipeline.run(
data_resources(file_paths),
data_resources(processed_file_paths),
table_name = "file_metadata",
dataset_name = dataset_name,
dataset_name = dataset_name.replace(" ", "_").replace(".", "_") if dataset_name is not None else "main_dataset",
write_disposition = "merge",
)

return run_info

def extract_datasets_from_data(root_dir_path: str, parent_dir: str = "root"):
datasets = {}

root_dir_path = root_dir_path.replace("file://", "")

for file_or_dir in listdir(root_dir_path):
if path.isdir(path.join(root_dir_path, file_or_dir)):
dataset_name = file_or_dir if parent_dir == "root" else parent_dir + "." + file_or_dir

nested_datasets = extract_datasets_from_data("file://" + path.join(root_dir_path, file_or_dir), dataset_name)

for dataset in nested_datasets.keys():
datasets[dataset] = nested_datasets[dataset]
else:
if parent_dir not in datasets:
datasets[parent_dir] = []

datasets[parent_dir].append(path.join(root_dir_path, file_or_dir))

return datasets

async def add_data_directory(data_path: str, dataset_name: str = None):
datasets = extract_datasets_from_data(data_path)

results = []

for key in datasets.keys():
if dataset_name is None or key.startswith(dataset_name):
results.append(add(datasets[key], dataset_name = key))

return await asyncio.gather(*results)

async def add_text(text: str, dataset_name: str):
data_directory_path = infrastructure_config.get_config()["data_path"]

classified_data = ingestion.classify(text)
data_id = ingestion.identify(classified_data)

storage_path = data_directory_path + "/" + dataset_name.replace(".", "/")
LocalStorage.ensure_directory_exists(storage_path)

text_file_name = str(data_id) + ".txt"
LocalStorage(storage_path).store(text_file_name, classified_data.get_data())

return await add(["file://" + storage_path + "/" + text_file_name], dataset_name)
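A minimal usage sketch of the reworked `add()` entry point, based on the dispatch logic in the diff above: a `data://` prefix routes to `add_data_directory`, a `file://` prefix routes through `add_files`, a list is treated as file paths, and any other string is treated as raw text. The paths and dataset names below are illustrative assumptions, not part of this change.

```python
# Illustrative sketch only; paths and dataset names are assumptions.
import asyncio
import cognee

async def main():
    # Raw text: stored under the dataset directory, then re-added as a file.
    await cognee.add("Quantum computers use qubits.", dataset_name = "physics")

    # Single file: the "file://" prefix routes through add_files().
    await cognee.add("file:///tmp/notes/intro.txt", dataset_name = "notes")

    # Data directory: sub-folders become dot-separated dataset names.
    await cognee.add("data:///tmp/datasets")

asyncio.run(main())
```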
2 changes: 1 addition & 1 deletion cognee/api/v1/add/add_standalone.py
@@ -50,4 +50,4 @@ async def add_standalone(


def is_data_path(data: str) -> bool:
return False if not isinstance(data, str) else data.startswith("file://")
return False if not isinstance(data, str) else data.startswith("file://")
2 changes: 1 addition & 1 deletion cognee/api/v1/add/remember.py
@@ -18,4 +18,4 @@ async def remember(user_id: str, memory_name: str, payload: List[str]):
if await is_existing_memory(memory_name) is False:
raise MemoryException(f"Memory with the name \"{memory_name}\" doesn't exist.")

await create_information_points(memory_name, payload)
await create_information_points(memory_name, payload)
36 changes: 18 additions & 18 deletions cognee/api/v1/cognify/cognify.py
@@ -3,8 +3,6 @@
from typing import List, Union
import instructor
from openai import OpenAI
from unstructured.cleaners.core import clean
from unstructured.partition.pdf import partition_pdf
from cognee.modules.cognify.graph.add_classification_nodes import add_classification_nodes
from cognee.modules.cognify.llm.label_content import label_content
from cognee.modules.cognify.graph.add_label_nodes import add_label_nodes
@@ -27,7 +25,8 @@
from cognee.infrastructure.databases.relational import DuckDBAdapter
from cognee.modules.cognify.graph.add_document_node import add_document_node
from cognee.modules.cognify.graph.initialize_graph import initialize_graph
from cognee.infrastructure.databases.vector import CollectionConfig, VectorConfig
from cognee.infrastructure.files.utils.guess_file_type import guess_file_type
from cognee.infrastructure.files.utils.extract_text_from_file import extract_text_from_file
from cognee.infrastructure import infrastructure_config

config = Config()
Expand All @@ -37,7 +36,7 @@

USER_ID = "default_user"

async def cognify(datasets: Union[str, List[str]] = None, graphdatamodel: object = None):
async def cognify(datasets: Union[str, List[str]] = None, graph_data_model: object = None):
"""This function is responsible for the cognitive processing of the content."""

db = DuckDBAdapter()
Expand All @@ -47,23 +46,32 @@ async def cognify(datasets: Union[str, List[str]] = None, graphdatamodel: object

awaitables = []

# datasets is a list of dataset names
if isinstance(datasets, list):
for dataset in datasets:
awaitables.append(cognify(dataset))

graphs = await asyncio.gather(*awaitables)
return graphs[0]

files_metadata = db.get_files_metadata(datasets)
# datasets is a dataset name string
added_datasets = db.get_datasets()

files_metadata = []
dataset_name = datasets.replace(".", "_").replace(" ", "_")

for added_dataset in added_datasets:
if dataset_name in added_dataset:
files_metadata.extend(db.get_files_metadata(added_dataset))

awaitables = []

await initialize_graph(USER_ID,graphdatamodel)
await initialize_graph(USER_ID, graph_data_model)

for file_metadata in files_metadata:
with open(file_metadata["file_path"], "rb") as file:
elements = partition_pdf(file = file, strategy = "fast")
text = "\n".join(map(lambda element: clean(element.text), elements))
file_type = guess_file_type(file)
text = extract_text_from_file(file, file_type)

awaitables.append(process_text(text, file_metadata))

@@ -159,26 +167,18 @@ async def generate_graph_per_layer(text_input: str, layers: List[str], response_

unique_layers = nodes_by_layer.keys()

collection_config = CollectionConfig(
vector_config = VectorConfig(
distance = "Cosine",
size = 3072
)
)

try:
db_engine = infrastructure_config.get_config()["vector_engine"]

for layer in unique_layers:
await db_engine.create_collection(layer, collection_config)
await db_engine.create_collection(layer)
except Exception as e:
print(e)

await add_propositions(nodes_by_layer)

results = await resolve_cross_graph_references(nodes_by_layer)


relationships = graph_ready_output(results)
# print(relationships)
await graph_client.load_graph_from_file()
Expand All @@ -202,4 +202,4 @@ async def main():
print(graph_url)


asyncio.run(main())
asyncio.run(main())
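A hedged sketch of calling the updated `cognify()`: a single dataset name is normalized (spaces and dots become underscores) and matched against datasets that were added earlier, while a list of names fans out into concurrent runs, with the first resulting graph returned per the diff. The dataset names below are assumptions for illustration.

```python
# Illustrative sketch only; assumes these datasets were added with cognee.add() beforehand.
import asyncio
import cognee

async def main():
    # One dataset: "my notes" is normalized to "my_notes" before lookup.
    graph = await cognee.cognify("my notes")

    # Several datasets: the diff shows these are gathered concurrently.
    graph = await cognee.cognify(["my notes", "physics"])

asyncio.run(main())
```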
1 change: 1 addition & 0 deletions cognee/api/v1/config/__init__.py
@@ -0,0 +1 @@
from .config import config
9 changes: 9 additions & 0 deletions cognee/api/v1/config/config.py
@@ -0,0 +1,9 @@
from typing import Optional
from cognee.infrastructure import infrastructure_config

class config():
@staticmethod
def data_path(data_path: Optional[str] = None) -> str:
infrastructure_config.set_config({
"data_path": data_path
})
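The new `config` class exported from `cognee` simply forwards its value into `infrastructure_config`; a small sketch of how it could be used (the path below is a placeholder assumption):

```python
import cognee

# Point cognee at a custom data directory before adding files.
cognee.config.data_path("/var/lib/cognee/data")  # hypothetical path
```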
15 changes: 12 additions & 3 deletions cognee/config.py
@@ -19,14 +19,16 @@
class Config:
""" Configuration for cognee - cognitive architecture framework. """
cognee_dir: str = field(
default_factory=lambda: os.getenv("COG_ARCH_DIR", "cognitive_achitecture")
default_factory=lambda: os.getenv("COG_ARCH_DIR", "cognee")
)
config_path: str = field(
default_factory=lambda: os.path.join(
os.getenv("COG_ARCH_DIR", "cognitive_achitecture"), "config"
os.getenv("COG_ARCH_DIR", "cognee"), "config"
)
)

data_path = os.getenv("DATA_PATH", str(Path(__file__).resolve().parent.parent / ".data"))

db_path = str(Path(__file__).resolve().parent / "data/system")

vectordb: str = os.getenv("VECTORDB", "weaviate")
@@ -43,6 +45,13 @@ class Config:
graph_filename = os.getenv("GRAPH_NAME", "cognee_graph.pkl")

# Model parameters
llm_provider: str = "openai" #openai, or custom or ollama
custom_model: str = "mistralai/Mixtral-8x7B-Instruct-v0.1"
custom_endpoint: str = "https://api.endpoints.anyscale.com/v1" # pass claude endpoint
custom_key: Optional[str] = os.getenv("ANYSCALE_API_KEY")
ollama_endpoint: str = "http://localhost:11434/v1"
ollama_key: Optional[str] = "ollama"
ollama_model: str = "mistral:instruct"
model: str = "gpt-4-0125-preview"
# model: str = "gpt-3.5-turbo"
model_endpoint: str = "openai"
@@ -53,7 +62,7 @@ class Config:
graphistry_password = os.getenv("GRAPHISTRY_PASSWORD")

# Embedding parameters
embedding_model: str = "openai"
embedding_model: str = "BAAI/bge-large-en-v1.5"
embedding_dim: int = 1536
embedding_chunk_size: int = 300

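A sketch of how the new provider fields might be pointed at a local Ollama server. The endpoint and model mirror the defaults shown in the diff above; driving `Config` by assigning attributes after `load()` is an assumption about intended use, not a documented workflow.

```python
# Illustrative sketch; endpoint and model values mirror the defaults added in this diff.
from cognee.config import Config

config = Config()
config.load()
config.llm_provider = "ollama"
config.ollama_endpoint = "http://localhost:11434/v1"  # default from the diff
config.ollama_model = "mistral:instruct"              # default from the diff
```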
30 changes: 25 additions & 5 deletions cognee/infrastructure/InfrastructureConfig.py
@@ -1,32 +1,52 @@
from cognee.config import Config
from .databases.relational import SqliteEngine, DatabaseEngine
from .databases.vector import WeaviateAdapter, VectorDBInterface
from .databases.vector.weaviate_db import WeaviateAdapter
from .databases.vector.vector_db_interface import VectorDBInterface
from .databases.vector.embeddings.DefaultEmbeddingEngine import DefaultEmbeddingEngine
from .llm.llm_interface import LLMInterface
from .llm.openai.adapter import OpenAIAdapter

config = Config()
config.load()

class InfrastructureConfig():
data_path: str = config.data_path
database_engine: DatabaseEngine = None
vector_engine: VectorDBInterface = None
llm_engine: LLMInterface = None

def get_config(self) -> dict:
if self.database_engine is None:
self.database_engine = SqliteEngine(config.db_path, config.db_name)

if self.llm_engine is None:
self.llm_engine = OpenAIAdapter(config.openai_key, config.model)

if self.vector_engine is None:
self.vector_engine = WeaviateAdapter(
config.weaviate_url,
config.weaviate_api_key,
config.openai_key
embedding_engine = DefaultEmbeddingEngine()
)

return {
"data_path": self.data_path,
"llm_engine": self.llm_engine,
"vector_engine": self.vector_engine,
"database_engine": self.database_engine,
"vector_engine": self.vector_engine
}

def set_config(self, new_config: dict):
self.database_engine = new_config["database_engine"]
self.vector_engine = new_config["vector_engine"]
if "data_path" in new_config:
self.data_path = new_config["data_path"]

if "database_engine" in new_config:
self.database_engine = new_config["database_engine"]

if "vector_engine" in new_config:
self.vector_engine = new_config["vector_engine"]

if "llm_engine" in new_config:
self.llm_engine = new_config["llm_engine"]

infrastructure_config = InfrastructureConfig()
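After this change `set_config()` only overrides the keys it is given, so partial updates like the one below become possible; the path is a placeholder assumption.

```python
from cognee.infrastructure import infrastructure_config

# Override just the data path; other keys ("llm_engine", "vector_engine",
# "database_engine") are left untouched and keep their lazily created defaults.
infrastructure_config.set_config({ "data_path": "/var/lib/cognee/data" })
```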
