Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cog 793 metadata rework #460

Merged
merged 10 commits into from
Jan 23, 2025
Merged

Cog 793 metadata rework #460

merged 10 commits into from
Jan 23, 2025

Conversation

dexters1
Copy link
Collaborator

@dexters1 dexters1 commented Jan 20, 2025

Description

Simplify Foreign Metadata handling in Cognee
Show Foreign Metadata in graphs as part of the Document node
Remove unused methods and files related to metadata ingestion

Picture showing foreign metadata in graph property of document node:
image

Foreign metadata refers to metadata related do documents that were not processed by Cognee, for example if using LlamaIndex to handle documents

DCO Affirmation

I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin

Summary by CodeRabbit

Release Notes

  • New Features

    • Enhanced data ingestion process with more flexible data type handling.
    • Introduced support for external metadata storage in data models.
  • Breaking Changes

    • Removed metadata-specific functions and relationships.
    • Replaced metadata_id with external_metadata across document types.
    • Simplified data ingestion workflow by removing metadata-specific methods.
  • Improvements

    • Updated error handling in the embedding engine.
    • Streamlined data chunking and processing logic.
    • Improved flexibility in data and document management.
  • Refactoring

    • Simplified import and task creation processes.
    • Removed deprecated metadata-related modules and functions.

Copy link
Contributor

coderabbitai bot commented Jan 20, 2025

Walkthrough

This pull request introduces significant changes to the data and metadata handling in the Cognee system. The modifications primarily involve removing the Metadata class, updating data models, and simplifying the data ingestion process. The changes remove explicit metadata management, replacing it with a more flexible external_metadata approach across various components of the system, including document processing, task ingestion, and database models.

Changes

File/Path Change Summary
cognee/api/v1/add/add_v2.py Replaced ingest_data_with_metadata import with ingest_data and updated task creation.
cognee/api/v1/cognify/code_graph_pipeline.py Updated task instantiation to use ingest_data.
cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py Simplified error handling for BadRequestError.
cognee/modules/data/models/ Removed Metadata.py, added external_metadata to Data class, removed metadata_relationship.
cognee/modules/data/processing/document_types/Document.py Replaced metadata_id with external_metadata.
cognee/tasks/ingestion/ Removed metadata-related functions, updated ingest_data.py to accept generic data input and modified related function signatures.
Test files Updated document class instantiations to use external_metadata.
notebooks/cognee_llama_index.ipynb Replaced import of ingest_data_with_metadata with ingest_data and updated related function signatures.

Possibly related PRs

Suggested reviewers

  • Vasilije1990

Poem

🐰 Metadata's dance has come to an end,
External fields now where structures would bend
No more tables of rigid design
Just flexible data that smoothly align
A rabbit's code hop, light and free! 🌈

✨ Finishing Touches
  • 📝 Generate Docstrings (Beta)

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@dexters1 dexters1 self-assigned this Jan 20, 2025
Resolve issue with LlamaIndex notebook after refactor
@dexters1 dexters1 marked this pull request as ready for review January 20, 2025 17:06
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 10

🧹 Nitpick comments (5)
cognee/tasks/ingestion/ingest_data.py (2)

126-127: Consider combining permission grants into a single call

On lines 126-127, permissions are granted separately:

await give_permission_on_document(user, data_id, "read")
await give_permission_on_document(user, data_id, "write")

If the give_permission_on_document function supports multiple permissions at once, consider combining them to reduce the number of calls:

await give_permission_on_document(user, data_id, ["read", "write"])

134-135: Variable db_engine is unused after assignment

On line 134, db_engine is assigned but not used in the subsequent code:

db_engine = get_relational_engine()

If db_engine is not needed beyond this point, consider removing the assignment to clean up the code.

- db_engine = get_relational_engine()

Alternatively, if it is needed elsewhere, ensure it's utilized appropriately.

cognee/tasks/ingestion/save_data_item_to_storage.py (1)

7-7: Consider using a more specific type than Any.

While Any provides flexibility, consider creating a custom type or protocol for supported data items to maintain type safety and improve code documentation.

-from typing import Union, BinaryIO, Any
+from typing import Union, BinaryIO, Protocol
+
+class DataItem(Protocol):
+    """Protocol for supported data items in storage."""
+    def __str__(self) -> str: ...
+
+class LlamaIndexItem(DataItem):
+    """Type hint for LlamaIndex data items."""
+    pass

-async def save_data_item_to_storage(data_item: Union[BinaryIO, str, Any], dataset_name: str) -> str:
+async def save_data_item_to_storage(data_item: Union[BinaryIO, str, LlamaIndexItem], dataset_name: str) -> str:
cognee/tests/integration/documents/UnstructuredDocument_test.py (1)

42-42: Enhance test coverage for foreign_metadata.

The test currently uses empty strings for foreign_metadata across all document types. Consider adding test cases with:

  • Non-empty foreign metadata
  • Different metadata structures (e.g., JSON objects)
  • Edge cases (e.g., very large metadata)

Also applies to: 50-50, 58-58, 66-66

notebooks/cognee_llama_index.ipynb (1)

156-159: Document the new metadata handling approach.

Since this notebook serves as an example, it would be helpful to add documentation showing how Foreign Metadata from LlamaIndex Documents appears in the graph visualization. Consider adding a markdown cell after the graph rendering to explain this.

Add a markdown cell like this:

await add(documents)

# Use LLMs and cognee to create knowledge graph
await cognee.cognify()
+```
+
+## Understanding Foreign Metadata in the Graph
+
+When using LlamaIndex Documents, their metadata is now displayed as part of the Document node's properties in the graph. You can observe this in the visualization above, where each Document node includes the original LlamaIndex metadata.
+```
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bbb8e89 and bd3a5a7.

📒 Files selected for processing (23)
  • cognee/api/v1/add/add_v2.py (2 hunks)
  • cognee/api/v1/cognify/code_graph_pipeline.py (2 hunks)
  • cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py (1 hunks)
  • cognee/modules/chunking/TextChunker.py (1 hunks)
  • cognee/modules/data/models/Data.py (2 hunks)
  • cognee/modules/data/models/Metadata.py (0 hunks)
  • cognee/modules/data/operations/delete_metadata.py (0 hunks)
  • cognee/modules/data/operations/get_metadata.py (0 hunks)
  • cognee/modules/data/operations/write_metadata.py (0 hunks)
  • cognee/modules/data/processing/document_types/Document.py (1 hunks)
  • cognee/tasks/documents/classify_documents.py (2 hunks)
  • cognee/tasks/ingestion/__init__.py (1 hunks)
  • cognee/tasks/ingestion/ingest_data.py (1 hunks)
  • cognee/tasks/ingestion/ingest_data_with_metadata.py (0 hunks)
  • cognee/tasks/ingestion/save_data_item_to_storage.py (1 hunks)
  • cognee/tasks/ingestion/save_data_item_with_metadata_to_storage.py (0 hunks)
  • cognee/tasks/ingestion/save_data_to_storage.py (0 hunks)
  • cognee/tests/integration/documents/AudioDocument_test.py (1 hunks)
  • cognee/tests/integration/documents/ImageDocument_test.py (1 hunks)
  • cognee/tests/integration/documents/PdfDocument_test.py (1 hunks)
  • cognee/tests/integration/documents/TextDocument_test.py (1 hunks)
  • cognee/tests/integration/documents/UnstructuredDocument_test.py (1 hunks)
  • notebooks/cognee_llama_index.ipynb (3 hunks)
💤 Files with no reviewable changes (7)
  • cognee/modules/data/operations/get_metadata.py
  • cognee/tasks/ingestion/save_data_to_storage.py
  • cognee/modules/data/models/Metadata.py
  • cognee/modules/data/operations/delete_metadata.py
  • cognee/tasks/ingestion/save_data_item_with_metadata_to_storage.py
  • cognee/tasks/ingestion/ingest_data_with_metadata.py
  • cognee/modules/data/operations/write_metadata.py
⏰ Context from checks skipped due to timeout of 90000ms (5)
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: windows-latest
  • GitHub Check: docker-compose-test
🔇 Additional comments (12)
cognee/tasks/ingestion/ingest_data.py (2)

142-152: Ensure consistent dataset names for different database engines

The dataset names differ between SQLite and other databases:

  • SQLite uses "main" as the dataset_name.
  • Other databases use "public" as the dataset_name.

Confirm that this discrepancy is intentional and that downstream processes handle these dataset names correctly.


29-36: Improve warning message and ensure consistent metadata handling

In the get_foreign_metadata_dict function, there's a couple of issues to address:

  • Ellipsis in warning message: The warning message on line 34 includes an ellipsis (...), which may be unintentional and could confuse the reader.
  • Returning an empty dictionary: When the data_item does not have a dict method, the function returns an empty dictionary. This might lead to unexpected behavior downstream if not handled properly.

Apply this diff to improve the warning message and consider if returning an empty dictionary is appropriate:

 def get_foreign_metadata_dict(data_item: Union[BinaryIO, str, Any]) -> dict[str, Any]:
     if hasattr(data_item, "dict") and inspect.ismethod(getattr(data_item, "dict")):
         return {"metadata": data_item.dict(), "origin": str(type(data_item))}
     else:
-        warnings.warn(
-            f"Data of type {type(data_item)}... does not have dict method. Returning empty metadata."
-        )
+        warnings.warn(
+            f"Data of type {type(data_item)} does not have a 'dict' method. Returning empty metadata."
+        )
         return {}

Ensure that functions consuming foreign_metadata can handle cases where it's an empty dictionary.

✅ Verification successful

Warning message improvement is safe to apply, empty metadata handling is correct

The codebase is designed to handle empty metadata safely:

  • The foreign_metadata field is optional in the Document model
  • Empty metadata is wrapped in a tuple for additional safety
  • No evidence of code requiring non-empty metadata
🏁 Scripts executed

The following scripts were executed for the analysis:

Error: No shell scripts found


Script:

#!/bin/bash
# Find calls to get_foreign_metadata_dict and its usage context
ast-grep --pattern 'get_foreign_metadata_dict($_)'

# Search for direct references to the function
rg "get_foreign_metadata_dict" -A 3 -B 3

# Look for metadata dictionary access patterns
rg "foreign_metadata\[" -A 2 -B 2
rg "metadata.*dict" -A 2 -B 2

Length of output: 48549

cognee/tasks/ingestion/__init__.py (1)

2-2: Update import statements to reflect refactoring

The import of ingest_data replaces the previous imports of deprecated functions. This change aligns with the refactoring objectives to simplify the ingestion process.

cognee/modules/data/models/Data.py (1)

3-4: LGTM: Import changes align with new schema requirements.

The addition of JSON type import supports the new foreign_metadata column.

cognee/api/v1/add/add_v2.py (2)

25-25: LGTM: Task pipeline update is consistent.

The task list correctly reflects the simplified ingestion process.


5-5: Verify the impact of removing metadata-specific ingestion.

The change from ingest_data_with_metadata to ingest_data aligns with the metadata rework. However, we should verify:

  1. Handling of existing metadata during ingestion
  2. Impact on downstream processes expecting metadata

Let's check for dependencies and usages:

✅ Verification successful

Metadata handling is properly maintained in the new implementation

The consolidation of ingest_data_with_metadata into ingest_data is a safe refactoring that preserves all metadata functionality. The new implementation:

  • Captures and stores all file metadata (name, path, extension, mime_type, hash)
  • Preserves custom metadata through foreign_metadata field
  • Maintains backward compatibility with existing data structures
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for references to metadata ingestion
rg -g '!*.pyc' 'ingest_data_with_metadata'

# Check for potential impacts on downstream processes
rg -g '!*.pyc' -A 5 'metadata.*process|process.*metadata'

Length of output: 37394


Script:

#!/bin/bash
# Check implementation of ingest_data and metadata handling
ast-grep --pattern 'def ingest_data($$$)'

# Look for metadata handling patterns in ingestion module
rg -g '!*.pyc' -A 5 'metadata.*ingest|ingest.*metadata' cognee/tasks/ingestion/

Length of output: 125


Script:

#!/bin/bash
# Find ingestion implementation with broader pattern
fd . cognee/tasks/ingestion -t f -x cat {}

# Search for any metadata handling patterns
rg -g '!*.pyc' 'metadata' cognee/tasks/

Length of output: 17095

cognee/tests/integration/documents/AudioDocument_test.py (1)

32-32: Similar improvements needed as in TextDocument_test.py.

See the review comment in TextDocument_test.py about adding test cases for foreign_metadata handling.

cognee/tests/integration/documents/ImageDocument_test.py (1)

21-21: Similar improvements needed as in TextDocument_test.py.

See the review comment in TextDocument_test.py about adding test cases for foreign_metadata handling.

cognee/api/v1/cognify/code_graph_pipeline.py (1)

13-13: LGTM! Changes align with metadata simplification objective.

The transition from ingest_data_with_metadata to ingest_data is clean and maintains the existing task configuration structure.

Also applies to: 71-71

notebooks/cognee_llama_index.ipynb (3)

122-124: Cell metadata changes look good.

These changes reset the cell execution state, which is a good practice for notebooks that will be shared or committed to version control.


136-136: Verify backward compatibility of the new ingestion method.

The change from ingest_data_with_metadata to ingest_data aligns with the PR's goal of simplifying metadata handling. However, we should verify that this change maintains compatibility with existing LlamaIndex documents.

Let's check if there are any other notebooks or examples that might be affected:

✅ Verification successful

Import change is consistent and properly migrated

The change from ingest_data_with_metadata to ingest_data has been properly implemented across the codebase. The new function is consistently used in all relevant places, and there are no remaining references to the old implementation.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for any remaining references to the old ingestion method
rg "ingest_data_with_metadata" --type py --type jupyter

Length of output: 55


Script:

#!/bin/bash
# Check the implementation of ingest_data and its usages
ast-grep --pattern 'def ingest_data($$$)'

# Find other references to ingest_data to understand usage patterns
rg "ingest_data" -A 3 --type py --type jupyter

Length of output: 2465


156-156: Verify preservation of LlamaIndex Document metadata.

Since this PR changes how Foreign Metadata is handled, we should verify that the LlamaIndex Document metadata is properly preserved in the Document node's graph property.

Let's check the implementation:

✅ Verification successful

Metadata handling is properly implemented and preserved

The change from ingest_data_with_metadata to ingest_data is safe as the metadata from LlamaIndex Documents is automatically extracted and stored in the foreign_metadata field through the get_foreign_metadata_dict helper function.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check how LlamaIndex Document metadata is handled in the new implementation
ast-grep --pattern 'def ingest_data($$$):
  $$$
  # Look for metadata handling
  $$$
'

Length of output: 88


Script:

#!/bin/bash
# Find the ingest_data function definition
rg "def ingest_data" -A 10

# Also search for metadata handling near ingest_data
rg "ingest_data.*metadata|metadata.*ingest_data" -A 5

Length of output: 989


Script:

#!/bin/bash
# Get complete implementation of ingest_data function
cat cognee/tasks/ingestion/ingest_data.py

# Search for LlamaIndex Document handling
rg -l "from llama_index" 
rg "Document.*metadata" -A 5

Length of output: 44109

data_point.mime_type = file_metadata["mime_type"]
data_point.owner_id = user.id
data_point.content_hash = file_metadata["content_hash"]
data_point.foreign_metadata = (get_foreign_metadata_dict(data_item),)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Assigning tuple to foreign_metadata may cause data type issues

On line 98, foreign_metadata is assigned as a tuple containing a dictionary:

data_point.foreign_metadata = (get_foreign_metadata_dict(data_item),)

This assignment creates a tuple with one element, which may not be the intended data type for foreign_metadata. Based on the DataPoint model, foreign_metadata is expected to be an optional string.

To fix this issue, remove the parentheses and comma to assign the dictionary directly:

- data_point.foreign_metadata = (get_foreign_metadata_dict(data_item),)
+ data_point.foreign_metadata = get_foreign_metadata_dict(data_item)

Or, if foreign_metadata should be a JSON string, serialize it:

+ import json
  ...
- data_point.foreign_metadata = (get_foreign_metadata_dict(data_item),)
+ data_point.foreign_metadata = json.dumps(get_foreign_metadata_dict(data_item))

mime_type=file_metadata["mime_type"],
owner_id=user.id,
content_hash=file_metadata["content_hash"],
foreign_metadata=get_foreign_metadata_dict(data_item),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Ensure correct assignment of foreign_metadata when creating Data object

Similar to the previous issue, on line 109, foreign_metadata is assigned:

foreign_metadata=get_foreign_metadata_dict(data_item),

The trailing comma may inadvertently create a tuple, leading to potential data type issues.

Apply this diff to correct the assignment:

- foreign_metadata=get_foreign_metadata_dict(data_item),
+ foreign_metadata=get_foreign_metadata_dict(data_item)

Or serialize if necessary:

+ import json
  ...
- foreign_metadata=get_foreign_metadata_dict(data_item)
+ foreign_metadata=json.dumps(get_foreign_metadata_dict(data_item))

@@ -7,7 +7,7 @@
class Document(DataPoint):
name: str
raw_data_location: str
metadata_id: UUID
foreign_metadata: Optional[str]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Potential type mismatch for foreign_metadata attribute

On line 10, foreign_metadata is defined as:

foreign_metadata: Optional[str]

However, elsewhere in the code (e.g., ingest_data.py), foreign_metadata is assigned a dictionary, which may cause type inconsistencies.

Consider changing the type annotation to accept a dictionary or ensure that foreign_metadata is always a string (e.g., by serializing the dictionary to JSON):

- foreign_metadata: Optional[str]
+ from typing import Optional, Dict, Any
+ foreign_metadata: Optional[Dict[str, Any]]

Or, if foreign_metadata should remain a string, ensure all assignments serialize the data properly:

foreign_metadata=json.dumps(get_foreign_metadata_dict(data_item))

@@ -21,6 +19,7 @@ class Data(Base):
raw_data_location = Column(String)
owner_id = Column(UUID, index=True)
content_hash = Column(String)
foreign_metadata = Column(JSON)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codebase verification

Consider adding schema validation for foreign_metadata - Confirmed

The codebase shows that foreign_metadata follows an implicit schema {"metadata": dict, "origin": str} but lacks validation. Current implementation silently accepts invalid data, which could lead to inconsistencies. Recommended changes:

  • Add schema validation using Pydantic or similar
  • Document the expected format in the model's docstring
  • Replace the warning with proper error handling for invalid metadata
🔗 Analysis chain

Consider adding schema validation for foreign_metadata.

While using JSON type provides flexibility, consider:

  1. Adding schema validation to ensure consistent metadata structure
  2. Documenting the expected format
  3. Adding migration strategy for existing metadata

Let's check for any existing schema validation or documentation:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for schema validation or documentation
rg -g '!*.pyc' -A 5 'foreign_metadata.*schema|metadata.*schema'

# Check for migration files
fd -g 'migrations/*.py' -x cat {}

Length of output: 2493


Script:

#!/bin/bash
# Get the complete Data.py file
cat cognee/modules/data/models/Data.py

# Search for any validation logic or usage of foreign_metadata
rg -g '!*.pyc' -A 5 'foreign_metadata'

# Look for any documentation files
fd -e md -e rst -x grep -l -i "metadata" {} \;

Length of output: 9265

Comment on lines +8 to +13
if "llama_index" in str(type(data_item)):
# Dynamic import is used because the llama_index module is optional.
from .transform_data import get_data_from_llama_index

file_path = get_data_from_llama_index(data_item, dataset_name)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Add error handling for llama_index integration.

The dynamic import and processing of llama_index data should include proper error handling:

  1. Import failures
  2. Data processing errors
  3. Invalid llama_index data
     if "llama_index" in str(type(data_item)):
-        # Dynamic import is used because the llama_index module is optional.
-        from .transform_data import get_data_from_llama_index
+        try:
+            # Dynamic import is used because the llama_index module is optional.
+            from .transform_data import get_data_from_llama_index
+        except ImportError as e:
+            raise IngestionError(f"LlamaIndex support not available: {str(e)}")
+        
+        try:
+            file_path = get_data_from_llama_index(data_item, dataset_name)
+        except Exception as e:
+            raise IngestionError(f"Failed to process LlamaIndex data: {str(e)}")
-        file_path = get_data_from_llama_index(data_item, dataset_name)

Committable suggestion skipped: line range outside the PR's diff.

@@ -20,7 +20,7 @@ def test_PdfDocument():
id=uuid.uuid4(),
name="Test document.pdf",
raw_data_location=test_file_path,
metadata_id=uuid.uuid4(),
foreign_metadata="",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Enhance test coverage for foreign_metadata scenarios.

Current test uses an empty string for foreign_metadata, but should include tests for:

  1. Valid JSON metadata
  2. Different metadata structures
  3. Edge cases (null, invalid JSON)

Example test cases to add:

def test_PdfDocument_with_metadata():
    metadata_cases = [
        {"source": "llama_index", "metadata": {"key": "value"}},
        {"source": "external", "tags": ["test", "pdf"]},
        None,
        {"invalid": float('inf')},  # Should handle JSON serialization edge cases
    ]
    
    for metadata in metadata_cases:
        document = PdfDocument(
            id=uuid.uuid4(),
            name="Test document.pdf",
            raw_data_location=test_file_path,
            foreign_metadata=metadata,
            mime_type="",
        )
        # Add assertions for metadata handling

@@ -32,7 +32,7 @@ def test_TextDocument(input_file, chunk_size):
id=uuid.uuid4(),
name=input_file,
raw_data_location=test_file_path,
metadata_id=uuid.uuid4(),
foreign_metadata="",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Add test cases for foreign_metadata handling.

The test currently initializes foreign_metadata with an empty string but doesn't verify how the document handles various foreign metadata scenarios. Consider adding test cases for:

  • Valid JSON metadata
  • Empty metadata
  • Invalid JSON metadata

Example test case:

@pytest.mark.parametrize(
    "foreign_metadata,expected_result",
    [
        ('{"source": "external", "tags": ["test"]}', True),
        ("", True),
        ("invalid json", False),
    ],
)
def test_TextDocument_foreign_metadata(foreign_metadata, expected_result):
    document = TextDocument(
        id=uuid.uuid4(),
        name="test.txt",
        raw_data_location="test_path",
        foreign_metadata=foreign_metadata,
        mime_type=""
    )
    assert hasattr(document, "foreign_metadata") == expected_result

@@ -1,4 +1,5 @@
from cognee.modules.data.models import Data
import json
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Add error handling for JSON serialization and remove unnecessary indentation.

The current implementation has two potential issues:

  1. No error handling for JSON serialization failures
  2. Unnecessary overhead from pretty-printing JSON with indent=4

Consider this implementation:

 async def classify_documents(data_documents: list[Data]) -> list[Document]:
     documents = []
     for data_item in data_documents:
+        try:
+            metadata_json = json.dumps(data_item.foreign_metadata)
+        except (TypeError, ValueError) as e:
+            logger.error(f"Failed to serialize foreign_metadata for {data_item.id}: {e}")
+            metadata_json = "{}"
+
         document = EXTENSION_TO_DOCUMENT_CLASS[data_item.extension](
             id=data_item.id,
             title=f"{data_item.name}.{data_item.extension}",
             raw_data_location=data_item.raw_data_location,
             name=data_item.name,
             mime_type=data_item.mime_type,
-            foreign_metadata=json.dumps(data_item.foreign_metadata, indent=4),
+            foreign_metadata=metadata_json,
         )

Also applies to: 61-61

Comment on lines +96 to 99
_metadata={"index_fields": ["text"]},
)
except Exception as e:
print(e)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Improve error handling in DocumentChunk creation.

Replace print statements with proper logging:

-            except Exception as e:
-                print(e)
+            except Exception as e:
+                logger.error("Failed to create DocumentChunk: %s", str(e))

Also consider:

  1. Adding more specific exception types
  2. Including context information in the error message
  3. Potentially propagating the error up the stack

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +98 to 100
except litellm.exceptions.BadRequestError:
raise EmbeddingException("Failed to index data points.")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Enhance error handling granularity.

The current implementation might miss important OpenAI-specific errors. Consider:

  1. Restoring specific handling for OpenAIError
  2. Adding specific error messages for different failure scenarios
-        except litellm.exceptions.BadRequestError:
-            raise EmbeddingException("Failed to index data points.")
+        except litellm.exceptions.BadRequestError as e:
+            raise EmbeddingException(f"Bad request while indexing data points: {str(e)}")
+        except litellm.llms.OpenAI.openai.OpenAIError as e:
+            raise EmbeddingException(f"OpenAI service error: {str(e)}")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
except litellm.exceptions.BadRequestError:
raise EmbeddingException("Failed to index data points.")
except litellm.exceptions.BadRequestError as e:
raise EmbeddingException(f"Bad request while indexing data points: {str(e)}")
except litellm.llms.OpenAI.openai.OpenAIError as e:
raise EmbeddingException(f"OpenAI service error: {str(e)}")

@Vasilije1990 Vasilije1990 self-requested a review January 21, 2025 13:02
dexters1 and others added 3 commits January 21, 2025 18:20
Rename foreign metadata to external metadata for metadata coming outside of Cognee
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
cognee/tasks/ingestion/ingest_data.py (1)

98-98: ⚠️ Potential issue

Fix tuple assignment to external_metadata.

The trailing comma creates an unintended tuple.

-data_point.external_metadata = (get_external_metadata_dict(data_item),)
+data_point.external_metadata = get_external_metadata_dict(data_item)
🧹 Nitpick comments (3)
cognee/tasks/ingestion/ingest_data.py (3)

1-18: Consider reorganizing imports for better readability.

Consider organizing imports into standard library, third-party, and local application imports with a blank line between each group:

 from typing import Any, List, Union, BinaryIO
 import inspect
 import warnings
 
 import dlt
 
 import cognee.modules.ingestion as ingestion
 from cognee.infrastructure.databases.relational import get_relational_engine
 from cognee.modules.data.methods import create_dataset
 from cognee.modules.data.models.DatasetData import DatasetData
 from cognee.modules.users.models import User
 from cognee.modules.users.permissions.methods import give_permission_on_document
 from cognee.shared.utils import send_telemetry
 from .get_dlt_destination import get_dlt_destination
 from .save_data_item_to_storage import save_data_item_to_storage

29-36: Enhance the helper function with documentation and improved error handling.

The function could benefit from:

  1. A docstring explaining its purpose and return value
  2. More specific type hints
  3. A more informative warning message
-def get_external_metadata_dict(data_item: Union[BinaryIO, str, Any]) -> dict[str, Any]:
+def get_external_metadata_dict(data_item: Any) -> dict[str, Any]:
+    """Extract metadata from a data item.
+    
+    Args:
+        data_item: Any object that might have a dict method for metadata extraction
+        
+    Returns:
+        dict: A dictionary containing metadata and origin information, or empty dict if extraction fails
+    """
     if hasattr(data_item, "dict") and inspect.ismethod(getattr(data_item, "dict")):
         return {"metadata": data_item.dict(), "origin": str(type(data_item))}
     else:
         warnings.warn(
-            f"Data of type {type(data_item)}... does not have dict method. Returning empty metadata."
+            f"Unable to extract metadata from {type(data_item).__name__}: object does not implement dict() method"
         )
         return {}

137-154: Add error handling for pipeline execution and extract configuration.

The pipeline configuration should be handled more robustly:

+def _get_pipeline_config(dialect_name: str) -> tuple[str, str]:
+    """Get pipeline configuration based on database dialect.
+    
+    Returns:
+        tuple: (dataset_name, table_name)
+    """
+    if dialect_name == "sqlite":
+        return "main", "file_metadata"
+    return "public", "file_metadata"

 if db_engine.engine.dialect.name == "sqlite":
-    # To use sqlite with dlt dataset_name must be set to "main".
-    # Sqlite doesn't support schemas
-    run_info = pipeline.run(
-        data_resources(file_paths, user),
-        table_name="file_metadata",
-        dataset_name="main",
-        write_disposition="merge",
-    )
-else:
-    # Data should be stored in the same schema to allow deduplication
-    run_info = pipeline.run(
-        data_resources(file_paths, user),
-        table_name="file_metadata",
-        dataset_name="public",
-        write_disposition="merge",
-    )
+    dataset_name, table_name = _get_pipeline_config(db_engine.engine.dialect.name)
+    try:
+        run_info = pipeline.run(
+            data_resources(file_paths, user),
+            table_name=table_name,
+            dataset_name=dataset_name,
+            write_disposition="merge",
+        )
+    except Exception as e:
+        raise RuntimeError(f"Pipeline execution failed: {str(e)}") from e
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 655ab0b and 40c0279.

📒 Files selected for processing (9)
  • cognee/modules/data/models/Data.py (2 hunks)
  • cognee/modules/data/processing/document_types/Document.py (1 hunks)
  • cognee/tasks/documents/classify_documents.py (2 hunks)
  • cognee/tasks/ingestion/ingest_data.py (1 hunks)
  • cognee/tests/integration/documents/AudioDocument_test.py (1 hunks)
  • cognee/tests/integration/documents/ImageDocument_test.py (1 hunks)
  • cognee/tests/integration/documents/PdfDocument_test.py (1 hunks)
  • cognee/tests/integration/documents/TextDocument_test.py (1 hunks)
  • cognee/tests/integration/documents/UnstructuredDocument_test.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
  • cognee/modules/data/processing/document_types/Document.py
  • cognee/tests/integration/documents/ImageDocument_test.py
  • cognee/tests/integration/documents/AudioDocument_test.py
  • cognee/modules/data/models/Data.py
  • cognee/tests/integration/documents/TextDocument_test.py
  • cognee/tasks/documents/classify_documents.py
  • cognee/tests/integration/documents/UnstructuredDocument_test.py
  • cognee/tests/integration/documents/PdfDocument_test.py
⏰ Context from checks skipped due to timeout of 90000ms (19)
  • GitHub Check: run_multimedia_example_test / test
  • GitHub Check: run_notebook_test / test
  • GitHub Check: run_notebook_test / test
  • GitHub Check: run_notebook_test / test
  • GitHub Check: run_notebook_test / test
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: run_simple_example_test / test
  • GitHub Check: run_dynamic_steps_example_test / test
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: windows-latest
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: profiler
  • GitHub Check: docker-compose-test

Comment on lines +55 to +128
data = [data]

async with db_engine.get_async_session() as session:
# Read metadata stored with dlt
files_metadata = await db_engine.get_all_data_from_table(table_name, dataset_name)
for file_metadata in files_metadata:
from sqlalchemy import select
from cognee.modules.data.models import Data
file_paths = []

dataset = await create_dataset(dataset_name, user.id, session)
# Process data
for data_item in data:
file_path = await save_data_item_to_storage(data_item, dataset_name)

data = (
await session.execute(select(Data).filter(Data.id == UUID(file_metadata["id"])))
).scalar_one_or_none()
file_paths.append(file_path)

if data is not None:
data.name = file_metadata["name"]
data.raw_data_location = file_metadata["file_path"]
data.extension = file_metadata["extension"]
data.mime_type = file_metadata["mime_type"]
# Ingest data and add metadata
with open(file_path.replace("file://", ""), mode="rb") as file:
classified_data = ingestion.classify(file)

# data_id is the hash of file contents + owner id to avoid duplicate data
data_id = ingestion.identify(classified_data, user)

file_metadata = classified_data.get_metadata()

from sqlalchemy import select

from cognee.modules.data.models import Data

db_engine = get_relational_engine()

async with db_engine.get_async_session() as session:
dataset = await create_dataset(dataset_name, user.id, session)

# Check to see if data should be updated
data_point = (
await session.execute(select(Data).filter(Data.id == data_id))
).scalar_one_or_none()

if data_point is not None:
data_point.name = file_metadata["name"]
data_point.raw_data_location = file_metadata["file_path"]
data_point.extension = file_metadata["extension"]
data_point.mime_type = file_metadata["mime_type"]
data_point.owner_id = user.id
data_point.content_hash = file_metadata["content_hash"]
data_point.external_metadata = (get_external_metadata_dict(data_item),)
await session.merge(data_point)
else:
data_point = Data(
id=data_id,
name=file_metadata["name"],
raw_data_location=file_metadata["file_path"],
extension=file_metadata["extension"],
mime_type=file_metadata["mime_type"],
owner_id=user.id,
content_hash=file_metadata["content_hash"],
external_metadata=get_external_metadata_dict(data_item),
)

# Check if data is already in dataset
dataset_data = (
await session.execute(
select(DatasetData).filter(
DatasetData.data_id == data_id, DatasetData.dataset_id == dataset.id
)
)
).scalar_one_or_none()
# If data is not present in dataset add it
if dataset_data is None:
dataset.data.append(data_point)

await session.merge(data)
await session.commit()
else:
data = Data(
id=UUID(file_metadata["id"]),
name=file_metadata["name"],
raw_data_location=file_metadata["file_path"],
extension=file_metadata["extension"],
mime_type=file_metadata["mime_type"],
)

dataset.data.append(data)
await session.commit()

await give_permission_on_document(user, UUID(file_metadata["id"]), "read")
await give_permission_on_document(user, UUID(file_metadata["id"]), "write")
await give_permission_on_document(user, data_id, "read")
await give_permission_on_document(user, data_id, "write")
return file_paths
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Refactor long function into smaller, focused functions.

The data_storing function is too complex and handles multiple responsibilities. Consider breaking it down into smaller functions:

  1. Extract database operations:
async def _update_data_point(data_point: Data, file_metadata: dict, user: User, data_item: Any) -> None:
    data_point.name = file_metadata["name"]
    data_point.raw_data_location = file_metadata["file_path"]
    data_point.extension = file_metadata["extension"]
    data_point.mime_type = file_metadata["mime_type"]
    data_point.owner_id = user.id
    data_point.content_hash = file_metadata["content_hash"]
    data_point.external_metadata = get_external_metadata_dict(data_item)

async def _create_data_point(data_id: str, file_metadata: dict, user: User, data_item: Any) -> Data:
    return Data(
        id=data_id,
        name=file_metadata["name"],
        raw_data_location=file_metadata["file_path"],
        extension=file_metadata["extension"],
        mime_type=file_metadata["mime_type"],
        owner_id=user.id,
        content_hash=file_metadata["content_hash"],
        external_metadata=get_external_metadata_dict(data_item)
    )
  1. Add error handling:
 async def data_storing(data: Any, dataset_name: str, user: User):
     if not isinstance(data, list):
         data = [data]

     file_paths = []

     for data_item in data:
-        file_path = await save_data_item_to_storage(data_item, dataset_name)
-        file_paths.append(file_path)
+        try:
+            file_path = await save_data_item_to_storage(data_item, dataset_name)
+            file_paths.append(file_path)
+        except Exception as e:
+            raise RuntimeError(f"Failed to save data item to storage: {str(e)}") from e

         async with db_engine.get_async_session() as session:
+            try:
                 dataset = await create_dataset(dataset_name, user.id, session)
                 # ... rest of the database operations ...
                 await session.commit()
+            except Exception as e:
+                await session.rollback()
+                raise RuntimeError(f"Database operation failed: {str(e)}") from e

Comment on lines +38 to 53
@dlt.resource(standalone=True, primary_key="id", merge_key="id")
async def data_resources(file_paths: List[str], user: User):
for file_path in file_paths:
with open(file_path.replace("file://", ""), mode="rb") as file:
classified_data = ingestion.classify(file)
data_id = ingestion.identify(classified_data)
data_id = ingestion.identify(classified_data, user)
file_metadata = classified_data.get_metadata()
yield {
"id": data_id,
"name": file_metadata["name"],
"file_path": file_metadata["file_path"],
"extension": file_metadata["extension"],
"mime_type": file_metadata["mime_type"],
"content_hash": file_metadata["content_hash"],
"owner_id": str(user.id),
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Add error handling for file operations.

The function should handle potential file operation errors and validate inputs:

 @dlt.resource(standalone=True, primary_key="id", merge_key="id")
 async def data_resources(file_paths: List[str], user: User):
+    if not file_paths:
+        raise ValueError("file_paths cannot be empty")
+
     for file_path in file_paths:
-        with open(file_path.replace("file://", ""), mode="rb") as file:
-            classified_data = ingestion.classify(file)
-            data_id = ingestion.identify(classified_data, user)
-            file_metadata = classified_data.get_metadata()
-            yield {
-                "id": data_id,
-                "name": file_metadata["name"],
-                "file_path": file_metadata["file_path"],
-                "extension": file_metadata["extension"],
-                "mime_type": file_metadata["mime_type"],
-                "content_hash": file_metadata["content_hash"],
-                "owner_id": str(user.id),
-            }
+        try:
+            with open(file_path.replace("file://", ""), mode="rb") as file:
+                classified_data = ingestion.classify(file)
+                data_id = ingestion.identify(classified_data, user)
+                file_metadata = classified_data.get_metadata()
+                yield {
+                    "id": data_id,
+                    "name": file_metadata["name"],
+                    "file_path": file_metadata["file_path"],
+                    "extension": file_metadata["extension"],
+                    "mime_type": file_metadata["mime_type"],
+                    "content_hash": file_metadata["content_hash"],
+                    "owner_id": str(user.id),
+                }
+        except FileNotFoundError as e:
+            raise ValueError(f"File not found: {file_path}") from e
+        except Exception as e:
+            raise RuntimeError(f"Error processing file {file_path}: {str(e)}") from e
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
@dlt.resource(standalone=True, primary_key="id", merge_key="id")
async def data_resources(file_paths: List[str], user: User):
for file_path in file_paths:
with open(file_path.replace("file://", ""), mode="rb") as file:
classified_data = ingestion.classify(file)
data_id = ingestion.identify(classified_data)
data_id = ingestion.identify(classified_data, user)
file_metadata = classified_data.get_metadata()
yield {
"id": data_id,
"name": file_metadata["name"],
"file_path": file_metadata["file_path"],
"extension": file_metadata["extension"],
"mime_type": file_metadata["mime_type"],
"content_hash": file_metadata["content_hash"],
"owner_id": str(user.id),
}
@dlt.resource(standalone=True, primary_key="id", merge_key="id")
async def data_resources(file_paths: List[str], user: User):
if not file_paths:
raise ValueError("file_paths cannot be empty")
for file_path in file_paths:
try:
with open(file_path.replace("file://", ""), mode="rb") as file:
classified_data = ingestion.classify(file)
data_id = ingestion.identify(classified_data, user)
file_metadata = classified_data.get_metadata()
yield {
"id": data_id,
"name": file_metadata["name"],
"file_path": file_metadata["file_path"],
"extension": file_metadata["extension"],
"mime_type": file_metadata["mime_type"],
"content_hash": file_metadata["content_hash"],
"owner_id": str(user.id),
}
except FileNotFoundError as e:
raise ValueError(f"File not found: {file_path}") from e
except Exception as e:
raise RuntimeError(f"Error processing file {file_path}: {str(e)}") from e

@Vasilije1990 Vasilije1990 merged commit 90657a2 into dev Jan 23, 2025
28 checks passed
@Vasilije1990 Vasilije1990 deleted the COG-793-metadata-rework branch January 23, 2025 10:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants