feat: outsources chunking parameters to extract chunk from documents … #289
Conversation
Walkthrough: The changes involve modifications to several document processing classes.
Actionable comments posted: 6
🧹 Outside diff range and nitpick comments (4)

cognee/modules/data/processing/document_types/ChunkerMapping.py (1)

4-6: Consider making the chunker mapping immutable

The class-level dictionary could be modified at runtime. Consider using a frozen dictionary or tuple-based mapping for immutability.

```diff
- chunker_mapping = {
-     "text_chunker": TextChunker
- }
+ chunker_mapping = frozendict({
+     "text_chunker": TextChunker
+ })
```

cognee/modules/data/processing/document_types/PdfDocument.py (1)

8-8: Add type hints and update docstring

The method signature should include type hints for better code maintainability and IDE support.

```diff
- def read(self, chunk_size: int, chunker: str):
+ def read(self, chunk_size: int, chunker: str) -> Generator[str, None, None]:
+     """Read and chunk PDF document content.
+
+     Args:
+         chunk_size (int): Size of each chunk
+         chunker (str): Type of chunker to use (e.g., "text", "semantic")
+
+     Yields:
+         str: Document chunks
+
+     Raises:
+         ValueError: If chunker type is invalid
+     """
```

cognee/modules/data/processing/document_types/TextDocument.py (1)

Line range hint 7-15: Ensure proper file handle cleanup

The file handle should be properly closed even if an error occurs during reading.

```diff
  def read(self, chunk_size: int, chunker: str):
      def get_text():
-         with open(self.raw_data_location, mode = "r", encoding = "utf-8") as file:
-             while True:
-                 text = file.read(1024)
-
-                 if len(text.strip()) == 0:
-                     break
-
-                 yield text
+         BUFFER_SIZE = 1024 * 4  # 4KB buffer for better performance
+         try:
+             with open(self.raw_data_location, mode="r", encoding="utf-8") as file:
+                 while True:
+                     text = file.read(BUFFER_SIZE)
+
+                     if not text:
+                         break
+
+                     yield text
+         except IOError as e:
+             raise RuntimeError(f"Failed to read text file: {e}") from e
```

cognee/modules/data/processing/document_types/AudioDocument.py (1)

1-1: Consider architectural improvements for better maintainability

The current implementation has several areas for improvement:

- Move common chunker initialization logic to the base Document class to reduce code duplication
- Add a configuration validation system to verify chunker types and parameters at startup
- Implement a progress monitoring system for long-running operations like transcription
- Consider using the Template Method pattern for the read method

Would you like me to provide example implementations for any of these improvements?
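The Template Method suggestion can be sketched roughly as follows. This is illustrative only: `SimpleChunker` and `InMemoryTextDocument` are stand-ins invented for the example, not cognee's actual `TextChunker` or document classes.

```python
from abc import ABC, abstractmethod


class SimpleChunker:
    """Stand-in chunker: yields fixed-size slices of the streamed text."""

    def __init__(self, chunk_size, get_text):
        self.chunk_size = chunk_size
        self.get_text = get_text

    def read(self):
        buffer = ""
        for text in self.get_text():
            buffer += text
            while len(buffer) >= self.chunk_size:
                yield buffer[: self.chunk_size]
                buffer = buffer[self.chunk_size :]
        if buffer:
            yield buffer


class Document(ABC):
    def read(self, chunk_size: int, chunker: str = "text_chunker"):
        # Template method: the chunking skeleton is fixed here; real code
        # would resolve `chunker` through the chunker mapping instead of
        # hardcoding SimpleChunker.
        yield from SimpleChunker(chunk_size, self.get_text).read()

    @abstractmethod
    def get_text(self):
        """Each subclass yields raw text: file contents, transcription, OCR."""


class InMemoryTextDocument(Document):
    def __init__(self, content: str):
        self.content = content

    def get_text(self):
        yield self.content
```

With this shape, subclasses only override the variable step (`get_text`), so the chunker lookup and iteration logic exist in exactly one place.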
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (7)

- cognee/modules/data/processing/document_types/AudioDocument.py (2 hunks)
- cognee/modules/data/processing/document_types/ChunkerMapping.py (1 hunks)
- cognee/modules/data/processing/document_types/Document.py (1 hunks)
- cognee/modules/data/processing/document_types/ImageDocument.py (2 hunks)
- cognee/modules/data/processing/document_types/PdfDocument.py (1 hunks)
- cognee/modules/data/processing/document_types/TextDocument.py (2 hunks)
- cognee/tasks/documents/extract_chunks_from_documents.py (1 hunks)
🔇 Additional comments (4)
cognee/modules/data/processing/document_types/ChunkerMapping.py (1)
9-15: Add return type hint to get_chunker method

The method should specify its return type for better type safety.

```diff
- def get_chunker(cls, chunker_name: str):
+ def get_chunker(cls, chunker_name: str) -> type[TextChunker]:
```

Also, let's verify if there are other chunker implementations that should be included in the mapping.
✅ Verification successful

Let's gather more information about the chunker implementations to verify the return type and available chunkers.

Add return type hint to get_chunker method

The return type hint should be `type[TextChunker]`, as verified from the codebase:

- The chunker_mapping only contains TextChunker
- TextChunker is the base class for the chunking implementation
- No other chunker implementations were found in the codebase

```diff
- def get_chunker(cls, chunker_name: str):
+ def get_chunker(cls, chunker_name: str) -> type[TextChunker]:
```
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

```shell
#!/bin/bash
# Search for other chunker implementations
echo "Searching for potential chunker classes..."
rg -l "class.*Chunker" --type py
```

Length of output: 232

Script:

```shell
#!/bin/bash
# Check the content of both files to understand the chunker hierarchy and implementations
echo "=== Content of TextChunker.py ==="
cat cognee/modules/chunking/TextChunker.py
echo -e "\n=== Content of ChunkerMapping.py ==="
cat cognee/modules/data/processing/document_types/ChunkerMapping.py
```

Length of output: 4505
cognee/tasks/documents/extract_chunks_from_documents.py (1)
4-7: 🛠️ Refactor suggestion
Add type hints and update documentation
The function needs proper type hints and documentation for the new parameter.
```diff
-async def extract_chunks_from_documents(documents: list[Document], chunk_size: int = 1024, chunker = 'text_chunker'):
+async def extract_chunks_from_documents(
+    documents: list[Document],
+    chunk_size: int = 1024,
+    chunker: str = 'text_chunker'
+) -> AsyncIterator[str]:
+    """Extract chunks from multiple documents.
+
+    Args:
+        documents: List of documents to process
+        chunk_size: Size of each chunk
+        chunker: Type of chunker to use (default: "text_chunker")
+
+    Yields:
+        Chunks of document content
+    """
```
Let's verify if all document types support the new chunker parameter.
✅ Verification successful

Let me verify the implementation details of the read method in these document types to ensure they properly handle the chunker parameter.

Let me correct the paths and try again.

Let me try one more time with ripgrep to see the actual implementations.

Type hints and documentation changes approved - all document types support the chunker parameter

The verification shows that:

- All document types (Image, Audio, Text, PDF) implement the read method with the chunker parameter
- They all use the parameter correctly by passing it to ChunkerConfig.get_chunker
- The base Document class declares the parameter in its interface

The suggested type hints and documentation improvements are valid and should be implemented.
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

```shell
#!/bin/bash
# Check if all document types have been updated
echo "Checking document type implementations..."
rg -l "class.*Document.*\(.*\):" --type py | xargs rg "def read.*chunker.*:"
```

Length of output: 741

Script:

```shell
#!/bin/bash
# Check the implementation of read method in document types
echo "Document.py base class implementation:"
ast-grep --pattern 'def read($$$) { $$$ }' cognee/modules/data/processing/document_types/Document.py
echo -e "\nConcrete implementations:"
for file in ImageDocument.py AudioDocument.py TextDocument.py PdfDocument.py; do
  echo -e "\n$file implementation:"
  ast-grep --pattern 'def read($$$) { $$$ }' cognee/modules/data/processing/document_types/document_types/$file
done
```

Length of output: 1876

Script:

```shell
#!/bin/bash
# Check the implementation of read method in document types
echo "Document.py base class implementation:"
ast-grep --pattern 'def read($$$) { $$$ }' cognee/modules/data/processing/document_types/Document.py
echo -e "\nConcrete implementations:"
for file in ImageDocument AudioDocument TextDocument PdfDocument; do
  echo -e "\n$file implementation:"
  ast-grep --pattern 'def read($$$) { $$$ }' cognee/modules/data/processing/document_types/${file}.py
done
```

Length of output: 1247

Script:

```shell
#!/bin/bash
# Check implementations with context
echo "Document.py base class implementation:"
rg -A 5 "def read.*chunker.*:" cognee/modules/data/processing/document_types/Document.py
echo -e "\nConcrete implementations:"
for file in ImageDocument AudioDocument TextDocument PdfDocument; do
  echo -e "\n${file}.py implementation:"
  rg -A 5 "def read.*chunker.*:" cognee/modules/data/processing/document_types/${file}.py
done
```

Length of output: 2195
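To illustrate how the suggested async-generator signature behaves end to end, here is a minimal, self-contained sketch. `FakeDocument` and its `read` body are placeholders invented for the example; they only mimic the interface that the verification above confirmed, and are not cognee's real classes.

```python
import asyncio
from typing import AsyncIterator


class FakeDocument:
    """Placeholder document: mimics the read(chunk_size, chunker) interface."""

    def __init__(self, text: str):
        self.text = text

    def read(self, chunk_size: int, chunker: str):
        # Naive fixed-size slicing, standing in for a real chunker class.
        for i in range(0, len(self.text), chunk_size):
            yield self.text[i : i + chunk_size]


async def extract_chunks_from_documents(
    documents, chunk_size: int = 1024, chunker: str = "text_chunker"
) -> AsyncIterator[str]:
    # Async generator: chunks from all documents are flattened into one stream.
    for document in documents:
        for chunk in document.read(chunk_size=chunk_size, chunker=chunker):
            yield chunk


async def main():
    docs = [FakeDocument("hello world"), FakeDocument("foo")]
    return [chunk async for chunk in extract_chunks_from_documents(docs, chunk_size=5)]


chunks = asyncio.run(main())
```

Callers consume the function with `async for`, which is why the `AsyncIterator[str]` annotation in the review suggestion is the natural fit.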
cognee/modules/data/processing/document_types/ImageDocument.py (2)
3-3: LGTM: Import changes align with new chunking architecture
The import change supports the transition to configurable chunking strategies.
17-18: Verify error handling for invalid chunker types
The implementation looks clean, but we should ensure proper error handling for:
- Invalid chunker types
- Chunker instantiation failures
Let's check the error handling implementation:
✅ Verification successful
Let me gather more specific information about the ChunkerConfig implementation to verify error handling.
Let me try a different approach to locate the chunker implementation and error handling.
Based on the search results, I can now see that ChunkerConfig.get_chunker is implemented in ChunkerMapping.py and it does have proper error handling. Let me verify the complete implementation of ChunkerMapping.py.
Based on the implementation found, I can now provide the final response:
Error handling for chunker types is properly implemented
The ChunkerConfig implementation has appropriate error handling:
- Invalid chunker types raise a NotImplementedError with a descriptive message listing available options
- Chunker instantiation is handled through a clean factory pattern with type safety
The implementation is sufficient for both validation points mentioned in the review:
- Invalid chunker types ✓ (handled via NotImplementedError)
- Chunker instantiation failures ✓ (handled via type-safe factory pattern)
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

```shell
#!/bin/bash
# Check ChunkerConfig implementation for error handling
ast-grep --pattern 'class ChunkerConfig {
  $$$
  get_chunker($_) {
    $$$
  }
}'
# Look for any error handling tests
rg -l "test.*chunker" | xargs rg "invalid|error|exception"
```

Length of output: 30859

Script:

```shell
#!/bin/bash
# Search for ChunkerConfig implementation
ast-grep --pattern 'class ChunkerConfig {
  $$$
}'
# Look for chunker-related tests
rg -A 5 "test.*chunker" "cognee/tests"
# Search for chunker-related exceptions
rg "class.*Error.*chunker" -A 3
```

Length of output: 125

Script:

```shell
#!/bin/bash
# Search for chunker-related files
fd chunker
# Look for get_chunker method implementation
rg -A 10 "get_chunker"
# Search for chunker-related error handling
rg -A 5 "raise.*chunker"
```

Length of output: 2892

Script:

```shell
#!/bin/bash
# Get complete implementation of ChunkerMapping
cat cognee/modules/data/processing/document_types/ChunkerMapping.py
# Look for any chunker-related exceptions in the codebase
rg "class.*Error" cognee/modules/chunking/
```

Length of output: 620
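The factory-with-error-handling pattern that the verification describes (invalid names raising `NotImplementedError` with the available options listed) can be sketched like this. The `TextChunker` body here is a placeholder, not cognee's actual class; only the lookup-and-raise shape is the point.

```python
class TextChunker:
    """Placeholder for cognee's TextChunker."""


class ChunkerConfig:
    chunker_mapping = {"text_chunker": TextChunker}

    @classmethod
    def get_chunker(cls, chunker_name: str) -> type[TextChunker]:
        # Invalid names fail fast with a descriptive message listing options.
        if chunker_name not in cls.chunker_mapping:
            raise NotImplementedError(
                f"Chunker '{chunker_name}' is not implemented. "
                f"Available chunkers: {list(cls.chunker_mapping)}"
            )
        return cls.chunker_mapping[chunker_name]
```

Because the factory returns the class itself (`type[TextChunker]`), instantiation stays in the caller's hands and the mapping remains trivially extensible.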
Actionable comments posted: 4
🧹 Outside diff range and nitpick comments (1)
cognee/tests/integration/documents/AudioDocument_test.py (1)

Line range hint 25-25: Consider implementing a shared test suite for chunking strategies

The current implementation duplicates chunking validation across different document types. Consider:

- Creating a shared test suite for common chunking behavior
- Defining a standard interface for chunking strategy validation
- Adding integration tests that verify chunking consistency across document types

Example shared test structure:

```python
class BaseChunkerTest:
    def test_basic_chunking(self, document, chunker):
        ...  # Common chunking tests

    def test_invalid_chunker(self, document):
        ...  # Common error cases


class TestPdfDocumentChunking(BaseChunkerTest):
    ...  # PDF-specific tests


class TestTextDocumentChunking(BaseChunkerTest):
    ...  # Text-specific tests


class TestAudioDocumentChunking(BaseChunkerTest):
    ...  # Audio-specific tests
```

Also applies to: 36-36, 34-34
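A runnable version of that shared-suite idea might look as follows. `FakeTextDocument` and the contract methods are illustrative stand-ins (not cognee's actual tests); under pytest, the `test_*` methods on each concrete subclass would be collected automatically.

```python
class FakeTextDocument:
    """Illustrative document whose read() mimics the chunker interface."""

    def __init__(self, text: str):
        self.text = text

    def read(self, chunk_size: int, chunker: str):
        if chunker != "text_chunker":
            raise NotImplementedError(f"Unknown chunker: {chunker}")
        for i in range(0, len(self.text), chunk_size):
            yield self.text[i : i + chunk_size]


class BaseChunkerContract:
    """Shared assertions; subclasses only provide make_document()."""

    def make_document(self):
        raise NotImplementedError

    def test_basic_chunking(self):
        chunks = list(self.make_document().read(chunk_size=4, chunker="text_chunker"))
        assert chunks, "expected at least one chunk"
        assert all(len(chunk) <= 4 for chunk in chunks)

    def test_invalid_chunker(self):
        try:
            list(self.make_document().read(chunk_size=4, chunker="bogus"))
            raised = False
        except NotImplementedError:
            raised = True
        assert raised


class TestTextDocumentChunking(BaseChunkerContract):
    def make_document(self):
        return FakeTextDocument("some sample text")
```

Each new document type then gets contract coverage by defining one `make_document()` override.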
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)

- cognee/modules/data/processing/document_types/Document.py (1 hunks)
- cognee/tests/integration/documents/AudioDocument_test.py (1 hunks)
- cognee/tests/integration/documents/ImageDocument_test.py (1 hunks)
- cognee/tests/integration/documents/PdfDocument_test.py (1 hunks)
- cognee/tests/integration/documents/TextDocument_test.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- cognee/modules/data/processing/document_types/Document.py
🔇 Additional comments (1)
cognee/tests/integration/documents/AudioDocument_test.py (1)
34-34: Consider audio-specific chunking strategies and test coverage
While the test covers basic text chunking of transcribed audio, consider:
- Audio-specific chunking strategies (e.g., time-based chunks, speaker-based chunks)
- Testing the interaction between transcription and chunking
- Verifying that chunking preserves important audio context (e.g., speaker transitions)
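One way a time-based audio chunking strategy could look is sketched below. The `(start, end, text)` segment tuples are an assumed transcript format invented for this example; real transcription output may carry different fields (speaker labels, confidences, etc.).

```python
def chunk_by_time(segments, window_seconds=30.0):
    """Group (start, end, text) transcript segments into time windows."""
    current, window_start = [], None
    for start, end, text in segments:
        if window_start is None:
            window_start = start
        # Close the current window once it would exceed the time budget.
        if end - window_start > window_seconds and current:
            yield " ".join(current)
            current, window_start = [], start
        current.append(text)
    if current:
        yield " ".join(current)


segments = [(0.0, 10.0, "hello"), (10.0, 25.0, "world"), (25.0, 40.0, "again")]
chunks = list(chunk_by_time(segments, window_seconds=30.0))
```

Chunking on time boundaries rather than character counts keeps each chunk aligned with a contiguous stretch of audio, which is what makes speaker-transition tests meaningful.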
What about the UnstructuredDocument class? Do these changes need to be implemented there as well?
If we outsource chunkers like this, then yes, but we'll discuss this with Boris.
* feat: Add error handling in case user is already part of database and permission already given to group Added error handling in case permission is already given to group and user is already part of group Feature COG-656 * feat: Add user verification for accessing data Verify user has access to data before returning it Feature COG-656 * feat: Add compute search to cognee Add compute search to cognee which makes searches human readable Feature COG-656 * feat: Add simple instruction for system prompt Add simple instruction for system prompt Feature COG-656 * pass pydantic model tocognify * feat: Add unauth access error to getting data Raise unauth access error when trying to read data without access Feature COG-656 * refactor: Rename query compute to query completion Rename searching type from compute to completion Refactor COG-656 * chore: Update typo in code Update typo in string in code Chore COG-656 * Add mcp to cognee * Add simple README * Update cognee-mcp/mcpcognee/__main__.py Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * Create dockerhub.yml * Update get_cognify_router.py * fix: Resolve reflection issue when running cognee a second time after pruning data When running cognee a second time after pruning data some metadata doesn't get pruned. This makes cognee believe some tables exist that have been deleted Fix * fix: Add metadata reflection fix to sqlite as well Added fix when reflecting metadata to sqlite as well Fix * update * Revert "fix: Add metadata reflection fix to sqlite as well" This reverts commit 394a0b2. 
* COG-810 Implement a top-down dependency graph builder tool (#268) * feat: parse repo to call graph * Update/repo_processor/top_down_repo_parse.py task * fix: minor improvements * feat: file parsing jedi script optimisation --------- * Add type to DataPoint metadata (#364) * Add type to DataPoint metadata * Add missing index_fields * Use DataPoint UUID type in pgvector create_data_points * Make _metadata mandatory everywhere * Fixes * Fixes to our demo * feat: Add search by dataset for cognee Added ability to search by datasets for cognee users Feature COG-912 * feat: outsources chunking parameters to extract chunk from documents … (#289) * feat: outsources chunking parameters to extract chunk from documents task * fix: Remove backend lock from UI Removed lock that prevented using multiple datasets in cognify Fix COG-912 * COG 870 Remove duplicate edges from the code graph (#293) * feat: turn summarize_code into generator * feat: extract run_code_graph_pipeline, update the pipeline * feat: minimal code graph example * refactor: update argument * refactor: move run_code_graph_pipeline to cognify/code_graph_pipeline * refactor: indentation and whitespace nits * refactor: add deprecated use comments and warnings --------- Co-authored-by: Vasilije <[email protected]> Co-authored-by: Igor Ilic <[email protected]> Co-authored-by: Boris <[email protected]> * test: Added test for getting of documents for search Added test to verify getting documents related to datasets intended for search Test COG-912 * Structured code summarization (#375) * feat: turn summarize_code into generator * feat: extract run_code_graph_pipeline, update the pipeline * feat: minimal code graph example * refactor: update argument * refactor: move run_code_graph_pipeline to cognify/code_graph_pipeline * refactor: indentation and whitespace nits * refactor: add deprecated use comments and warnings * Structured code summarization * add missing prompt file * Remove summarization_model argument from 
summarize_code and fix typehinting * minor refactors --------- Co-authored-by: lxobr <[email protected]> Co-authored-by: Vasilije <[email protected]> Co-authored-by: Igor Ilic <[email protected]> Co-authored-by: Boris <[email protected]> * fix: Resolve issue with cognify router graph model default value Resolve issue with default value for graph model in cognify endpoint Fix * chore: Resolve typo in getting documents code Resolve typo in code chore COG-912 * Update .github/workflows/dockerhub.yml Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * Update .github/workflows/dockerhub.yml * Update .github/workflows/dockerhub.yml * Update .github/workflows/dockerhub.yml * Update get_cognify_router.py * fix: Resolve syntax issue with cognify router Resolve syntax issue with cognify router Fix * feat: Add ruff pre-commit hook for linting and formatting Added formatting and linting on pre-commit hook Feature COG-650 * chore: Update ruff lint options in pyproject file Update ruff lint options in pyproject file Chore * test: Add ruff linter github action Added linting check with ruff in github actions Test COG-650 * feat: deletes executor limit from get_repo_file_dependencies * feat: implements mock feature in LiteLLM engine * refactor: Remove changes to cognify router Remove changes to cognify router Refactor COG-650 * fix: fixing boolean env for github actions * test: Add test for ruff format for cognee code Test if code is formatted for cognee Test COG-650 * refactor: Rename ruff gh actions Rename ruff gh actions to be more understandable Refactor COG-650 * chore: Remove checking of ruff lint and format on push Remove checking of ruff lint and format on push Chore COG-650 * feat: Add deletion of local files when deleting data Delete local files when deleting data from cognee Feature COG-475 * fix: changes back the max workers to 12 * feat: Adds mock summary for codegraph pipeline * refacotr: Add current development status Save 
current development status Refactor * Fix langfuse * Fix langfuse * Fix langfuse * Add evaluation notebook * Rename eval notebook * chore: Add temporary state of development Add temp development state to branch Chore * fix: Add poetry.lock file, make langfuse mandatory Added langfuse as mandatory dependency, added poetry.lock file Fix * Fix: fixes langfuse config settings * feat: Add deletion of local files made by cognee through data endpoint Delete local files made by cognee when deleting data from database through endpoint Feature COG-475 * test: Revert changes on test_pgvector Revert changes on test_pgvector which were made to test deletion of local files Test COG-475 * chore: deletes the old test for the codegraph pipeline * test: Add test to verify deletion of local files Added test that checks local files created by cognee will be deleted and those not created by cognee won't Test COG-475 * chore: deletes unused old version of the codegraph * chore: deletes unused imports from code_graph_pipeline * Ingest non-code files * Fixing review findings * Ingest non-code files (#395) * Ingest non-code files * Fixing review findings * test: Update test regarding message Update assertion message, add veryfing of file existence * Handle retryerrors in code summary (#396) * Handle retryerrors in code summary * Log instead of print * fix: updates the acreate_structured_output * chore: Add logging to sentry when file which should exist can't be found Log to sentry that a file which should exist can't be found Chore COG-475 * Fix diagram * fix: refactor mcp * Add Smithery CLI installation instructions and badge * Move readme * Update README.md * Update README.md * Cog 813 source code chunks (#383) * fix: pass the list of all CodeFiles to enrichment task * feat: introduce SourceCodeChunk, update metadata * feat: get_source_code_chunks code graph pipeline task * feat: integrate get_source_code_chunks task, comment out summarize_code * Fix code summarization (#387) * feat: 
update data models * feat: naive parse long strings in source code * fix: get_non_py_files instead of get_non_code_files * fix: limit recursion, add comment * handle embedding empty input error (#398) * feat: robustly handle CodeFile source code * refactor: sort imports * todo: add support for other embedding models * feat: add custom logger * feat: add robustness to get_source_code_chunks Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * feat: improve embedding exceptions * refactor: format indents, rename module --------- Co-authored-by: alekszievr <[email protected]> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * Fix diagram * Fix instructions * adding and fixing files * Update README.md * ruff format * Fix linter issues * Implement PR review * Comment out profiling * fix: add allowed extensions * fix: adhere UnstructuredDocument.read() to Document * feat: time code graph run and add mock support * Fix ollama, work on visualization * fix: Fixes faulty logging format and sets up error logging in dynamic steps example * Overcome ContextWindowExceededError by checking token count while chunking (#413) * fix: Fixes duplicated edges in cognify by limiting the recursion depth in add datapoints * Adjust AudioDocument and handle None token limit * Handle azure models as well * Add clean logging to code graph example * Remove setting envvars from arg * fix: fixes create_cognee_style_network_with_logo unit test * fix: removes accidental remained print * Get embedding engine instead of passing it. Get it from vector engine instead of direct getter. * Fix visualization * Get embedding engine instead of passing it in code chunking. 
* Fix poetry issues * chore: Update version of poetry install action * chore: Update action to trigger on pull request for any branch * chore: Remove if in github action to allow triggering on push * chore: Remove if condition to allow gh actions to trigger on push to PR * chore: Update poetry version in github actions * chore: Set fixed ubuntu version to 22.04 * chore: Update py lint to use ubuntu 22.04 * chore: update ubuntu version to 22.04 * feat: implements the first version of graph based completion in search * chore: Update python 3.9 gh action to use 3.12 instead * chore: Update formatting of utils.py * Fix poetry issues * Adjust integration tests * fix: Fixes ruff formatting * Handle circular import * fix: Resolve profiler issue with partial and recursive logger imports Resolve issue for profiler with partial and recursive logger imports * fix: Remove logger from __init__.py file * test: Test profiling on HEAD branch * test: Return profiler to base branch * Set max_tokens in config * Adjust SWE-bench script to code graph pipeline call * Adjust SWE-bench script to code graph pipeline call * fix: Add fix for accessing dictionary elements that don't exits Using get for the text key instead of direct access to handle situation if the text key doesn't exist * feat: Add ability to change graph database configuration through cognee * feat: adds pydantic types to graph layer models * feat: adds basic retriever for swe bench * Match Ruff version in config to the one in github actions * feat: implements code retreiver * Fix: fixes unit test for codepart search * Format with Ruff 0.9.0 * Fix: deleting incorrect repo path * fix: resolve issue with langfuse dependency installation when integrating cognee in different packages * version: Increase version to 0.1.21 --------- Co-authored-by: Igor Ilic <[email protected]> Co-authored-by: Vasilije <[email protected]> Co-authored-by: Igor Ilic <[email protected]> Co-authored-by: Rita Aleksziev <[email protected]> 
Co-authored-by: vasilije <[email protected]> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-authored-by: lxobr <[email protected]> Co-authored-by: alekszievr <[email protected]> Co-authored-by: hajdul88 <[email protected]> Co-authored-by: Henry Mao <[email protected]>
the one in github actions * feat: implements code retreiver * Fix: fixes unit test for codepart search * Format with Ruff 0.9.0 * Fix: deleting incorrect repo path * docs: Add LlamaIndex Cognee integration notebook Added LlamaIndex Cognee integration notebook * test: Add github action for testing llama index cognee integration notebook * fix: resolve issue with langfuse dependency installation when integrating cognee in different packages * version: Increase version to 0.1.21 * fix: update dependencies of the mcp server * Update README.md * Fix: Fixes logging setup * feat: deletes on the fly embeddings as uses edge collections * fix: Change nbformat on llama index integration notebook * fix: Resolve api key issue with llama index integration notebook * fix: Attempt to resolve issue with Ubuntu 24.04 segmentation fault * version: Increase version to 0.1.22 --------- Co-authored-by: vasilije <[email protected]> Co-authored-by: Igor Ilic <[email protected]> Co-authored-by: Igor Ilic <[email protected]> Co-authored-by: lxobr <[email protected]> Co-authored-by: alekszievr <[email protected]> Co-authored-by: hajdul88 <[email protected]> Co-authored-by: Vasilije <[email protected]> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-authored-by: Rita Aleksziev <[email protected]> Co-authored-by: Henry Mao <[email protected]>
…task
Summary by CodeRabbit

New Features
* ChunkerConfig class for flexible chunking strategy selection.

Bug Fixes

Documentation
* New chunker parameter.

Chores
* TextChunker, streamlining chunking implementation.
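The summary above centers on a ChunkerConfig class that maps chunker names to implementations, so documents can select a chunking strategy by name instead of hard-coding TextChunker. A minimal sketch of that pattern follows; the class and method names mirror the PR, but the signatures, the placeholder TextChunker body, and the error handling are assumptions for illustration, not the PR's actual code.

```python
class TextChunker:
    """Hypothetical stand-in chunker: splits text into fixed-size pieces."""

    def __init__(self, chunk_size: int = 1024):
        self.chunk_size = chunk_size

    def chunk(self, text: str) -> list[str]:
        # Slice the text into consecutive chunk_size-character windows.
        return [
            text[i : i + self.chunk_size]
            for i in range(0, len(text), self.chunk_size)
        ]


class ChunkerConfig:
    # Class-level registry mapping a chunker name to its implementation class.
    chunker_mapping = {
        "text_chunker": TextChunker,
    }

    @classmethod
    def get_chunker(cls, name: str):
        try:
            return cls.chunker_mapping[name]
        except KeyError:
            raise ValueError(
                f"Unknown chunker '{name}'. Available: {list(cls.chunker_mapping)}"
            )


# A document's read() could then take the chunker name as a parameter:
chunker_cls = ChunkerConfig.get_chunker("text_chunker")
chunks = chunker_cls(chunk_size=5).chunk("hello world")
```

One review note in this PR suggests making the class-level mapping immutable (e.g. a frozen mapping) so it cannot be mutated at runtime; the plain dict above is the simpler, mutable variant.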