
feat: outsources chunking parameters to extract chunk from documents task #289

Conversation


@hajdul88 hajdul88 commented Dec 9, 2024

…task

Summary by CodeRabbit

  • New Features

    • Introduced a new ChunkerConfig class for flexible chunking strategy selection.
    • Updated document processing methods to support dynamic chunker selection.
  • Bug Fixes

    • Enhanced error handling for unsupported chunker names.
  • Documentation

    • Updated method signatures across various document classes to include a new chunker parameter.
  • Chores

    • Removed direct dependencies on TextChunker, streamlining chunking implementation.


coderabbitai bot commented Dec 9, 2024

Walkthrough

The changes involve modifications to several document processing classes, including AudioDocument, ImageDocument, PdfDocument, TextDocument, and Document. Each class's read method has been updated to accept an additional chunker: str parameter, facilitating dynamic selection of chunking strategies. The new ChunkerMapping.py file introduces the ChunkerConfig class, which maps chunker types to their corresponding classes. The extract_chunks_from_documents function has also been updated to include the chunker parameter, allowing for more flexible document processing.
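Based on this walkthrough, the new factory can be sketched roughly as follows. This is an illustrative reconstruction, not the exact cognee source: the `TextChunker` stub and the error-message wording are assumptions, though the review notes later in this thread confirm that unsupported names raise `NotImplementedError`.

```python
# Illustrative sketch of ChunkerMapping.ChunkerConfig as described in the
# walkthrough. The TextChunker stub stands in for cognee's real chunker class.
class TextChunker:
    def __init__(self, document, get_text, chunk_size: int = 1024):
        self.document = document
        self.get_text = get_text
        self.chunk_size = chunk_size


class ChunkerConfig:
    chunker_mapping = {
        "text_chunker": TextChunker,
    }

    @classmethod
    def get_chunker(cls, chunker_name: str):
        chunker_class = cls.chunker_mapping.get(chunker_name)
        if chunker_class is None:
            # Unsupported names raise NotImplementedError, listing the options.
            raise NotImplementedError(
                f"Chunker '{chunker_name}' is not supported. "
                f"Available chunkers: {list(cls.chunker_mapping)}"
            )
        return chunker_class
```

Callers look the chunker class up by name, so document classes no longer need a direct `TextChunker` import.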

Changes

  • cognee/modules/data/processing/document_types/AudioDocument.py: Updated read method signature to include chunker: str. Removed TextChunker import, using ChunkerConfig instead.
  • cognee/modules/data/processing/document_types/ChunkerMapping.py: Added ChunkerConfig class with get_chunker method for mapping chunker types to classes.
  • cognee/modules/data/processing/document_types/Document.py: Updated read method signature to include chunker: str as an optional parameter.
  • cognee/modules/data/processing/document_types/ImageDocument.py: Updated read method signature to include chunker: str. Removed TextChunker import, using ChunkerConfig instead.
  • cognee/modules/data/processing/document_types/PdfDocument.py: Updated read method signature to include chunker: str. Removed TextChunker import, using ChunkerConfig instead.
  • cognee/modules/data/processing/document_types/TextDocument.py: Updated read method signature to include chunker: str. Removed TextChunker import, using ChunkerConfig instead.
  • cognee/tasks/documents/extract_chunks_from_documents.py: Updated extract_chunks_from_documents function signature to include chunker parameter.
  • cognee/tests/integration/documents/AudioDocument_test.py: Updated test to call document.read with chunker='text_chunker'.
  • cognee/tests/integration/documents/ImageDocument_test.py: Updated test to call document.read with chunker='text_chunker'.
  • cognee/tests/integration/documents/PdfDocument_test.py: Updated test to call document.read with chunker='text_chunker'.
  • cognee/tests/integration/documents/TextDocument_test.py: Updated test to call document.read with chunker='text_chunker'.
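The read-method change common to the document classes above can be sketched like this. Everything here is a simplified stand-in for the real cognee classes; in particular, the `TextChunker` constructor signature and `get_text` callable are assumptions.

```python
class TextChunker:
    # Simplified stand-in; the real TextChunker lives in cognee.modules.chunking.
    def __init__(self, document, get_text, chunk_size: int):
        self.get_text = get_text
        self.chunk_size = chunk_size

    def read(self):
        for text in self.get_text():
            yield text[: self.chunk_size]


class ChunkerConfig:
    chunker_mapping = {"text_chunker": TextChunker}

    @classmethod
    def get_chunker(cls, chunker_name: str):
        return cls.chunker_mapping[chunker_name]


class TextDocument:
    def __init__(self, content: str):
        self.content = content

    def read(self, chunk_size: int, chunker: str):
        # The chunker name is resolved to a class at call time instead of
        # importing TextChunker directly.
        chunker_class = ChunkerConfig.get_chunker(chunker)
        instance = chunker_class(self, lambda: iter([self.content]), chunk_size)
        yield from instance.read()
```

The updated integration tests then simply call `document.read(...)` with `chunker="text_chunker"`.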

Possibly related PRs

  • Small cleanup pull request #201: The changes in the AudioDocument class are related to the modifications in the classify_documents function, as both involve the AudioDocument class and its instantiation, reflecting a broader context of handling different document types.

Suggested reviewers

  • Vasilije1990

Poem

In the land of code where documents dwell,
A new way to chunk, oh what a spell!
With ChunkerConfig guiding the way,
Flexible processing brightens the day.
From audio to text, all types in a row,
Let’s hop to the future, watch our systems grow! 🐰✨


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7768239 and 65e1c92.

📒 Files selected for processing (1)
  • cognee/modules/data/processing/document_types/Document.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • cognee/modules/data/processing/document_types/Document.py


@hajdul88 hajdul88 marked this pull request as draft December 9, 2024 16:59

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Outside diff range and nitpick comments (4)
cognee/modules/data/processing/document_types/ChunkerMapping.py (1)

4-6: Consider making the chunker mapping immutable

The class-level dictionary could be modified at runtime. Consider using a frozen dictionary or tuple-based mapping for immutability.

-    chunker_mapping = {
-        "text_chunker": TextChunker
-    }
+    chunker_mapping = frozendict({
+        "text_chunker": TextChunker
+    })
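Note that `frozendict` is a third-party package the suggestion leaves unimported; if adding a dependency is undesirable, the standard library's `types.MappingProxyType` gives an equivalent read-only view. A minimal sketch (the `TextChunker` stub is a placeholder, not cognee's class):

```python
from types import MappingProxyType


class TextChunker:  # placeholder for cognee's real TextChunker
    pass


# Read-only view of the mapping: mutation attempts raise TypeError at runtime.
chunker_mapping = MappingProxyType({"text_chunker": TextChunker})
```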
cognee/modules/data/processing/document_types/PdfDocument.py (1)

8-8: Add type hints and update docstring

The method signature should include type hints for better code maintainability and IDE support.

-    def read(self, chunk_size: int, chunker: str):
+    def read(self, chunk_size: int, chunker: str) -> Generator[str, None, None]:
+        """Read and chunk PDF document content.
+        
+        Args:
+            chunk_size (int): Size of each chunk
+            chunker (str): Type of chunker to use (e.g., "text", "semantic")
+            
+        Yields:
+            str: Document chunks
+            
+        Raises:
+            ValueError: If chunker type is invalid
+        """
cognee/modules/data/processing/document_types/TextDocument.py (1)

Line range hint 7-15: Ensure proper file handle cleanup

The file handle should be properly closed even if an error occurs during reading.

     def read(self, chunk_size: int, chunker: str):
         def get_text():
-            with open(self.raw_data_location, mode = "r", encoding = "utf-8") as file:
-                while True:
-                    text = file.read(1024)
+            BUFFER_SIZE = 1024 * 4  # 4KB buffer for better performance
+            try:
+                with open(self.raw_data_location, mode="r", encoding="utf-8") as file:
+                    while True:
+                        text = file.read(BUFFER_SIZE)
-                    if len(text.strip()) == 0:
-                        break
+                        if not text:
+                            break
-                    yield text
+                        yield text
+            except IOError as e:
+                raise RuntimeError(f"Failed to read text file: {e}") from e
cognee/modules/data/processing/document_types/AudioDocument.py (1)

1-1: Consider architectural improvements for better maintainability

The current implementation has several areas for improvement:

  1. Move common chunker initialization logic to the base Document class to reduce code duplication
  2. Add a configuration validation system to verify chunker types and parameters at startup
  3. Implement a progress monitoring system for long-running operations like transcription
  4. Consider using the Template Method pattern for the read method

Would you like me to provide example implementations for any of these improvements?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between ce96431 and 29a02e0.

📒 Files selected for processing (7)
  • cognee/modules/data/processing/document_types/AudioDocument.py (2 hunks)
  • cognee/modules/data/processing/document_types/ChunkerMapping.py (1 hunks)
  • cognee/modules/data/processing/document_types/Document.py (1 hunks)
  • cognee/modules/data/processing/document_types/ImageDocument.py (2 hunks)
  • cognee/modules/data/processing/document_types/PdfDocument.py (1 hunks)
  • cognee/modules/data/processing/document_types/TextDocument.py (2 hunks)
  • cognee/tasks/documents/extract_chunks_from_documents.py (1 hunks)
🔇 Additional comments (4)
cognee/modules/data/processing/document_types/ChunkerMapping.py (1)

9-15: Add return type hint to get_chunker method

The method should specify its return type for better type safety.

-    def get_chunker(cls, chunker_name: str):
+    def get_chunker(cls, chunker_name: str) -> type[TextChunker]:

Also, let's verify if there are other chunker implementations that should be included in the mapping.

✅ Verification successful

Let's gather more information about the chunker implementations to verify the return type and available chunkers.


Add return type hint to get_chunker method

The return type hint should be type[TextChunker] as verified from the codebase:

  • The chunker_mapping only contains TextChunker class
  • TextChunker is the base class for chunking implementation
  • No other chunker implementations were found in the codebase
-    def get_chunker(cls, chunker_name: str):
+    def get_chunker(cls, chunker_name: str) -> type[TextChunker]:
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for other chunker implementations
echo "Searching for potential chunker classes..."
rg -l "class.*Chunker" --type py

Length of output: 232


Script:

#!/bin/bash
# Check the content of both files to understand the chunker hierarchy and implementations
echo "=== Content of TextChunker.py ==="
cat cognee/modules/chunking/TextChunker.py

echo -e "\n=== Content of ChunkerMapping.py ==="
cat cognee/modules/data/processing/document_types/ChunkerMapping.py

Length of output: 4505

cognee/tasks/documents/extract_chunks_from_documents.py (1)

4-7: 🛠️ Refactor suggestion

Add type hints and update documentation

The function needs proper type hints and documentation for the new parameter.

-async def extract_chunks_from_documents(documents: list[Document], chunk_size: int = 1024, chunker = 'text_chunker'):
+async def extract_chunks_from_documents(
+    documents: list[Document],
+    chunk_size: int = 1024,
+    chunker: str = 'text_chunker'
+) -> AsyncIterator[str]:
+    """Extract chunks from multiple documents.
+    
+    Args:
+        documents: List of documents to process
+        chunk_size: Size of each chunk
+        chunker: Type of chunker to use (default: "text_chunker")
+    
+    Yields:
+        Chunks of document content
+    """

Let's verify if all document types support the new chunker parameter.

✅ Verification successful

Let me verify the implementation details of the read method in these document types to ensure they properly handle the chunker parameter.


Let me correct the paths and try again.


Let me try one more time with ripgrep to see the actual implementations.


Type hints and documentation changes approved - all document types support chunker parameter

The verification shows that:

  1. All document types (Image, Audio, Text, PDF) have implemented the read method with the chunker parameter
  2. They all use the parameter correctly by passing it to ChunkerConfig.get_chunker
  3. The base Document class declares the parameter in its interface

The suggested type hints and documentation improvements are valid and should be implemented.
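Under those assumptions, the task's call pattern can be sketched end to end. The `StubDocument` below is a stand-in for cognee's `Document` subclasses, whose real `read()` resolves the chunker name via `ChunkerConfig.get_chunker`.

```python
import asyncio
from typing import AsyncIterator


class StubDocument:
    # Stand-in for cognee's Document subclasses.
    def __init__(self, text: str):
        self.text = text

    def read(self, chunk_size: int, chunker: str):
        # Naive fixed-size splitting in place of the real chunker lookup.
        for i in range(0, len(self.text), chunk_size):
            yield self.text[i : i + chunk_size]


async def extract_chunks_from_documents(
    documents, chunk_size: int = 1024, chunker: str = "text_chunker"
) -> AsyncIterator[str]:
    # Mirrors the suggested signature: the chunker name is threaded through
    # to each document's read() method.
    for document in documents:
        for chunk in document.read(chunk_size=chunk_size, chunker=chunker):
            yield chunk


async def main():
    docs = [StubDocument("abcdef"), StubDocument("ghij")]
    return [chunk async for chunk in extract_chunks_from_documents(docs, chunk_size=4)]


chunks = asyncio.run(main())
```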

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if all document types have been updated
echo "Checking document type implementations..."
rg -l "class.*Document.*\(.*\):" --type py | xargs rg "def read.*chunker.*:"

Length of output: 741


Script:

#!/bin/bash
# Check the implementation of read method in document types
echo "Document.py base class implementation:"
ast-grep --pattern 'def read($$$) { $$$ }' cognee/modules/data/processing/document_types/Document.py

echo -e "\nConcrete implementations:"
for file in ImageDocument.py AudioDocument.py TextDocument.py PdfDocument.py; do
    echo -e "\n$file implementation:"
    ast-grep --pattern 'def read($$$) { $$$ }' cognee/modules/data/processing/document_types/document_types/$file
done

Length of output: 1876


Script:

#!/bin/bash
# Check the implementation of read method in document types
echo "Document.py base class implementation:"
ast-grep --pattern 'def read($$$) { $$$ }' cognee/modules/data/processing/document_types/Document.py

echo -e "\nConcrete implementations:"
for file in ImageDocument AudioDocument TextDocument PdfDocument; do
    echo -e "\n$file implementation:"
    ast-grep --pattern 'def read($$$) { $$$ }' cognee/modules/data/processing/document_types/${file}.py
done

Length of output: 1247


Script:

#!/bin/bash
# Check implementations with context
echo "Document.py base class implementation:"
rg -A 5 "def read.*chunker.*:" cognee/modules/data/processing/document_types/Document.py

echo -e "\nConcrete implementations:"
for file in ImageDocument AudioDocument TextDocument PdfDocument; do
    echo -e "\n${file}.py implementation:"
    rg -A 5 "def read.*chunker.*:" cognee/modules/data/processing/document_types/${file}.py
done

Length of output: 2195

cognee/modules/data/processing/document_types/ImageDocument.py (2)

3-3: LGTM: Import changes align with new chunking architecture

The import change supports the transition to configurable chunking strategies.


17-18: Verify error handling for invalid chunker types

The implementation looks clean, but we should ensure proper error handling for:

  1. Invalid chunker types
  2. Chunker instantiation failures

Let's check the error handling implementation:

✅ Verification successful

Let me gather more specific information about the ChunkerConfig implementation to verify error handling.


Let me try a different approach to locate the chunker implementation and error handling.


Based on the search results, I can now see that ChunkerConfig.get_chunker is implemented in ChunkerMapping.py and it does have proper error handling. Let me verify the complete implementation of ChunkerMapping.py.


Based on the implementation found, I can now provide the final response:

Error handling for chunker types is properly implemented

The ChunkerConfig implementation has appropriate error handling:

  • Invalid chunker types raise a NotImplementedError with a descriptive message listing available options
  • Chunker instantiation is handled through a clean factory pattern with type safety

The implementation is sufficient for both validation points mentioned in the review:

  1. Invalid chunker types ✓ (handled via NotImplementedError)
  2. Chunker instantiation failures ✓ (handled via type-safe factory pattern)
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check ChunkerConfig implementation for error handling
ast-grep --pattern 'class ChunkerConfig {
  $$$
  get_chunker($_) {
    $$$
  }
}'

# Look for any error handling tests
rg -l "test.*chunker" | xargs rg "invalid|error|exception"

Length of output: 30859


Script:

#!/bin/bash
# Search for ChunkerConfig implementation
ast-grep --pattern 'class ChunkerConfig {
  $$$
}'

# Look for chunker-related tests
rg -A 5 "test.*chunker" "cognee/tests"

# Search for chunker-related exceptions
rg "class.*Error.*chunker" -A 3

Length of output: 125


Script:

#!/bin/bash
# Search for chunker-related files
fd chunker

# Look for get_chunker method implementation
rg -A 10 "get_chunker"

# Search for chunker-related error handling
rg -A 5 "raise.*chunker"

Length of output: 2892


Script:

#!/bin/bash
# Get complete implementation of ChunkerMapping
cat cognee/modules/data/processing/document_types/ChunkerMapping.py

# Look for any chunker-related exceptions in the codebase
rg "class.*Error" cognee/modules/chunking/

Length of output: 620

@borisarzentar borisarzentar changed the base branch from main to dev December 11, 2024 20:56
@hajdul88 hajdul88 marked this pull request as ready for review December 12, 2024 07:47

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Outside diff range and nitpick comments (1)
cognee/tests/integration/documents/AudioDocument_test.py (1)

Line range hint 25-25: Consider implementing a shared test suite for chunking strategies

The current implementation duplicates chunking validation across different document types. Consider:

  1. Creating a shared test suite for common chunking behavior
  2. Defining a standard interface for chunking strategy validation
  3. Adding integration tests that verify chunking consistency across document types

Example shared test structure:

class BaseChunkerTest:
    def test_basic_chunking(self, document, chunker):
        # Common chunking tests
        ...

    def test_invalid_chunker(self, document):
        # Common error cases
        ...

class TestPdfDocumentChunking(BaseChunkerTest):
    # PDF-specific tests
    ...

class TestTextDocumentChunking(BaseChunkerTest):
    # Text-specific tests
    ...

class TestAudioDocumentChunking(BaseChunkerTest):
    # Audio-specific tests
    ...

Also applies to: 36-36, 34-34

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 29a02e0 and 7768239.

📒 Files selected for processing (5)
  • cognee/modules/data/processing/document_types/Document.py (1 hunks)
  • cognee/tests/integration/documents/AudioDocument_test.py (1 hunks)
  • cognee/tests/integration/documents/ImageDocument_test.py (1 hunks)
  • cognee/tests/integration/documents/PdfDocument_test.py (1 hunks)
  • cognee/tests/integration/documents/TextDocument_test.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • cognee/modules/data/processing/document_types/Document.py
🔇 Additional comments (1)
cognee/tests/integration/documents/AudioDocument_test.py (1)

34-34: Consider audio-specific chunking strategies and test coverage

While the test covers basic text chunking of transcribed audio, consider:

  1. Audio-specific chunking strategies (e.g., time-based chunks, speaker-based chunks)
  2. Testing the interaction between transcription and chunking
  3. Verifying that chunking preserves important audio context (e.g., speaker transitions)
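As an illustration of the first suggestion, a time-based strategy could group transcript segments into fixed-duration windows. This is purely hypothetical: neither this function nor the `(start_time, text)` segment format exists in cognee.

```python
def chunk_by_time(segments, window_seconds: float = 30.0):
    """Group (start_time, text) transcript segments into fixed time windows."""
    windows: dict[int, list[str]] = {}
    for start, text in segments:
        # Index of the window this segment starts in.
        window = int(start // window_seconds)
        windows.setdefault(window, []).append(text)
    # Emit windows in chronological order, skipping empty ones.
    return [" ".join(windows[w]) for w in sorted(windows)]
```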

@dexters1 (Collaborator)

What about the UnstructuredDocument class? Do these changes need to be implemented there as well?

@hajdul88 (Contributor, Author)

What about the UnstructuredDocument class? Do these changes need to be implemented there as well?

If we outsource chunkers like this, then yes, but we should discuss this with Boris.

@borisarzentar borisarzentar merged commit 9e7ab64 into dev Dec 17, 2024
26 checks passed
@borisarzentar borisarzentar deleted the feature/cog-788-identifying-and-outsourcing-pipeline-parameters-in-cognee branch December 17, 2024 10:31
borisarzentar added a commit that referenced this pull request Jan 10, 2025
* feat: Add error handling in case user is already part of database and permission already given to group

Added error handling in case permission is already given to group and user is already part of group

Feature COG-656

* feat: Add user verification for accessing data

Verify user has access to data before returning it

Feature COG-656

* feat: Add compute search to cognee

Add compute search to cognee which makes searches human readable

Feature COG-656

* feat: Add simple instruction for system prompt

Add simple instruction for system prompt

Feature COG-656

* pass pydantic model tocognify

* feat: Add unauth access error to getting data

Raise unauth access error when trying to read data without access

Feature COG-656

* refactor: Rename query compute to query completion

Rename searching type from compute to completion

Refactor COG-656

* chore: Update typo in code

Update typo in string in code

Chore COG-656

* Add mcp to cognee

* Add simple README

* Update cognee-mcp/mcpcognee/__main__.py

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* Create dockerhub.yml

* Update get_cognify_router.py

* fix: Resolve reflection issue when running cognee a second time after pruning data

When running cognee a second time after pruning data some metadata doesn't get pruned.
This makes cognee believe some tables exist that have been deleted

Fix

* fix: Add metadata reflection fix to sqlite as well

Added fix when reflecting metadata to sqlite as well

Fix

* update

* Revert "fix: Add metadata reflection fix to sqlite as well"

This reverts commit 394a0b2.

* COG-810 Implement a top-down dependency graph builder tool (#268)

* feat: parse repo to call graph

* Update/repo_processor/top_down_repo_parse.py task

* fix: minor improvements

* feat: file parsing jedi script optimisation

---------

* Add type to DataPoint metadata (#364)

* Add type to DataPoint metadata

* Add missing index_fields

* Use DataPoint UUID type in pgvector create_data_points

* Make _metadata mandatory everywhere

* Fixes

* Fixes to our demo

* feat: Add search by dataset for cognee

Added ability to search by datasets for cognee users

Feature COG-912

* feat: outsources chunking parameters to extract chunk from documents … (#289)

* feat: outsources chunking parameters to extract chunk from documents task

* fix: Remove backend lock from UI

Removed lock that prevented using multiple datasets in cognify

Fix COG-912

* COG 870 Remove duplicate edges from the code graph (#293)

* feat: turn summarize_code into generator

* feat: extract run_code_graph_pipeline, update the pipeline

* feat: minimal code graph example

* refactor: update argument

* refactor: move run_code_graph_pipeline to cognify/code_graph_pipeline

* refactor: indentation and whitespace nits

* refactor: add deprecated use comments and warnings

---------

Co-authored-by: Vasilije <[email protected]>
Co-authored-by: Igor Ilic <[email protected]>
Co-authored-by: Boris <[email protected]>

* test: Added test for getting of documents for search

Added test to verify getting documents related to datasets intended for search

Test COG-912

* Structured code summarization (#375)

* feat: turn summarize_code into generator

* feat: extract run_code_graph_pipeline, update the pipeline

* feat: minimal code graph example

* refactor: update argument

* refactor: move run_code_graph_pipeline to cognify/code_graph_pipeline

* refactor: indentation and whitespace nits

* refactor: add deprecated use comments and warnings

* Structured code summarization

* add missing prompt file

* Remove summarization_model argument from summarize_code and fix typehinting

* minor refactors

---------

Co-authored-by: lxobr <[email protected]>
Co-authored-by: Vasilije <[email protected]>
Co-authored-by: Igor Ilic <[email protected]>
Co-authored-by: Boris <[email protected]>

* fix: Resolve issue with cognify router graph model default value

Resolve issue with default value for graph model in cognify endpoint

Fix

* chore: Resolve typo in getting documents code

Resolve typo in code

chore COG-912

* Update .github/workflows/dockerhub.yml

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* Update .github/workflows/dockerhub.yml

* Update .github/workflows/dockerhub.yml

* Update .github/workflows/dockerhub.yml

* Update get_cognify_router.py

* fix: Resolve syntax issue with cognify router

Resolve syntax issue with cognify router

Fix

* feat: Add ruff pre-commit hook for linting and formatting

Added formatting and linting on pre-commit hook

Feature COG-650

* chore: Update ruff lint options in pyproject file

Update ruff lint options in pyproject file

Chore

* test: Add ruff linter github action

Added linting check with ruff in github actions

Test COG-650

* feat: deletes executor limit from get_repo_file_dependencies

* feat: implements mock feature in LiteLLM engine

* refactor: Remove changes to cognify router

Remove changes to cognify router

Refactor COG-650

* fix: fixing boolean env for github actions

* test: Add test for ruff format for cognee code

Test if code is formatted for cognee

Test COG-650

* refactor: Rename ruff gh actions

Rename ruff gh actions to be more understandable

Refactor COG-650

* chore: Remove checking of ruff lint and format on push

Remove checking of ruff lint and format on push

Chore COG-650

* feat: Add deletion of local files when deleting data

Delete local files when deleting data from cognee

Feature COG-475

* fix: changes back the max workers to 12

* feat: Adds mock summary for codegraph pipeline

* refacotr: Add current development status

Save current development status

Refactor

* Fix langfuse

* Fix langfuse

* Fix langfuse

* Add evaluation notebook

* Rename eval notebook

* chore: Add temporary state of development

Add temp development state to branch

Chore

* fix: Add poetry.lock file, make langfuse mandatory

Added langfuse as mandatory dependency, added poetry.lock file

Fix

* Fix: fixes langfuse config settings

* feat: Add deletion of local files made by cognee through data endpoint

Delete local files made by cognee when deleting data from database through endpoint

Feature COG-475

* test: Revert changes on test_pgvector

Revert changes on test_pgvector which were made to test deletion of local files

Test COG-475

* chore: deletes the old test for the codegraph pipeline

* test: Add test to verify deletion of local files

Added test that checks local files created by cognee will be deleted and those not created by cognee won't

Test COG-475

* chore: deletes unused old version of the codegraph

* chore: deletes unused imports from code_graph_pipeline

* Ingest non-code files

* Fixing review findings

* Ingest non-code files (#395)

* Ingest non-code files

* Fixing review findings

* test: Update test regarding message

Update assertion message, add veryfing of file existence

* Handle retryerrors in code summary (#396)

* Handle retryerrors in code summary

* Log instead of print

* fix: updates the acreate_structured_output

* chore: Add logging to sentry when file which should exist can't be found

Log to sentry that a file which should exist can't be found

Chore COG-475

* Fix diagram

* fix: refactor mcp

* Add Smithery CLI installation instructions and badge

* Move readme

* Update README.md

* Update README.md

* Cog 813 source code chunks (#383)

* fix: pass the list of all CodeFiles to enrichment task

* feat: introduce SourceCodeChunk, update metadata

* feat: get_source_code_chunks code graph pipeline task

* feat: integrate get_source_code_chunks task, comment out summarize_code

* Fix code summarization (#387)

* feat: update data models

* feat: naive parse long strings in source code

* fix: get_non_py_files instead of get_non_code_files

* fix: limit recursion, add comment

* handle embedding empty input error (#398)

* feat: robustly handle CodeFile source code

* refactor: sort imports

* todo: add support for other embedding models

* feat: add custom logger

* feat: add robustness to get_source_code_chunks

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* feat: improve embedding exceptions

* refactor: format indents, rename module

---------

Co-authored-by: alekszievr <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* Fix diagram

* Fix instructions

* adding and fixing files

* Update README.md

* ruff format

* Fix linter issues

* Implement PR review

* Comment out profiling

* fix: add allowed extensions

* fix: adhere UnstructuredDocument.read() to Document

* feat: time code graph run and add mock support

* Fix ollama, work on visualization

* fix: Fixes faulty logging format and sets up error logging in dynamic steps example

* Overcome ContextWindowExceededError by checking token count while chunking (#413)

* fix: Fixes duplicated edges in cognify by limiting the recursion depth in add datapoints

* Adjust AudioDocument and handle None token limit

* Handle azure models as well

* Add clean logging to code graph example

* Remove setting envvars from arg

* fix: fixes create_cognee_style_network_with_logo unit test

* fix: removes accidental remained print

* Get embedding engine instead of passing it. Get it from vector engine instead of direct getter.

* Fix visualization

* Get embedding engine instead of passing it in code chunking.

* Fix poetry issues

* chore: Update version of poetry install action

* chore: Update action to trigger on pull request for any branch

* chore: Remove if in github action to allow triggering on push

* chore: Remove if condition to allow gh actions to trigger on push to PR

* chore: Update poetry version in github actions

* chore: Set fixed ubuntu version to 22.04

* chore: Update py lint to use ubuntu 22.04

* chore: update ubuntu version to 22.04

* feat: implements the first version of graph based completion in search

* chore: Update python 3.9 gh action to use 3.12 instead

* chore: Update formatting of utils.py

* Fix poetry issues

* Adjust integration tests

* fix: Fixes ruff formatting

* Handle circular import

* fix: Resolve profiler issue with partial and recursive logger imports

Resolve issue for profiler with partial and recursive logger imports
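
A minimal sketch of the deferred-import pattern that resolves this kind of partial/recursive import (the function name is illustrative, not the actual code): moving the import into the function body means it is resolved at call time rather than at module import time, breaking the cycle.

```python
# Illustrative sketch: deferring the logger import to call time avoids a
# circular import between the profiler module and the logging setup module.

def get_profiler_logger():
    # Local (deferred) import: resolved when the function runs, not when
    # this module is first imported.
    import logging
    return logging.getLogger("profiler")
```
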

* fix: Remove logger from __init__.py file

* test: Test profiling on HEAD branch

* test: Return profiler to base branch

* Set max_tokens in config

* Adjust SWE-bench script to code graph pipeline call

* Adjust SWE-bench script to code graph pipeline call

* fix: Add fix for accessing dictionary elements that don't exist

Using get for the text key instead of direct access to handle the situation where the text key doesn't exist
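
The `dict.get` fix described above can be illustrated with a minimal snippet (the record shape here is a hypothetical stand-in, not the actual payload):

```python
# Illustrative sketch: dict.get avoids a KeyError when the "text" key may be
# missing, returning a default instead of raising.

record = {"id": 42}            # hypothetical record with no "text" key
text = record.get("text", "")  # safe access: "" when the key is absent
```
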

* feat: Add ability to change graph database configuration through cognee

* feat: adds pydantic types to graph layer models

* feat: adds basic retriever for swe bench

* Match Ruff version in config to the one in github actions

* feat: implements code retriever

* Fix: fixes unit test for codepart search

* Format with Ruff 0.9.0

* Fix: deleting incorrect repo path

* fix: resolve issue with langfuse dependency installation when integrating cognee in different packages

* version: Increase version to 0.1.21

---------

Co-authored-by: Igor Ilic <[email protected]>
Co-authored-by: Vasilije <[email protected]>
Co-authored-by: Igor Ilic <[email protected]>
Co-authored-by: Rita Aleksziev <[email protected]>
Co-authored-by: vasilije <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: lxobr <[email protected]>
Co-authored-by: alekszievr <[email protected]>
Co-authored-by: hajdul88 <[email protected]>
Co-authored-by: Henry Mao <[email protected]>
borisarzentar added a commit that referenced this pull request Jan 13, 2025
* Revert "fix: Add metadata reflection fix to sqlite as well"

This reverts commit 394a0b2.

* COG-810 Implement a top-down dependency graph builder tool (#268)

* feat: parse repo to call graph

* Update/repo_processor/top_down_repo_parse.py task

* fix: minor improvements

* feat: file parsing jedi script optimisation

---------

* Add type to DataPoint metadata (#364)

* Add missing index_fields

* Use DataPoint UUID type in pgvector create_data_points

* Make _metadata mandatory everywhere

* feat: Add search by dataset for cognee

Added ability to search by datasets for cognee users

Feature COG-912

* feat: outsources chunking parameters to extract chunk from documents … (#289)

* feat: outsources chunking parameters to extract chunk from documents task
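
The outsourced chunker selection can be sketched as a name-to-class registry along the lines of the `ChunkerConfig` this PR introduces; the class bodies and the exact mapping keys below are assumptions for illustration, not the real implementations.

```python
# Illustrative sketch: map a chunker name to its class, raising on
# unsupported names, so documents' read() can select a chunker dynamically.

class TextChunker:  # stand-in for the real TextChunker class
    pass

class ChunkerConfig:
    _mapping = {"text_chunker": TextChunker}

    @classmethod
    def get_chunker(cls, chunker_name: str):
        try:
            return cls._mapping[chunker_name]
        except KeyError:
            raise NotImplementedError(f"Unsupported chunker: {chunker_name}")
```
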

* fix: Remove backend lock from UI

Removed lock that prevented using multiple datasets in cognify

Fix COG-912

* COG 870 Remove duplicate edges from the code graph (#293)

* feat: turn summarize_code into generator

* feat: extract run_code_graph_pipeline, update the pipeline

* feat: minimal code graph example

* refactor: update argument

* refactor: move run_code_graph_pipeline to cognify/code_graph_pipeline

* refactor: indentation and whitespace nits

* refactor: add deprecated use comments and warnings

---------

Co-authored-by: Vasilije <[email protected]>
Co-authored-by: Igor Ilic <[email protected]>
Co-authored-by: Boris <[email protected]>

* test: Added test for getting of documents for search

Added test to verify getting documents related to datasets intended for search

Test COG-912

* Structured code summarization (#375)

* feat: turn summarize_code into generator

* feat: extract run_code_graph_pipeline, update the pipeline

* feat: minimal code graph example

* refactor: update argument

* refactor: move run_code_graph_pipeline to cognify/code_graph_pipeline

* refactor: indentation and whitespace nits

* refactor: add deprecated use comments and warnings

* Structured code summarization

* add missing prompt file

* Remove summarization_model argument from summarize_code and fix typehinting

* minor refactors

---------

Co-authored-by: lxobr <[email protected]>
Co-authored-by: Vasilije <[email protected]>
Co-authored-by: Igor Ilic <[email protected]>
Co-authored-by: Boris <[email protected]>

* fix: Resolve issue with cognify router graph model default value

Resolve issue with default value for graph model in cognify endpoint

Fix

* chore: Resolve typo in getting documents code

Resolve typo in code

chore COG-912

* Update .github/workflows/dockerhub.yml

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* Update .github/workflows/dockerhub.yml

* Update .github/workflows/dockerhub.yml

* Update .github/workflows/dockerhub.yml

* Update get_cognify_router.py

* fix: Resolve syntax issue with cognify router

Resolve syntax issue with cognify router

Fix

* feat: Add ruff pre-commit hook for linting and formatting

Added formatting and linting on pre-commit hook

Feature COG-650

* chore: Update ruff lint options in pyproject file

Update ruff lint options in pyproject file

Chore

* test: Add ruff linter github action

Added linting check with ruff in github actions

Test COG-650

* feat: deletes executor limit from get_repo_file_dependencies

* feat: implements mock feature in LiteLLM engine

* refactor: Remove changes to cognify router

Remove changes to cognify router

Refactor COG-650

* fix: fixing boolean env for github actions

* test: Add test for ruff format for cognee code

Test if code is formatted for cognee

Test COG-650

* refactor: Rename ruff gh actions

Rename ruff gh actions to be more understandable

Refactor COG-650

* chore: Remove checking of ruff lint and format on push

Remove checking of ruff lint and format on push

Chore COG-650

* feat: Add deletion of local files when deleting data

Delete local files when deleting data from cognee

Feature COG-475

* fix: changes back the max workers to 12

* feat: Adds mock summary for codegraph pipeline

* refactor: Add current development status

Save current development status

Refactor

* Fix langfuse

* Fix langfuse

* Fix langfuse

* Add evaluation notebook

* Rename eval notebook

* chore: Add temporary state of development

Add temp development state to branch

Chore

* fix: Add poetry.lock file, make langfuse mandatory

Added langfuse as mandatory dependency, added poetry.lock file

Fix

* Fix: fixes langfuse config settings

* feat: Add deletion of local files made by cognee through data endpoint

Delete local files made by cognee when deleting data from database through endpoint

Feature COG-475

* test: Revert changes on test_pgvector

Revert changes on test_pgvector which were made to test deletion of local files

Test COG-475

* chore: deletes the old test for the codegraph pipeline

* test: Add test to verify deletion of local files

Added test that checks local files created by cognee will be deleted and those not created by cognee won't

Test COG-475

* chore: deletes unused old version of the codegraph

* chore: deletes unused imports from code_graph_pipeline

* Ingest non-code files

* Fixing review findings

* Ingest non-code files (#395)

* Ingest non-code files

* Fixing review findings

* test: Update test regarding message

Update assertion message, add veryfing of file existence

* Handle retryerrors in code summary (#396)

* Handle retryerrors in code summary

* Log instead of print

* fix: updates the acreate_structured_output

* chore: Add logging to sentry when file which should exist can't be found

Log to sentry that a file which should exist can't be found

Chore COG-475

* Fix diagram

* fix: refactor mcp

* Add Smithery CLI installation instructions and badge

* Move readme

* Update README.md

* Update README.md

* Cog 813 source code chunks (#383)

* fix: pass the list of all CodeFiles to enrichment task

* feat: introduce SourceCodeChunk, update metadata

* feat: get_source_code_chunks code graph pipeline task

* feat: integrate get_source_code_chunks task, comment out summarize_code

* Fix code summarization (#387)

* feat: update data models

* feat: naive parse long strings in source code

* fix: get_non_py_files instead of get_non_code_files

* fix: limit recursion, add comment

* handle embedding empty input error (#398)

* feat: robustly handle CodeFile source code

* refactor: sort imports

* todo: add support for other embedding models

* feat: add custom logger

* feat: add robustness to get_source_code_chunks

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* feat: improve embedding exceptions

* refactor: format indents, rename module

---------

Co-authored-by: alekszievr <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* Fix diagram

* Fix diagram

* Fix instructions

* Fix instructions

* adding and fixing files

* Update README.md

* ruff format

* Fix linter issues

* Fix linter issues

* Fix linter issues

* Fix linter issues

* Fix linter issues

* Fix linter issues

* Fix linter issues

* Fix linter issues

* Fix linter issues

* Fix linter issues

* Implement PR review

* Comment out profiling

* Comment out profiling

* Comment out profiling

* fix: add allowed extensions

* fix: adhere UnstructuredDocument.read() to Document

* feat: time code graph run and add mock support

* Fix ollama, work on visualization

* fix: Fixes faulty logging format and sets up error logging in dynamic steps example

* Overcome ContextWindowExceededError by checking token count while chunking (#413)

* fix: Fixes duplicated edges in cognify by limiting the recursion depth in add datapoints

* Adjust AudioDocument and handle None token limit

* Handle azure models as well

* Fix visualization

* Fix visualization

* Fix visualization

* Add clean logging to code graph example

* Remove setting envvars from arg

* fix: fixes create_cognee_style_network_with_logo unit test

* fix: removes accidentally remaining print

* Fix visualization

* Fix visualization

* Fix visualization

* Get embedding engine instead of passing it. Get it from vector engine instead of direct getter.

* Fix visualization

* Fix visualization

* Fix poetry issues

* Get embedding engine instead of passing it in code chunking.

* Fix poetry issues

* Fix poetry issues

* Fix poetry issues

* Fix poetry issues

* Fix poetry issues

* Fix poetry issues

* Fix poetry issues

* Fix poetry issues

* chore: Update version of poetry install action

* chore: Update action to trigger on pull request for any branch

* chore: Remove if in github action to allow triggering on push

* chore: Remove if condition to allow gh actions to trigger on push to PR

* chore: Update poetry version in github actions

* chore: Set fixed ubuntu version to 22.04

* chore: Update py lint to use ubuntu 22.04

* chore: update ubuntu version to 22.04

* feat: implements the first version of graph based completion in search

* chore: Update python 3.9 gh action to use 3.12 instead

* chore: Update formatting of utils.py

* Fix poetry issues

* Fix poetry issues

* Fix poetry issues

* Fix poetry issues

* Fix poetry issues

* Fix poetry issues

* Fix poetry issues

* Adjust integration tests

* fix: Fixes ruff formatting

* Handle circular import

* fix: Resolve profiler issue with partial and recursive logger imports

Resolve issue for profiler with partial and recursive logger imports

* fix: Remove logger from __init__.py file

* test: Test profiling on HEAD branch

* test: Return profiler to base branch

* Set max_tokens in config

* Adjust SWE-bench script to code graph pipeline call

* Adjust SWE-bench script to code graph pipeline call

* fix: Add fix for accessing dictionary elements that don't exist

Using get for the text key instead of direct access to handle the situation where the text key doesn't exist

* feat: Add ability to change graph database configuration through cognee

* feat: adds pydantic types to graph layer models

* test: Test ubuntu 24.04

* test: change all actions to ubuntu-latest

* feat: adds basic retriever for swe bench

* Match Ruff version in config to the one in github actions

* feat: implements code retriever

* Fix: fixes unit test for codepart search

* Format with Ruff 0.9.0

* Fix: deleting incorrect repo path

* docs: Add LlamaIndex Cognee integration notebook

Added LlamaIndex Cognee integration notebook

* test: Add github action for testing llama index cognee integration notebook

* fix: resolve issue with langfuse dependency installation when integrating cognee in different packages

* version: Increase version to 0.1.21

* fix: update dependencies of the mcp server

* Update README.md

* Fix: Fixes logging setup

* feat: deletes on-the-fly embeddings as it uses edge collections

* fix: Change nbformat on llama index integration notebook

* fix: Resolve api key issue with llama index integration notebook

* fix: Attempt to resolve issue with Ubuntu 24.04 segmentation fault

* version: Increase version to 0.1.22

---------

Co-authored-by: vasilije <[email protected]>
Co-authored-by: Igor Ilic <[email protected]>
Co-authored-by: Igor Ilic <[email protected]>
Co-authored-by: lxobr <[email protected]>
Co-authored-by: alekszievr <[email protected]>
Co-authored-by: hajdul88 <[email protected]>
Co-authored-by: Vasilije <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: Rita Aleksziev <[email protected]>
Co-authored-by: Henry Mao <[email protected]>