feat: outsources chunking parameters to extract chunk from documents … #289
Conversation
Walkthrough: The changes involve modifications to several document processing classes.
Actionable comments posted: 6
🧹 Outside diff range and nitpick comments (4)

cognee/modules/data/processing/document_types/ChunkerMapping.py (1)

4-6: Consider making the chunker mapping immutable

The class-level dictionary could be modified at runtime. Consider using a frozen dictionary or tuple-based mapping for immutability.

```diff
- chunker_mapping = {
-     "text_chunker": TextChunker
- }
+ chunker_mapping = frozendict({
+     "text_chunker": TextChunker
+ })
```

cognee/modules/data/processing/document_types/PdfDocument.py (1)

8-8: Add type hints and update docstring

The method signature should include type hints for better code maintainability and IDE support.

```diff
- def read(self, chunk_size: int, chunker: str):
+ def read(self, chunk_size: int, chunker: str) -> Generator[str, None, None]:
+     """Read and chunk PDF document content.
+
+     Args:
+         chunk_size (int): Size of each chunk
+         chunker (str): Type of chunker to use (e.g., "text", "semantic")
+
+     Yields:
+         str: Document chunks
+
+     Raises:
+         ValueError: If chunker type is invalid
+     """
```

cognee/modules/data/processing/document_types/TextDocument.py (1)

Line range hint 7-15: Ensure proper file handle cleanup

The file handle should be properly closed even if an error occurs during reading.

```diff
  def read(self, chunk_size: int, chunker: str):
      def get_text():
-         with open(self.raw_data_location, mode = "r", encoding = "utf-8") as file:
-             while True:
-                 text = file.read(1024)
-
-                 if len(text.strip()) == 0:
-                     break
-
-                 yield text
+         BUFFER_SIZE = 1024 * 4  # 4KB buffer for better performance
+         try:
+             with open(self.raw_data_location, mode="r", encoding="utf-8") as file:
+                 while True:
+                     text = file.read(BUFFER_SIZE)
+
+                     if not text:
+                         break
+
+                     yield text
+         except IOError as e:
+             raise RuntimeError(f"Failed to read text file: {e}") from e
```

cognee/modules/data/processing/document_types/AudioDocument.py (1)

1-1: Consider architectural improvements for better maintainability

The current implementation has several areas for improvement:

- Move common chunker initialization logic to the base Document class to reduce code duplication
- Add a configuration validation system to verify chunker types and parameters at startup
- Implement a progress monitoring system for long-running operations like transcription
- Consider using the Template Method pattern for the read method

Would you like me to provide example implementations for any of these improvements?
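The Template Method suggestion can be sketched roughly as follows. This is illustrative only: `SimpleChunker` and `InMemoryTextDocument` are stand-ins invented for the example, not cognee's actual `TextChunker` or document classes.

```python
from abc import ABC, abstractmethod


class SimpleChunker:
    """Stand-in chunker: yields fixed-size slices of the streamed text."""

    def __init__(self, chunk_size, get_text):
        self.chunk_size = chunk_size
        self.get_text = get_text

    def read(self):
        buffer = ""
        for text in self.get_text():
            buffer += text
            while len(buffer) >= self.chunk_size:
                yield buffer[: self.chunk_size]
                buffer = buffer[self.chunk_size :]
        if buffer:
            yield buffer


class Document(ABC):
    def read(self, chunk_size: int, chunker: str = "text_chunker"):
        # Template method: the chunking skeleton is fixed here; real code
        # would resolve `chunker` through the chunker mapping instead of
        # hardcoding SimpleChunker.
        yield from SimpleChunker(chunk_size, self.get_text).read()

    @abstractmethod
    def get_text(self):
        """Each subclass yields raw text: file contents, transcription, OCR."""


class InMemoryTextDocument(Document):
    def __init__(self, content: str):
        self.content = content

    def get_text(self):
        yield self.content
```

With this shape, subclasses only override the variable step (`get_text`), so the chunker lookup and iteration logic exist in exactly one place.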
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (7)

- cognee/modules/data/processing/document_types/AudioDocument.py (2 hunks)
- cognee/modules/data/processing/document_types/ChunkerMapping.py (1 hunks)
- cognee/modules/data/processing/document_types/Document.py (1 hunks)
- cognee/modules/data/processing/document_types/ImageDocument.py (2 hunks)
- cognee/modules/data/processing/document_types/PdfDocument.py (1 hunks)
- cognee/modules/data/processing/document_types/TextDocument.py (2 hunks)
- cognee/tasks/documents/extract_chunks_from_documents.py (1 hunks)
🔇 Additional comments (4)
cognee/modules/data/processing/document_types/ChunkerMapping.py (1)
9-15: Add return type hint to get_chunker method

The method should specify its return type for better type safety.

```diff
- def get_chunker(cls, chunker_name: str):
+ def get_chunker(cls, chunker_name: str) -> type[TextChunker]:
```

Also, let's verify if there are other chunker implementations that should be included in the mapping.
✅ Verification successful

Let's gather more information about the chunker implementations to verify the return type and available chunkers.

Add return type hint to get_chunker method

The return type hint should be `type[TextChunker]`, as verified from the codebase:

- The chunker_mapping only contains TextChunker
- TextChunker is the base class for the chunking implementation
- No other chunker implementations were found in the codebase

```diff
- def get_chunker(cls, chunker_name: str):
+ def get_chunker(cls, chunker_name: str) -> type[TextChunker]:
```
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

```shell
#!/bin/bash
# Search for other chunker implementations
echo "Searching for potential chunker classes..."
rg -l "class.*Chunker" --type py
```

Length of output: 232

Script:

```shell
#!/bin/bash
# Check the content of both files to understand the chunker hierarchy and implementations
echo "=== Content of TextChunker.py ==="
cat cognee/modules/chunking/TextChunker.py
echo -e "\n=== Content of ChunkerMapping.py ==="
cat cognee/modules/data/processing/document_types/ChunkerMapping.py
```

Length of output: 4505
cognee/tasks/documents/extract_chunks_from_documents.py (1)
4-7: 🛠️ Refactor suggestion
Add type hints and update documentation
The function needs proper type hints and documentation for the new parameter.
```diff
-async def extract_chunks_from_documents(documents: list[Document], chunk_size: int = 1024, chunker = 'text_chunker'):
+async def extract_chunks_from_documents(
+    documents: list[Document],
+    chunk_size: int = 1024,
+    chunker: str = 'text_chunker'
+) -> AsyncIterator[str]:
+    """Extract chunks from multiple documents.
+
+    Args:
+        documents: List of documents to process
+        chunk_size: Size of each chunk
+        chunker: Type of chunker to use (default: "text_chunker")
+
+    Yields:
+        Chunks of document content
+    """
```
Let's verify if all document types support the new chunker parameter.
✅ Verification successful

Let me verify the implementation details of the read method in these document types to ensure they properly handle the chunker parameter.

Let me correct the paths and try again.

Let me try one more time with ripgrep to see the actual implementations.

Type hints and documentation changes approved - all document types support the chunker parameter

The verification shows that:

- All document types (Image, Audio, Text, PDF) implement the read method with the chunker parameter
- They all use the parameter correctly by passing it to ChunkerConfig.get_chunker
- The base Document class declares the parameter in its interface

The suggested type hints and documentation improvements are valid and should be implemented.
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

```shell
#!/bin/bash
# Check if all document types have been updated
echo "Checking document type implementations..."
rg -l "class.*Document.*\(.*\):" --type py | xargs rg "def read.*chunker.*:"
```

Length of output: 741

Script:

```shell
#!/bin/bash
# Check the implementation of read method in document types
echo "Document.py base class implementation:"
ast-grep --pattern 'def read($$$) { $$$ }' cognee/modules/data/processing/document_types/Document.py
echo -e "\nConcrete implementations:"
for file in ImageDocument.py AudioDocument.py TextDocument.py PdfDocument.py; do
  echo -e "\n$file implementation:"
  ast-grep --pattern 'def read($$$) { $$$ }' cognee/modules/data/processing/document_types/document_types/$file
done
```

Length of output: 1876

Script:

```shell
#!/bin/bash
# Check the implementation of read method in document types
echo "Document.py base class implementation:"
ast-grep --pattern 'def read($$$) { $$$ }' cognee/modules/data/processing/document_types/Document.py
echo -e "\nConcrete implementations:"
for file in ImageDocument AudioDocument TextDocument PdfDocument; do
  echo -e "\n$file implementation:"
  ast-grep --pattern 'def read($$$) { $$$ }' cognee/modules/data/processing/document_types/${file}.py
done
```

Length of output: 1247

Script:

```shell
#!/bin/bash
# Check implementations with context
echo "Document.py base class implementation:"
rg -A 5 "def read.*chunker.*:" cognee/modules/data/processing/document_types/Document.py
echo -e "\nConcrete implementations:"
for file in ImageDocument AudioDocument TextDocument PdfDocument; do
  echo -e "\n${file}.py implementation:"
  rg -A 5 "def read.*chunker.*:" cognee/modules/data/processing/document_types/${file}.py
done
```

Length of output: 2195
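To illustrate how the suggested async-generator signature behaves end to end, here is a minimal, self-contained sketch. `FakeDocument` and its `read` body are placeholders invented for the example; they only mimic the interface that the verification above confirmed, and are not cognee's real classes.

```python
import asyncio
from typing import AsyncIterator


class FakeDocument:
    """Placeholder document: mimics the read(chunk_size, chunker) interface."""

    def __init__(self, text: str):
        self.text = text

    def read(self, chunk_size: int, chunker: str):
        # Naive fixed-size slicing, standing in for a real chunker class.
        for i in range(0, len(self.text), chunk_size):
            yield self.text[i : i + chunk_size]


async def extract_chunks_from_documents(
    documents, chunk_size: int = 1024, chunker: str = "text_chunker"
) -> AsyncIterator[str]:
    # Async generator: chunks from all documents are flattened into one stream.
    for document in documents:
        for chunk in document.read(chunk_size=chunk_size, chunker=chunker):
            yield chunk


async def main():
    docs = [FakeDocument("hello world"), FakeDocument("foo")]
    return [chunk async for chunk in extract_chunks_from_documents(docs, chunk_size=5)]


chunks = asyncio.run(main())
```

Callers consume the function with `async for`, which is why the `AsyncIterator[str]` annotation in the review suggestion is the natural fit.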
cognee/modules/data/processing/document_types/ImageDocument.py (2)
3-3: LGTM: Import changes align with new chunking architecture
The import change supports the transition to configurable chunking strategies.
17-18: Verify error handling for invalid chunker types
The implementation looks clean, but we should ensure proper error handling for:
- Invalid chunker types
- Chunker instantiation failures
Let's check the error handling implementation:
✅ Verification successful
Let me gather more specific information about the ChunkerConfig implementation to verify error handling.
Let me try a different approach to locate the chunker implementation and error handling.
Based on the search results, I can now see that ChunkerConfig.get_chunker is implemented in ChunkerMapping.py and it does have proper error handling. Let me verify the complete implementation of ChunkerMapping.py.
Based on the implementation found, I can now provide the final response:
Error handling for chunker types is properly implemented
The ChunkerConfig implementation has appropriate error handling:
- Invalid chunker types raise a NotImplementedError with a descriptive message listing available options
- Chunker instantiation is handled through a clean factory pattern with type safety
The implementation is sufficient for both validation points mentioned in the review:
- Invalid chunker types ✓ (handled via NotImplementedError)
- Chunker instantiation failures ✓ (handled via type-safe factory pattern)
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

```shell
#!/bin/bash
# Check ChunkerConfig implementation for error handling
ast-grep --pattern 'class ChunkerConfig {
  $$$
  get_chunker($_) {
    $$$
  }
}'
# Look for any error handling tests
rg -l "test.*chunker" | xargs rg "invalid|error|exception"
```

Length of output: 30859

Script:

```shell
#!/bin/bash
# Search for ChunkerConfig implementation
ast-grep --pattern 'class ChunkerConfig {
  $$$
}'
# Look for chunker-related tests
rg -A 5 "test.*chunker" "cognee/tests"
# Search for chunker-related exceptions
rg "class.*Error.*chunker" -A 3
```

Length of output: 125

Script:

```shell
#!/bin/bash
# Search for chunker-related files
fd chunker
# Look for get_chunker method implementation
rg -A 10 "get_chunker"
# Search for chunker-related error handling
rg -A 5 "raise.*chunker"
```

Length of output: 2892

Script:

```shell
#!/bin/bash
# Get complete implementation of ChunkerMapping
cat cognee/modules/data/processing/document_types/ChunkerMapping.py
# Look for any chunker-related exceptions in the codebase
rg "class.*Error" cognee/modules/chunking/
```

Length of output: 620
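The factory-with-error-handling pattern that the verification describes (invalid names raising `NotImplementedError` with the available options listed) can be sketched like this. The `TextChunker` body here is a placeholder, not cognee's actual class; only the lookup-and-raise shape is the point.

```python
class TextChunker:
    """Placeholder for cognee's TextChunker."""


class ChunkerConfig:
    chunker_mapping = {"text_chunker": TextChunker}

    @classmethod
    def get_chunker(cls, chunker_name: str) -> type[TextChunker]:
        # Invalid names fail fast with a descriptive message listing options.
        if chunker_name not in cls.chunker_mapping:
            raise NotImplementedError(
                f"Chunker '{chunker_name}' is not implemented. "
                f"Available chunkers: {list(cls.chunker_mapping)}"
            )
        return cls.chunker_mapping[chunker_name]
```

Because the factory returns the class itself (`type[TextChunker]`), instantiation stays in the caller's hands and the mapping remains trivially extensible.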
Actionable comments posted: 4
🧹 Outside diff range and nitpick comments (1)
cognee/tests/integration/documents/AudioDocument_test.py (1)

Line range hint 25-25: Consider implementing a shared test suite for chunking strategies

The current implementation duplicates chunking validation across different document types. Consider:

- Creating a shared test suite for common chunking behavior
- Defining a standard interface for chunking strategy validation
- Adding integration tests that verify chunking consistency across document types

Example shared test structure:

```python
class BaseChunkerTest:
    def test_basic_chunking(self, document, chunker):
        ...  # Common chunking tests

    def test_invalid_chunker(self, document):
        ...  # Common error cases


class TestPdfDocumentChunking(BaseChunkerTest):
    ...  # PDF-specific tests


class TestTextDocumentChunking(BaseChunkerTest):
    ...  # Text-specific tests


class TestAudioDocumentChunking(BaseChunkerTest):
    ...  # Audio-specific tests
```

Also applies to: 36-36, 34-34
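A runnable version of that shared-suite idea might look as follows. `FakeTextDocument` and the contract methods are illustrative stand-ins (not cognee's actual tests); under pytest, the `test_*` methods on each concrete subclass would be collected automatically.

```python
class FakeTextDocument:
    """Illustrative document whose read() mimics the chunker interface."""

    def __init__(self, text: str):
        self.text = text

    def read(self, chunk_size: int, chunker: str):
        if chunker != "text_chunker":
            raise NotImplementedError(f"Unknown chunker: {chunker}")
        for i in range(0, len(self.text), chunk_size):
            yield self.text[i : i + chunk_size]


class BaseChunkerContract:
    """Shared assertions; subclasses only provide make_document()."""

    def make_document(self):
        raise NotImplementedError

    def test_basic_chunking(self):
        chunks = list(self.make_document().read(chunk_size=4, chunker="text_chunker"))
        assert chunks, "expected at least one chunk"
        assert all(len(chunk) <= 4 for chunk in chunks)

    def test_invalid_chunker(self):
        try:
            list(self.make_document().read(chunk_size=4, chunker="bogus"))
            raised = False
        except NotImplementedError:
            raised = True
        assert raised


class TestTextDocumentChunking(BaseChunkerContract):
    def make_document(self):
        return FakeTextDocument("some sample text")
```

Each new document type then gets contract coverage by defining one `make_document()` override.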
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)

- cognee/modules/data/processing/document_types/Document.py (1 hunks)
- cognee/tests/integration/documents/AudioDocument_test.py (1 hunks)
- cognee/tests/integration/documents/ImageDocument_test.py (1 hunks)
- cognee/tests/integration/documents/PdfDocument_test.py (1 hunks)
- cognee/tests/integration/documents/TextDocument_test.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- cognee/modules/data/processing/document_types/Document.py
🔇 Additional comments (1)
cognee/tests/integration/documents/AudioDocument_test.py (1)
34-34: Consider audio-specific chunking strategies and test coverage
While the test covers basic text chunking of transcribed audio, consider:
- Audio-specific chunking strategies (e.g., time-based chunks, speaker-based chunks)
- Testing the interaction between transcription and chunking
- Verifying that chunking preserves important audio context (e.g., speaker transitions)
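One way a time-based audio chunking strategy could look is sketched below. The `(start, end, text)` segment tuples are an assumed transcript format invented for this example; real transcription output may carry different fields (speaker labels, confidences, etc.).

```python
def chunk_by_time(segments, window_seconds=30.0):
    """Group (start, end, text) transcript segments into time windows."""
    current, window_start = [], None
    for start, end, text in segments:
        if window_start is None:
            window_start = start
        # Close the current window once it would exceed the time budget.
        if end - window_start > window_seconds and current:
            yield " ".join(current)
            current, window_start = [], start
        current.append(text)
    if current:
        yield " ".join(current)


segments = [(0.0, 10.0, "hello"), (10.0, 25.0, "world"), (25.0, 40.0, "again")]
chunks = list(chunk_by_time(segments, window_seconds=30.0))
```

Chunking on time boundaries rather than character counts keeps each chunk aligned with a contiguous stretch of audio, which is what makes speaker-transition tests meaningful.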
What about the UnstructuredDocument class? Do these changes need to be implemented there as well?
If we outsource chunkers like this, then yes, but we'll discuss this with Boris.
* feat: Add error handling in case user is already part of database and permission already given to group Added error handling in case permission is already given to group and user is already part of group Feature COG-656 * feat: Add user verification for accessing data Verify user has access to data before returning it Feature COG-656 * feat: Add compute search to cognee Add compute search to cognee which makes searches human readable Feature COG-656 * feat: Add simple instruction for system prompt Add simple instruction for system prompt Feature COG-656 * pass pydantic model tocognify * feat: Add unauth access error to getting data Raise unauth access error when trying to read data without access Feature COG-656 * refactor: Rename query compute to query completion Rename searching type from compute to completion Refactor COG-656 * chore: Update typo in code Update typo in string in code Chore COG-656 * Add mcp to cognee * Add simple README * Update cognee-mcp/mcpcognee/__main__.py Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * Create dockerhub.yml * Update get_cognify_router.py * fix: Resolve reflection issue when running cognee a second time after pruning data When running cognee a second time after pruning data some metadata doesn't get pruned. This makes cognee believe some tables exist that have been deleted Fix * fix: Add metadata reflection fix to sqlite as well Added fix when reflecting metadata to sqlite as well Fix * update * Revert "fix: Add metadata reflection fix to sqlite as well" This reverts commit 394a0b2. 
* COG-810 Implement a top-down dependency graph builder tool (#268) * feat: parse repo to call graph * Update/repo_processor/top_down_repo_parse.py task * fix: minor improvements * feat: file parsing jedi script optimisation --------- * Add type to DataPoint metadata (#364) * Add type to DataPoint metadata * Add missing index_fields * Use DataPoint UUID type in pgvector create_data_points * Make _metadata mandatory everywhere * Fixes * Fixes to our demo * feat: Add search by dataset for cognee Added ability to search by datasets for cognee users Feature COG-912 * feat: outsources chunking parameters to extract chunk from documents … (#289) * feat: outsources chunking parameters to extract chunk from documents task * fix: Remove backend lock from UI Removed lock that prevented using multiple datasets in cognify Fix COG-912 * COG 870 Remove duplicate edges from the code graph (#293) * feat: turn summarize_code into generator * feat: extract run_code_graph_pipeline, update the pipeline * feat: minimal code graph example * refactor: update argument * refactor: move run_code_graph_pipeline to cognify/code_graph_pipeline * refactor: indentation and whitespace nits * refactor: add deprecated use comments and warnings --------- Co-authored-by: Vasilije <[email protected]> Co-authored-by: Igor Ilic <[email protected]> Co-authored-by: Boris <[email protected]> * test: Added test for getting of documents for search Added test to verify getting documents related to datasets intended for search Test COG-912 * Structured code summarization (#375) * feat: turn summarize_code into generator * feat: extract run_code_graph_pipeline, update the pipeline * feat: minimal code graph example * refactor: update argument * refactor: move run_code_graph_pipeline to cognify/code_graph_pipeline * refactor: indentation and whitespace nits * refactor: add deprecated use comments and warnings * Structured code summarization * add missing prompt file * Remove summarization_model argument from 
summarize_code and fix typehinting * minor refactors --------- Co-authored-by: lxobr <[email protected]> Co-authored-by: Vasilije <[email protected]> Co-authored-by: Igor Ilic <[email protected]> Co-authored-by: Boris <[email protected]> * fix: Resolve issue with cognify router graph model default value Resolve issue with default value for graph model in cognify endpoint Fix * chore: Resolve typo in getting documents code Resolve typo in code chore COG-912 * Update .github/workflows/dockerhub.yml Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * Update .github/workflows/dockerhub.yml * Update .github/workflows/dockerhub.yml * Update .github/workflows/dockerhub.yml * Update get_cognify_router.py * fix: Resolve syntax issue with cognify router Resolve syntax issue with cognify router Fix * feat: Add ruff pre-commit hook for linting and formatting Added formatting and linting on pre-commit hook Feature COG-650 * chore: Update ruff lint options in pyproject file Update ruff lint options in pyproject file Chore * test: Add ruff linter github action Added linting check with ruff in github actions Test COG-650 * feat: deletes executor limit from get_repo_file_dependencies * feat: implements mock feature in LiteLLM engine * refactor: Remove changes to cognify router Remove changes to cognify router Refactor COG-650 * fix: fixing boolean env for github actions * test: Add test for ruff format for cognee code Test if code is formatted for cognee Test COG-650 * refactor: Rename ruff gh actions Rename ruff gh actions to be more understandable Refactor COG-650 * chore: Remove checking of ruff lint and format on push Remove checking of ruff lint and format on push Chore COG-650 * feat: Add deletion of local files when deleting data Delete local files when deleting data from cognee Feature COG-475 * fix: changes back the max workers to 12 * feat: Adds mock summary for codegraph pipeline * refacotr: Add current development status Save 
current development status Refactor * Fix langfuse * Fix langfuse * Fix langfuse * Add evaluation notebook * Rename eval notebook * chore: Add temporary state of development Add temp development state to branch Chore * fix: Add poetry.lock file, make langfuse mandatory Added langfuse as mandatory dependency, added poetry.lock file Fix * Fix: fixes langfuse config settings * feat: Add deletion of local files made by cognee through data endpoint Delete local files made by cognee when deleting data from database through endpoint Feature COG-475 * test: Revert changes on test_pgvector Revert changes on test_pgvector which were made to test deletion of local files Test COG-475 * chore: deletes the old test for the codegraph pipeline * test: Add test to verify deletion of local files Added test that checks local files created by cognee will be deleted and those not created by cognee won't Test COG-475 * chore: deletes unused old version of the codegraph * chore: deletes unused imports from code_graph_pipeline * Ingest non-code files * Fixing review findings * Ingest non-code files (#395) * Ingest non-code files * Fixing review findings * test: Update test regarding message Update assertion message, add veryfing of file existence * Handle retryerrors in code summary (#396) * Handle retryerrors in code summary * Log instead of print * fix: updates the acreate_structured_output * chore: Add logging to sentry when file which should exist can't be found Log to sentry that a file which should exist can't be found Chore COG-475 * Fix diagram * fix: refactor mcp * Add Smithery CLI installation instructions and badge * Move readme * Update README.md * Update README.md * Cog 813 source code chunks (#383) * fix: pass the list of all CodeFiles to enrichment task * feat: introduce SourceCodeChunk, update metadata * feat: get_source_code_chunks code graph pipeline task * feat: integrate get_source_code_chunks task, comment out summarize_code * Fix code summarization (#387) * feat: 
update data models * feat: naive parse long strings in source code * fix: get_non_py_files instead of get_non_code_files * fix: limit recursion, add comment * handle embedding empty input error (#398) * feat: robustly handle CodeFile source code * refactor: sort imports * todo: add support for other embedding models * feat: add custom logger * feat: add robustness to get_source_code_chunks Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * feat: improve embedding exceptions * refactor: format indents, rename module --------- Co-authored-by: alekszievr <[email protected]> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * Fix diagram * Fix instructions * adding and fixing files * Update README.md * ruff format * Fix linter issues * Implement PR review * Comment out profiling * fix: add allowed extensions * fix: adhere UnstructuredDocument.read() to Document * feat: time code graph run and add mock support * Fix ollama, work on visualization * fix: Fixes faulty logging format and sets up error logging in dynamic steps example * Overcome ContextWindowExceededError by checking token count while chunking (#413) * fix: Fixes duplicated edges in cognify by limiting the recursion depth in add datapoints * Adjust AudioDocument and handle None token limit * Handle azure models as well * Add clean logging to code graph example * Remove setting envvars from arg * fix: fixes create_cognee_style_network_with_logo unit test * fix: removes accidental remained print * Get embedding engine instead of passing it. Get it from vector engine instead of direct getter. * Fix visualization * Get embedding engine instead of passing it in code chunking. 
* Fix poetry issues * chore: Update version of poetry install action * chore: Update action to trigger on pull request for any branch * chore: Remove if in github action to allow triggering on push * chore: Remove if condition to allow gh actions to trigger on push to PR * chore: Update poetry version in github actions * chore: Set fixed ubuntu version to 22.04 * chore: Update py lint to use ubuntu 22.04 * chore: update ubuntu version to 22.04 * feat: implements the first version of graph based completion in search * chore: Update python 3.9 gh action to use 3.12 instead * chore: Update formatting of utils.py * Fix poetry issues * Adjust integration tests * fix: Fixes ruff formatting * Handle circular import * fix: Resolve profiler issue with partial and recursive logger imports Resolve issue for profiler with partial and recursive logger imports * fix: Remove logger from __init__.py file * test: Test profiling on HEAD branch * test: Return profiler to base branch * Set max_tokens in config * Adjust SWE-bench script to code graph pipeline call * Adjust SWE-bench script to code graph pipeline call * fix: Add fix for accessing dictionary elements that don't exits Using get for the text key instead of direct access to handle situation if the text key doesn't exist * feat: Add ability to change graph database configuration through cognee * feat: adds pydantic types to graph layer models * feat: adds basic retriever for swe bench * Match Ruff version in config to the one in github actions * feat: implements code retreiver * Fix: fixes unit test for codepart search * Format with Ruff 0.9.0 * Fix: deleting incorrect repo path * fix: resolve issue with langfuse dependency installation when integrating cognee in different packages * version: Increase version to 0.1.21 --------- Co-authored-by: Igor Ilic <[email protected]> Co-authored-by: Vasilije <[email protected]> Co-authored-by: Igor Ilic <[email protected]> Co-authored-by: Rita Aleksziev <[email protected]> 
Co-authored-by: vasilije <[email protected]> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-authored-by: lxobr <[email protected]> Co-authored-by: alekszievr <[email protected]> Co-authored-by: hajdul88 <[email protected]> Co-authored-by: Henry Mao <[email protected]>
the one in github actions * feat: implements code retreiver * Fix: fixes unit test for codepart search * Format with Ruff 0.9.0 * Fix: deleting incorrect repo path * docs: Add LlamaIndex Cognee integration notebook Added LlamaIndex Cognee integration notebook * test: Add github action for testing llama index cognee integration notebook * fix: resolve issue with langfuse dependency installation when integrating cognee in different packages * version: Increase version to 0.1.21 * fix: update dependencies of the mcp server * Update README.md * Fix: Fixes logging setup * feat: deletes on the fly embeddings as uses edge collections * fix: Change nbformat on llama index integration notebook * fix: Resolve api key issue with llama index integration notebook * fix: Attempt to resolve issue with Ubuntu 24.04 segmentation fault * version: Increase version to 0.1.22 --------- Co-authored-by: vasilije <[email protected]> Co-authored-by: Igor Ilic <[email protected]> Co-authored-by: Igor Ilic <[email protected]> Co-authored-by: lxobr <[email protected]> Co-authored-by: alekszievr <[email protected]> Co-authored-by: hajdul88 <[email protected]> Co-authored-by: Vasilije <[email protected]> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-authored-by: Rita Aleksziev <[email protected]> Co-authored-by: Henry Mao <[email protected]>
…task
Summary by CodeRabbit

New Features
* ChunkerConfig class for flexible chunking strategy selection.

Bug Fixes

Documentation
* New chunker parameter.

Chores
* TextChunker, streamlining chunking implementation.
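The summary above centers on a ChunkerConfig class that maps chunker names to implementations, so documents can select a chunking strategy by name instead of hard-coding TextChunker. A minimal sketch of that pattern follows; the class and method names mirror the PR, but the signatures, the placeholder TextChunker body, and the error handling are assumptions for illustration, not the PR's actual code.

```python
class TextChunker:
    """Hypothetical stand-in chunker: splits text into fixed-size pieces."""

    def __init__(self, chunk_size: int = 1024):
        self.chunk_size = chunk_size

    def chunk(self, text: str) -> list[str]:
        # Slice the text into consecutive chunk_size-character windows.
        return [
            text[i : i + self.chunk_size]
            for i in range(0, len(text), self.chunk_size)
        ]


class ChunkerConfig:
    # Class-level registry mapping a chunker name to its implementation class.
    chunker_mapping = {
        "text_chunker": TextChunker,
    }

    @classmethod
    def get_chunker(cls, name: str):
        try:
            return cls.chunker_mapping[name]
        except KeyError:
            raise ValueError(
                f"Unknown chunker '{name}'. Available: {list(cls.chunker_mapping)}"
            )


# A document's read() could then take the chunker name as a parameter:
chunker_cls = ChunkerConfig.get_chunker("text_chunker")
chunks = chunker_cls(chunk_size=5).chunk("hello world")
```

One review note in this PR suggests making the class-level mapping immutable (e.g. a frozen mapping) so it cannot be mutated at runtime; the plain dict above is the simpler, mutable variant.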