
Feature/cog 971 preparing swe bench run #424

Merged
merged 8 commits into dev from feature/cog-971-preparing-swe-bench-run on Jan 10, 2025

Conversation

@hajdul88 hajdul88 (Contributor) commented Jan 10, 2025

Summary by CodeRabbit

Based on the comprehensive summary, here are the updated release notes:

  • Configuration

    • Added optional max_tokens configuration to control token limits during document processing.
  • Document Processing

    • Enhanced document chunking with new token-based limit functionality.
    • Improved support for various document types (PDF, Text, Image, Audio, Unstructured).
  • Code Retrieval

    • Updated code search functionality to optionally include document context.
    • Refined code part extraction and dependency graph generation.
  • Type Identification

    • Added pydantic_type attributes to multiple data models for improved type tracking.
  • Performance

    • Streamlined embedding and chunking processes.
    • Added optional execution time tracking for code graph processing.

These updates provide more flexible document processing, enhanced code retrieval, and improved type management across the system.

coderabbitai bot (Contributor) commented Jan 10, 2025

Walkthrough

The changes span multiple files across the Cognee project, introducing a new optional max_tokens parameter to various document processing and chunking functions. This parameter allows more granular control over token limits during text processing. Additionally, several classes received a new pydantic_type attribute for enhanced type identification. The modifications streamline code graph pipeline functionality, improve document chunking flexibility, and add more robust type metadata to data models.
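
For orientation, a minimal usage sketch of the new parameter, based on the extract_chunks_from_documents signature quoted later in this review (the import path mirrors the file path; the 512-token limit is illustrative):

from cognee.tasks.documents.extract_chunks_from_documents import (
    extract_chunks_from_documents,
)

async def collect_chunks(documents):
    # max_tokens caps the token count of each chunk on top of the
    # word-based chunk_size limit; 512 is an arbitrary example value.
    return [
        chunk
        async for chunk in extract_chunks_from_documents(
            documents, chunk_size=1024, chunker="text_chunker", max_tokens=512
        )
    ]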

Changes

File Path | Change Summary
cognee/api/v1/cognify/code_graph_pipeline.py | Simplified task definitions by removing embedding engine and adding max_tokens parameter
cognee/modules/chunking/TextChunker.py | Added max_tokens parameter and new method for token count management
cognee/modules/data/processing/document_types/* | Updated read methods across document types to include max_tokens
cognee/modules/cognify/config.py | Added max_tokens configuration option
cognee/shared/CodeGraphEntities.py | Added pydantic_type to multiple classes
cognee/shared/data_models.py | Added pydantic_type to summary-related classes
cognee/tasks/chunks/chunk_by_paragraph.py | Integrated token-based chunking logic
cognee/tasks/documents/extract_chunks_from_documents.py | Added max_tokens parameter to chunk extraction
cognee/tasks/repo_processor/get_source_code_chunks.py | Simplified embedding model handling
cognee/modules/retrieval/description_to_codepart_search.py | Updated function signatures to include include_docs parameter
cognee/tasks/repo_processor/get_non_code_files.py | Introduced new allowed file extensions for processing
cognee/tasks/repo_processor/expand_dependency_graph.py | Added private function for handling code parts in dependency graph

Poem

🐰 A Rabbit's Ode to Token Limits

In chunks of code, where tokens dance and play,
A limit now controls their merry way.
With max_tokens as our guiding light,
We parse and slice with precision bright.
Hop, hop, hurray! 🎉

Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

🧹 Nitpick comments (22)
cognee/modules/retrieval/description_to_codepart_search.py (2)

48-48: Remove the debugging print statement

The print(include_docs) statement appears to be for debugging purposes. It's recommended to remove it or replace it with a proper logging statement to maintain clean code in production.
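
For example, a minimal logging-based replacement (logger naming is illustrative):

import logging

logger = logging.getLogger(__name__)

# Log at debug level instead of printing unconditionally
logger.debug("include_docs=%s", include_docs)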


130-132: Add separators when concatenating code pieces

When concatenating code pieces into context, consider adding separators or newlines between them to improve readability and maintainability.

Apply this change:

context = "\n\n".join(
    code_piece.get_attribute("source_code") for code_piece in code_pieces_to_return
)
cognee/modules/engine/models/EntityType.py (1)

8-8: Consider using ClassVar for class-level attribute

Since pydantic_type is a class-level attribute that does not represent an instance field, consider annotating it with ClassVar to clarify its intent.

Apply this change:

from typing import ClassVar

# ...

pydantic_type: ClassVar[str] = "EntityType"
cognee/modules/engine/models/Entity.py (1)

10-10: Consider using ClassVar for class-level attribute

Since pydantic_type is a class-level attribute, annotating it with ClassVar will make its purpose clearer and align with best practices.

Apply this change:

from typing import ClassVar

# ...

pydantic_type: ClassVar[str] = "Entity"
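
A trade-off applying to both ClassVar suggestions above: pydantic excludes ClassVar attributes from model fields, so a ClassVar pydantic_type would not appear in model_dump() output. If the attribute is relied on during serialization, as the data_models.py comment later in this review suggests, a regular field with a default may be the safer choice.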
cognee/modules/data/processing/document_types/Document.py (1)

14-15: Add docstring to document the new parameter.

The addition of the max_tokens parameter looks good. Consider adding a docstring to document its purpose and behavior.

     def read(self, chunk_size: int, chunker=str, max_tokens: Optional[int] = None) -> str:
+        """Read document content in chunks.
+        
+        Args:
+            chunk_size: Size of each chunk
+            chunker: Chunking strategy to use
+            max_tokens: Optional maximum number of tokens per chunk
+        
+        Returns:
+            Document content as string
+        """
         pass
cognee/tasks/documents/extract_chunks_from_documents.py (1)

6-15: Add docstring to document the new parameter.

The propagation of the max_tokens parameter to document.read() is implemented correctly. Consider adding a docstring to document the parameter's purpose and behavior.

 async def extract_chunks_from_documents(
     documents: list[Document],
     chunk_size: int = 1024,
     chunker="text_chunker",
     max_tokens: Optional[int] = None,
 ):
+    """Extract chunks from a list of documents.
+    
+    Args:
+        documents: List of documents to process
+        chunk_size: Size of each chunk
+        chunker: Chunking strategy to use
+        max_tokens: Optional maximum number of tokens per chunk
+    
+    Yields:
+        Document chunks
+    """
     for document in documents:
         for document_chunk in document.read(
             chunk_size=chunk_size, chunker=chunker, max_tokens=max_tokens
         ):
             yield document_chunk
cognee/modules/data/processing/document_types/PdfDocument.py (1)

12-12: Add docstring to document the new parameter.

The implementation looks good, but please add a docstring to document the purpose and usage of the max_tokens parameter.

     def read(self, chunk_size: int, chunker: str, max_tokens: Optional[int] = None):
+        """
+        Read and chunk the PDF document.
+        
+        Args:
+            chunk_size: Size of each chunk
+            chunker: Type of chunker to use
+            max_tokens: Optional maximum number of tokens to process
+        """
cognee/modules/data/processing/document_types/TextDocument.py (1)

10-10: Add docstring to document the new parameter.

The implementation looks good, but please add a docstring to document the purpose and usage of the max_tokens parameter.

     def read(self, chunk_size: int, chunker: str, max_tokens: Optional[int] = None):
+        """
+        Read and chunk the text document.
+        
+        Args:
+            chunk_size: Size of each chunk
+            chunker: Type of chunker to use
+            max_tokens: Optional maximum number of tokens to process
+        """
cognee/modules/data/processing/document_types/AudioDocument.py (2)

16-16: Add docstring to document the new parameter.

The implementation looks good, but please add a docstring to document the purpose and usage of the max_tokens parameter.

     def read(self, chunk_size: int, chunker: str, max_tokens: Optional[int] = None):
+        """
+        Read and chunk the audio document after transcription.
+        
+        Args:
+            chunk_size: Size of each chunk
+            chunker: Type of chunker to use
+            max_tokens: Optional maximum number of tokens to process
+        """

22-24: Simplify the lambda only if the chunker accepts an iterable.

Replacing the lambda with iter([text]) would change the get_text contract from a callable to an iterator, so it only works if the chunker is updated accordingly; otherwise the callable should stay:

        chunker = chunker_func(
            self, chunk_size=chunk_size, get_text=lambda: [text], max_tokens=max_tokens
        )
cognee/modules/data/processing/document_types/ImageDocument.py (2)

16-16: Add docstring for the read method.

Please add a docstring to document the new max_tokens parameter and its purpose.

    def read(self, chunk_size: int, chunker: str, max_tokens: Optional[int] = None):
+        """Read and chunk the image content.
+        
+        Args:
+            chunk_size: Size of each chunk
+            chunker: Type of chunker to use
+            max_tokens: Optional maximum number of tokens per chunk
+        
+        Yields:
+            Iterator of text chunks
+        """

21-23: Consider validating max_tokens parameter.

The max_tokens parameter is passed directly to the chunker without validation. Consider adding validation to ensure it's a positive integer when provided.

+        if max_tokens is not None and max_tokens <= 0:
+            raise ValueError("max_tokens must be a positive integer")
         chunker = chunker_func(
             self, chunk_size=chunk_size, get_text=lambda: [text], max_tokens=max_tokens
         )
cognee/modules/data/processing/document_types/UnstructuredDocument.py (2)

13-13: Add docstring and improve type hints.

Please add a docstring and improve type hints for better code maintainability.

-    def read(self, chunk_size: int, chunker: str, max_tokens: Optional[int] = None) -> str:
+    def read(self, chunk_size: int, chunker: str, max_tokens: Optional[int] = None) -> Iterator[str]:
+        """Read and chunk the document content using unstructured library.
+        
+        Args:
+            chunk_size: Size of each chunk
+            chunker: Type of chunker to use
+            max_tokens: Optional maximum number of tokens per chunk
+        
+        Yields:
+            Iterator of text chunks
+            
+        Raises:
+            UnstructuredLibraryImportError: If unstructured library is not installed
+        """

Line range hint 14-31: Add type hints to the get_text generator function.

Consider adding type hints to the inner generator function for better code maintainability.

-        def get_text():
+        def get_text() -> Iterator[str]:
cognee/shared/CodeGraphEntities.py (1)

8-8: Consider validating pydantic_type matches class name.

The pydantic_type values match their respective class names, but there's no validation to ensure this remains true if class names change. Consider adding a validation in the base class or using a class decorator.

def validate_pydantic_type(cls):
    """Decorator to validate that the pydantic_type default matches the class name."""
    # Pydantic models don't expose field defaults as plain class attributes,
    # so read the default from the field metadata (pydantic v2 shown).
    field = cls.model_fields.get("pydantic_type")
    if field is None or field.default != cls.__name__:
        raise ValueError(f"pydantic_type must match class name: {cls.__name__}")
    return cls

@validate_pydantic_type
class Repository(DataPoint):
    # ... rest of the class definition

Also applies to: 15-15, 27-27, 36-36

examples/python/code_graph_example.py (2)

33-33: Consider making log level configurable.

The logging level is hardcoded to ERROR. Consider making it configurable via command line arguments.

+    parser.add_argument(
+        "--log-level",
+        type=str,
+        choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"],
+        default="ERROR",
+        help="Set the logging level",
+    )
     args = parse_args()
-    setup_logging(logging.ERROR)
+    setup_logging(getattr(logging, args.log_level.upper()))

43-45: Improve timing output format.

Consider using a more structured format for the timing output. (A per-file average is omitted here because repo_path is a single path string, not a list of files.)

-        print("\n" + "=" * 50)
-        print(f"Pipeline Execution Time: {end_time - start_time:.2f} seconds")
-        print("=" * 50 + "\n")
+        execution_time = end_time - start_time
+        print("\nPipeline Execution Summary")
+        print("=" * 50)
+        print(f"Total time: {execution_time:.2f} seconds")
+        print("=" * 50)
cognee/tasks/repo_processor/expand_dependency_graph.py (1)

Line range hint 13-36: LGTM! Well-structured implementation with good error handling.

The function properly handles empty code parts and includes appropriate logging.

Remove commented-out code.

The function contains commented-out code that should be removed if it is no longer needed:

-                # part_of = code_file,
-                # graph.add_node(part_node_id, source_code=code_part, node_type=part_type)
-                # graph.add_edge(parent_node_id, part_node_id, relation="contains")

Document side effect in function docstring.

The function modifies the input code_file parameter by extending its contains list. This side effect should be documented in the function's docstring.

-    """Add code part nodes and edges for a specific part type."""
+    """Add code part nodes and edges for a specific part type.
+    
+    Args:
+        code_file: The CodeFile object to be modified. Its contains list will be extended.
+        part_type: The type of code parts being added.
+        code_parts: List of code parts to process.
+    """
cognee/tasks/chunks/chunk_by_paragraph.py (1)

30-32: Simplify model name extraction.

The current approach of splitting and taking the last part could be simplified.

-embedding_model = vector_engine.embedding_engine.model
-embedding_model = embedding_model.split("/")[-1]
+embedding_model = vector_engine.embedding_engine.model.rpartition("/")[-1]
cognee/modules/chunking/TextChunker.py (1)

25-28: Consider adding docstring to the new method.

The check_word_count_and_token_count method would benefit from documentation explaining its purpose and parameters.

 def check_word_count_and_token_count(self, word_count_before, token_count_before, chunk_data):
+    """
+    Check if adding a new chunk would exceed word count or token count limits.
+    
+    Args:
+        word_count_before (int): Current word count before adding the chunk
+        token_count_before (int): Current token count before adding the chunk
+        chunk_data (dict): Dictionary containing the chunk's word_count and token_count
+    
+    Returns:
+        bool: True if adding the chunk won't exceed limits, False otherwise
+    """
     word_count_fits = word_count_before + chunk_data["word_count"] <= self.max_chunk_size
     token_count_fits = token_count_before + chunk_data["token_count"] <= self.max_tokens
     return word_count_fits and token_count_fits
evals/eval_swe_bench.py (1)

91-100: Consider parallel processing for better performance.

The sequential processing of predictions might be slower for large datasets. Consider using asyncio.gather to process predictions in parallel.

-    preds = []
-    for instance in dataset:
-        instance_id = instance["instance_id"]
-        model_patch = await pred_func(instance)
-        preds.append(
-            {
-                "instance_id": instance_id,
-                "model_patch": model_patch,
-                "model_name_or_path": model_name,
-            }
-        )
+    tasks = [
+        pred_func(instance) for instance in dataset
+    ]
+    model_patches = await asyncio.gather(*tasks)
+    preds = [
+        {
+            "instance_id": instance["instance_id"],
+            "model_patch": model_patch,
+            "model_name_or_path": model_name,
+        }
+        for instance, model_patch in zip(dataset, model_patches)
+    ]
cognee/infrastructure/llm/prompts/patch_gen_kg_instructions.txt (1)

1-6: Enhance clarity and grammar in the instructions.

Consider these improvements:

  1. Add a comma before "and" in line 3: "with correct syntax, and"
  2. Replace "make sure" with "ensure" for stronger wording
-You are a senior software engineer. I need you to solve this issue by looking at the provided context and
-generate a single patch file that I can apply directly to this repository using git apply.
-Additionally, please make sure that you provide code only with correct syntax and
-you apply the patch on the relevant files (together with their path that you can try to find out from the github issue). Don't change the names of existing
-functions or classes, as they may be referenced from other code.
-Please respond only with a single patch file in the following format without adding any additional context or string.
+You are a senior software engineer. I need you to solve this issue by looking at the provided context and
+generate a single patch file that I can apply directly to this repository using git apply.
+Additionally, please ensure that you provide code only with correct syntax, and
+you apply the patch on the relevant files (together with their path that you can try to find out from the github issue). Don't change the names of existing
+functions or classes, as they may be referenced from other code.
+Please respond only with a single patch file in the following format without adding any additional context or string.
🧰 Tools
🪛 LanguageTool

[style] ~3-~3: Consider using a different verb to strengthen your wording.
Context: ...y using git apply. Additionally, please make sure that you provide code only with correct...

(MAKE_SURE_ENSURE)


[uncategorized] ~3-~3: Use a comma before ‘and’ if it connects two independent clauses (unless they are closely connected and short).
Context: ...ou provide code only with correct syntax and you apply the patch on the relevant fil...

(COMMA_COMPOUND_SENTENCE)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 64e803b and 16155f0.

📒 Files selected for processing (29)
  • cognee/api/v1/cognify/code_graph_pipeline.py (2 hunks)
  • cognee/infrastructure/llm/prompts/patch_gen_kg_instructions.txt (1 hunks)
  • cognee/modules/chunking/TextChunker.py (2 hunks)
  • cognee/modules/chunking/models/DocumentChunk.py (1 hunks)
  • cognee/modules/cognify/config.py (1 hunks)
  • cognee/modules/data/processing/document_types/AudioDocument.py (2 hunks)
  • cognee/modules/data/processing/document_types/Document.py (2 hunks)
  • cognee/modules/data/processing/document_types/ImageDocument.py (2 hunks)
  • cognee/modules/data/processing/document_types/PdfDocument.py (2 hunks)
  • cognee/modules/data/processing/document_types/TextDocument.py (2 hunks)
  • cognee/modules/data/processing/document_types/UnstructuredDocument.py (2 hunks)
  • cognee/modules/engine/models/Entity.py (1 hunks)
  • cognee/modules/engine/models/EntityType.py (1 hunks)
  • cognee/modules/retrieval/description_to_codepart_search.py (5 hunks)
  • cognee/shared/CodeGraphEntities.py (3 hunks)
  • cognee/shared/data_models.py (3 hunks)
  • cognee/tasks/chunks/chunk_by_paragraph.py (5 hunks)
  • cognee/tasks/documents/extract_chunks_from_documents.py (1 hunks)
  • cognee/tasks/repo_processor/expand_dependency_graph.py (1 hunks)
  • cognee/tasks/repo_processor/extract_code_parts.py (0 hunks)
  • cognee/tasks/repo_processor/get_local_dependencies.py (0 hunks)
  • cognee/tasks/repo_processor/get_non_code_files.py (1 hunks)
  • cognee/tasks/repo_processor/get_source_code_chunks.py (4 hunks)
  • cognee/tasks/summarization/models.py (1 hunks)
  • cognee/tests/integration/documents/UnstructuredDocument_test.py (2 hunks)
  • cognee/tests/unit/processing/chunks/chunk_by_paragraph_2_test.py (2 hunks)
  • cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py (1 hunks)
  • evals/eval_swe_bench.py (5 hunks)
  • examples/python/code_graph_example.py (2 hunks)
💤 Files with no reviewable changes (2)
  • cognee/tasks/repo_processor/extract_code_parts.py
  • cognee/tasks/repo_processor/get_local_dependencies.py
🧰 Additional context used
📓 Learnings (2)
cognee/tests/unit/processing/chunks/chunk_by_paragraph_2_test.py (1)
Learnt from: 0xideas
PR: topoteretes/cognee#205
File: cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py:7-7
Timestamp: 2024-11-13T14:55:05.912Z
Learning: When changes are made to the chunking implementation in `cognee/tasks/chunks`, the ground truth values in the corresponding tests in `cognee/tests/unit/processing/chunks` need to be updated accordingly.
cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py (1)
Learnt from: 0xideas
PR: topoteretes/cognee#205
File: cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py:7-7
Timestamp: 2024-11-13T14:55:05.912Z
Learning: When changes are made to the chunking implementation in `cognee/tasks/chunks`, the ground truth values in the corresponding tests in `cognee/tests/unit/processing/chunks` need to be updated accordingly.
🪛 LanguageTool
cognee/infrastructure/llm/prompts/patch_gen_kg_instructions.txt

[style] ~3-~3: Consider using a different verb to strengthen your wording.
Context: ...y using git apply. Additionally, please make sure that you provide code only with correct...

(MAKE_SURE_ENSURE)


[uncategorized] ~3-~3: Use a comma before ‘and’ if it connects two independent clauses (unless they are closely connected and short).
Context: ...ou provide code only with correct syntax and you apply the patch on the relevant fil...

(COMMA_COMPOUND_SENTENCE)

🔇 Additional comments (13)
cognee/modules/chunking/models/DocumentChunk.py (1)

15-15: LGTM!

The addition of the pydantic_type attribute is consistent with other model classes and follows the established pattern.

cognee/tasks/summarization/models.py (1)

20-20: LGTM!

The addition of the pydantic_type attribute is consistent with other model classes and follows the established pattern.

cognee/tests/unit/processing/chunks/chunk_by_paragraph_2_test.py (1)

30-34: LGTM! Good improvement in code readability.

Converting positional arguments to keyword arguments makes the code more maintainable and self-documenting.

Also applies to: 49-51

cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py (2)

52-54: LGTM! Consistent with other test file changes.

Converting to keyword arguments improves code clarity.


Line range hint 3-41: Verify ground truth values.

Based on the retrieved learning, ground truth values need to be verified when chunking implementation changes. Please ensure the GROUND_TRUTH dictionary values are still valid with the updated chunking implementation.

✅ Verification successful

Ground truth values are still valid

The chunking implementation maintains the same behavior regarding paragraph boundaries, sentence preservation, and cut types. The test's ground truth values correctly reflect the current implementation's output format and behavior.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if the chunking implementation has changed
# Look for changes in the chunk_by_paragraph implementation
rg -A 5 "def chunk_by_paragraph" cognee/tasks/chunks/

Length of output: 498


Script:

#!/bin/bash
# Get the full implementation of chunk_by_paragraph
ast-grep --pattern 'def chunk_by_paragraph($$$) {
  $$$
}'

# Look for related utility functions that might affect chunking
rg -A 5 "def (get_word_count|get_cut_type|split_paragraph)" cognee/tasks/chunks/

Length of output: 143


Script:

#!/bin/bash
# Get the chunk_by_paragraph implementation with more context
rg -A 20 "^def chunk_by_paragraph" cognee/tasks/chunks/

# Look for related imports and functions
rg "from|import|def" cognee/tasks/chunks/chunk_by_paragraph.py

Length of output: 1941


Script:

#!/bin/bash
# Get more context from the chunk_by_paragraph implementation
rg -A 50 "^def chunk_by_paragraph" cognee/tasks/chunks/

# Look for chunk_by_sentence implementation
rg -A 20 "^def chunk_by_sentence" cognee/tasks/chunks/

Length of output: 5696

cognee/api/v1/cognify/code_graph_pipeline.py (2)

74-74: LGTM! Good addition of token limit control.

Adding max_tokens parameter to extract_chunks_from_documents task provides better control over chunk sizes.


63-63: Verify impact of removing embedding_model parameter.

The embedding_model parameter has been removed from the get_source_code_chunks task. Please ensure this doesn't affect the chunk generation functionality.

✅ Verification successful

Removal of embedding_model parameter is correct

The parameter was unused in the main function, and the embedding model configuration is properly handled by the helper function get_source_code_chunks_from_code_part with a default value of "text-embedding-3-large".

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if get_source_code_chunks uses embedding_model internally
ast-grep --pattern 'def get_source_code_chunks($$$) { $$$ embedding_model $$$ }'

Length of output: 82


Script:

#!/bin/bash
# Find the implementation of get_source_code_chunks
ast-grep --pattern 'def get_source_code_chunks'

# Backup search using ripgrep for better context
rg "def get_source_code_chunks" -A 10

Length of output: 2221

cognee/tests/integration/documents/UnstructuredDocument_test.py (1)

Line range hint 71-103: Add test cases for the new max_tokens parameter.

The test cases verify word counts and text lengths but don't cover the new max_tokens parameter functionality. Consider adding test cases that:

  1. Verify chunking behavior with different max_tokens values
  2. Test edge cases (e.g., very small token limits)
  3. Validate the interaction between chunk_size and max_tokens parameters
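
A hedged sketch of one such test, assuming a txt_document fixture and a token_count attribute on emitted chunks (both hypothetical here):

import pytest

@pytest.mark.parametrize("max_tokens", [8, 64, 512])
def test_read_respects_max_tokens(txt_document, max_tokens):
    # Every emitted chunk should stay within the configured token budget.
    chunks = list(
        txt_document.read(chunk_size=1024, chunker="text_chunker", max_tokens=max_tokens)
    )
    assert all(chunk.token_count <= max_tokens for chunk in chunks)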
evals/eval_swe_bench.py (1)

Line range hint 130-148: Uncomment and parameterize evaluation code.

The evaluation code is commented out but seems important for the feature. Consider:

  1. Uncommenting the code
  2. Making the evaluation optional via command-line argument
  3. Adding error handling for the subprocess call
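
A sketch of an opt-in evaluation step along those lines; the module path follows SWE-bench's published harness entry point, and the exact flags are assumptions pending the commented-out code:

import subprocess

def run_evaluation(predictions_path: str, run_id: str) -> None:
    # Surface harness failures instead of continuing silently.
    cmd = [
        "python", "-m", "swebench.harness.run_evaluation",
        "--predictions_path", predictions_path,
        "--run_id", run_id,
    ]
    try:
        subprocess.run(cmd, check=True)
    except subprocess.CalledProcessError as exc:
        raise RuntimeError(f"SWE-bench evaluation failed (exit {exc.returncode})") from exc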
cognee/tasks/repo_processor/get_source_code_chunks.py (2)

130-132: LGTM! Good refactoring to use vector_engine.

The change to retrieve the embedding model from vector_engine centralizes the configuration and makes the code more maintainable.
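
For reference, the centralized lookup looks roughly like this (the get_vector_engine import path is an assumption based on the project layout; the embedding_engine.model attribute is taken from this review):

from cognee.infrastructure.databases.vector import get_vector_engine

vector_engine = get_vector_engine()
# Single source of truth for the embedding model name.
embedding_model = vector_engine.embedding_engine.model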


157-157: Verify the impact of removing the embedding_model parameter.

The removal of the embedding_model parameter might affect existing callers of these functions.

Also applies to: 172-172

✅ Verification successful

The removal of embedding_model parameter appears safe

The codebase analysis shows no usage of the embedding_model parameter in any calls to get_source_code_chunks. The function is only used as a Task in the pipeline system and all existing callers are compatible with the new signature.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for calls to get_source_code_chunks with embedding_model parameter
rg -A 2 "get_source_code_chunks\([^)]*embedding_model" 

Length of output: 56


Script:

#!/bin/bash
# Find all occurrences of get_source_code_chunks
echo "=== All usages of get_source_code_chunks ==="
rg "get_source_code_chunks" -A 2

# Find function definition
echo -e "\n=== Function definition ==="
ast-grep --pattern 'def get_source_code_chunks($$$)'

# Check test files specifically
echo -e "\n=== Test file usages ==="
fd "test.*\.py" --exec rg "get_source_code_chunks" -A 2 {}

Length of output: 1946

cognee/shared/data_models.py (1)

234-234: LGTM! Good addition of type identification.

Adding pydantic_type to the models improves type safety during serialization/deserialization. The consistent naming pattern across all models is appreciated.

Also applies to: 243-243, 251-251, 262-262

cognee/tasks/repo_processor/get_non_code_files.py (1)

127-130: LGTM! Clean implementation of should_process.

The function effectively combines the ALLOWED_EXTENSIONS check with IGNORED_PATTERNS filtering.
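
A hedged reconstruction of the shape being praised (treating IGNORED_PATTERNS as compiled regexes is an assumption):

import os
import re

IGNORED_PATTERNS = [re.compile(r"node_modules"), re.compile(r"\.git/")]  # illustrative

def should_process(path: str) -> bool:
    # Accept files whose extension is allowed and whose path matches
    # none of the ignored patterns.
    _, extension = os.path.splitext(path)
    return extension in ALLOWED_EXTENSIONS and not any(
        pattern.search(path) for pattern in IGNORED_PATTERNS
    )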

Comment on lines +81 to +83
system_prompt="You are a Senior Software Engineer, summarize the context from documents"
f" in a way that it is gonna be provided next to codeparts as context"
f" while trying to solve this github issue connected to the project: {query}]",

⚠️ Potential issue

Fix the syntax error and improve the system prompt

There is an unmatched closing bracket ] at the end of the system_prompt string, which will cause a syntax error. Additionally, consider rephrasing the prompt for clarity and grammatical correctness.

Apply this diff to fix the syntax error:

-                    f" while trying to solve this github issue connected to the project: {query}]",
+                    f" while trying to solve this GitHub issue connected to the project: {query}",

Revised system_prompt for clarity:

system_prompt = (
    "You are a Senior Software Engineer. "
    "Summarize the context from the documents so it can be provided "
    "as context alongside code parts while trying to solve "
    f"the following GitHub issue connected to the project: {query}"
)

Comment on lines 29 to 31
async def code_description_to_code_part(
query: str, user: User, top_k: int, include_docs: bool
) -> List[str]:

⚠️ Potential issue

Ensure the return type matches the function's annotation

The function code_description_to_code_part is annotated to return List[str], but it currently returns a single string context. This mismatch can lead to errors for callers expecting a list. Please adjust the return type or modify the implementation to ensure consistency.

Modify the implementation to return a list of code pieces:

    code_pieces = [code_piece.get_attribute("source_code") for code_piece in code_pieces_to_return]
    if include_docs:
        code_pieces.insert(0, context_from_documents)
    return code_pieces

Also applies to: 130-137



class CognifyConfig(BaseSettings):
classification_model: object = DefaultContentPrediction
summarization_model: object = SummarizedContent

max_tokens: Optional[int] = os.getenv("MAX_TOKENS")

⚠️ Potential issue

Convert MAX_TOKENS environment variable to int.

The environment variable needs to be converted to an integer when it's present.

-    max_tokens: Optional[int] = os.getenv("MAX_TOKENS")
+    max_tokens: Optional[int] = int(os.getenv("MAX_TOKENS")) if os.getenv("MAX_TOKENS") is not None else None
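
Alternatively, since CognifyConfig subclasses BaseSettings, pydantic-settings can read and coerce the variable itself (a sketch assuming pydantic v2 defaults and no env_prefix):

from typing import Optional

from pydantic_settings import BaseSettings

class CognifyConfig(BaseSettings):
    # pydantic-settings matches the MAX_TOKENS environment variable to this
    # field (case-insensitively) and coerces the string to int on load.
    max_tokens: Optional[int] = None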

Comment on lines 39 to 41
tokenizer = tiktoken.encoding_for_model(embedding_model)
token_count = len(tokenizer.encode(sentence))


🛠️ Refactor suggestion

Move tokenizer initialization outside the loop for better performance.

Creating a new tokenizer instance for each sentence is inefficient. Consider initializing it once before the loop.

+tokenizer = tiktoken.encoding_for_model(embedding_model)
 for paragraph_id, sentence, word_count, end_type in chunk_by_sentence(
     data, maximum_length=paragraph_length
 ):
-    tokenizer = tiktoken.encoding_for_model(embedding_model)
     token_count = len(tokenizer.encode(sentence))

Committable suggestion skipped: line range outside the PR's diff.
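
If chunking runs repeatedly across many files, the tokenizer could also be cached per model name (a sketch using functools, not part of this PR):

from functools import lru_cache

import tiktoken

@lru_cache(maxsize=None)
def get_tokenizer(model_name: str):
    # encoding_for_model is comparatively expensive; keep one encoder per model.
    return tiktoken.encoding_for_model(model_name)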

Comment on lines +41 to +42
repo_path = "/Users/laszlohajdu/Documents/GitHub/graph_rag/"
async for result in run_code_graph_pipeline(repo_path, include_docs=include_docs):

⚠️ Potential issue

Remove hardcoded repository path.

The hardcoded repository path /Users/laszlohajdu/Documents/GitHub/graph_rag/ should be parameterized to make the code more portable and configurable.

-repo_path = "/Users/laszlohajdu/Documents/GitHub/graph_rag/"
+repo_path = os.getenv("REPO_PATH") or os.path.join(os.path.dirname(__file__), "..", "graph_rag")

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines 32 to 124
ALLOWED_EXTENSIONS = {
".txt",
".md",
".csv",
".json",
".xml",
".yaml",
".yml",
".html",
".css",
".js",
".ts",
".jsx",
".tsx",
".sql",
".log",
".ini",
".toml",
".properties",
".sh",
".bash",
".dockerfile",
".gitignore",
".gitattributes",
".makefile",
".pyproject",
".requirements",
".env",
".pdf",
".doc",
".docx",
".dot",
".dotx",
".rtf",
".wps",
".wpd",
".odt",
".ott",
".ottx",
".txt",
".wp",
".sdw",
".sdx",
".docm",
".dotm",
# Additional extensions for other programming languages
".java",
".c",
".cpp",
".h",
".cs",
".go",
".php",
".rb",
".swift",
".pl",
".lua",
".rs",
".scala",
".kt",
".sh",
".sql",
".v",
".asm",
".pas",
".d",
".ml",
".clj",
".cljs",
".erl",
".ex",
".exs",
".f",
".fs",
".r",
".pyi",
".pdb",
".ipynb",
".rmd",
".cabal",
".hs",
".nim",
".vhdl",
".verilog",
".svelte",
".html",
".css",
".scss",
".less",
".json5",
".yaml",
".yml",
}

🛠️ Refactor suggestion

Remove duplicate extensions and organize the list.

The ALLOWED_EXTENSIONS set contains several duplicate entries:

  • .txt (lines 33, 71)
  • .html (lines 40, 117)
  • .css (lines 41, 118)
  • .yaml (lines 38, 122)
  • .yml (lines 39, 123)
  • .sql (lines 46, 93)
  • .sh (lines 51, 92)

Consider organizing the extensions by categories (documentation, source code, config files, etc.) using comments.
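
One possible organization, sketched with illustrative category membership (not a complete mapping of the original set):

# Building the set from named category subsets makes duplicates harmless
# and the purpose of each entry visible.
DOCUMENT_EXTENSIONS = {".txt", ".md", ".pdf", ".doc", ".docx", ".rtf", ".odt"}
CONFIG_EXTENSIONS = {".json", ".yaml", ".yml", ".toml", ".ini", ".properties", ".env"}
WEB_EXTENSIONS = {".html", ".css", ".scss", ".less", ".js", ".ts", ".jsx", ".tsx"}
LANGUAGE_EXTENSIONS = {".java", ".c", ".cpp", ".h", ".cs", ".go", ".php", ".rb", ".rs"}

ALLOWED_EXTENSIONS = (
    DOCUMENT_EXTENSIONS | CONFIG_EXTENSIONS | WEB_EXTENSIONS | LANGUAGE_EXTENSIONS
)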

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🔭 Outside diff range comments (2)
cognee/modules/retrieval/description_to_codepart_search.py (2)

Line range hint 29-44: Update docstring to match implementation

The docstring indicates a return type of Set[str] but the function returns a single concatenated string. Update the docstring to reflect the actual return type.

-    Returns:
-        Set[str]: A set of unique code parts matching the query.
+    Returns:
+        str: A concatenated string of code parts and optional document context.
🧰 Tools
🪛 GitHub Actions: ruff format

[warning] File requires formatting with Ruff formatter


Line range hint 1-142: Fix formatting issues

The file requires formatting with Ruff formatter according to the pipeline failures.

Run the following command to fix formatting:

ruff format cognee/modules/retrieval/description_to_codepart_search.py
🧰 Tools
🪛 GitHub Actions: ruff format

[warning] File requires formatting with Ruff formatter

♻️ Duplicate comments (1)
cognee/modules/retrieval/description_to_codepart_search.py (1)

80-82: ⚠️ Potential issue

Fix the syntax error in the system prompt

There is an unmatched closing bracket ] at the end of the system_prompt string.

🧰 Tools
🪛 GitHub Actions: ruff format

[warning] File requires formatting with Ruff formatter

🧹 Nitpick comments (1)
cognee/modules/retrieval/description_to_codepart_search.py (1)

117-126: Refactor node traversal logic for better maintainability

The nested conditional blocks for different pydantic_type values make the code hard to maintain. Consider extracting this logic into separate methods.

+    def _process_source_code_chunk(code_file, code_pieces):
+        for edge in code_file.get_skeleton_edges():
+            if edge.get_attribute("relationship_name") == "code_chunk_of":
+                code_pieces.add(edge.get_destination_node())
+
+    def _process_code_file(code_file, code_pieces):
+        for edge in code_file.get_skeleton_edges():
+            if edge.get_attribute("relationship_name") == "contains":
+                code_pieces.add(edge.get_destination_node())
+
     for node in code_summaries:
         # ... existing code ...
         for code_file in node_to_search_from.get_skeleton_neighbours():
-            if code_file.get_attribute("pydantic_type") == "SourceCodeChunk":
-                for code_file_edge in code_file.get_skeleton_edges():
-                    if code_file_edge.get_attribute("relationship_name") == "code_chunk_of":
-                        code_pieces_to_return.add(code_file_edge.get_destination_node())
-            elif code_file.get_attribute("pydantic_type") == "CodePart":
+            pydantic_type = code_file.get_attribute("pydantic_type")
+            if pydantic_type == "SourceCodeChunk":
+                _process_source_code_chunk(code_file, code_pieces_to_return)
+            elif pydantic_type == "CodePart":
                 code_pieces_to_return.add(code_file)
-            elif code_file.get_attribute("pydantic_type") == "CodeFile":
-                for code_file_edge in code_file.get_skeleton_edges():
-                    if code_file_edge.get_attribute("relationship_name") == "contains":
-                        code_pieces_to_return.add(code_file_edge.get_destination_node())
+            elif pydantic_type == "CodeFile":
+                _process_code_file(code_file, code_pieces_to_return)
+            else:
+                logging.warning("Unknown pydantic_type: %s", pydantic_type)
🧰 Tools
🪛 GitHub Actions: ruff format

[warning] File requires formatting with Ruff formatter

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 16155f0 and c163e35.

📒 Files selected for processing (1)
  • cognee/modules/retrieval/description_to_codepart_search.py (4 hunks)
🧰 Additional context used
🪛 GitHub Actions: ruff format
cognee/modules/retrieval/description_to_codepart_search.py

[warning] File requires formatting with Ruff formatter

Comment on lines +66 to +74
if include_docs:
search_results = await search(SearchType.INSIGHTS, query_text=query)

concatenated_descriptions = " ".join(
obj["description"]
for tpl in search_results
for obj in tpl
if isinstance(obj, dict) and "description" in obj
)

🛠️ Refactor suggestion

Add error handling for search results

The search results processing lacks error handling and validation. Consider adding checks for empty results and proper error handling.

         if include_docs:
             search_results = await search(SearchType.INSIGHTS, query_text=query)
+            if not search_results:
+                logging.warning("No document insights found for query: '%s'", query)
+                context_from_documents = ""
+                return context_from_documents

             concatenated_descriptions = " ".join(
                 obj["description"]
                 for tpl in search_results
                 for obj in tpl
                 if isinstance(obj, dict) and "description" in obj
             )

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 GitHub Actions: ruff format

[warning] File requires formatting with Ruff formatter

Comment on lines +135 to +142
context = ""
for code_piece in code_pieces_to_return:
context = context + code_piece.get_attribute("source_code")

if include_docs:
context = context_from_documents + context

return context

🛠️ Refactor suggestion

Improve code concatenation for better memory efficiency and readability

The current string concatenation approach could be inefficient for large code bases and lacks separation between different code pieces.

-        context = ""
-        for code_piece in code_pieces_to_return:
-            context = context + code_piece.get_attribute("source_code")
+        code_pieces_content = [
+            code_piece.get_attribute("source_code")
+            for code_piece in code_pieces_to_return
+        ]
+        
+        # Add separator between code pieces for better readability
+        context = "\n\n---\n\n".join(code_pieces_content)

         if include_docs:
-            context = context_from_documents + context
+            context = f"{context_from_documents}\n\n---\n\n{context}"
🧰 Tools
🪛 GitHub Actions: ruff format

[warning] File requires formatting with Ruff formatter

@Vasilije1990 Vasilije1990 merged commit f694ca2 into dev Jan 10, 2025
25 of 26 checks passed
@Vasilije1990 Vasilije1990 deleted the feature/cog-971-preparing-swe-bench-run branch January 10, 2025 13:59
@coderabbitai coderabbitai bot mentioned this pull request Jan 10, 2025