Structured code summarization #375

alekszievr · 2024-12-16T15:10:41Z

Summary by CodeRabbit

New Features
- Introduced a new function for running the code graph pipeline.
- Added a new file with instructions for summarizing Python code.
- Added a new function to summarize code content asynchronously.
- Introduced new data model classes for summarized code components.
Bug Fixes
- Improved error handling in the pipeline execution process.
Documentation
- New instructions for summarizing Python code to aid technical writers and programmers.
Refactor
- Updated function signatures and removed deprecated parameters for better usability.

coderabbitai · 2024-12-16T15:10:49Z

Walkthrough

This pull request modifies the code summarization and code graph pipeline infrastructure. Key changes include reorganized import statements, the deprecation of the run_pipeline function in favor of run_code_graph_pipeline, and new classes for structured code summaries. The summarize_code function has been refactored into an async generator that yields its results. A new prompt file, summarize_code.txt, provides the instructions for summarizing Python code.

Changes

File	Change Summary
`cognee/api/v1/cognify/code_graph_pipeline.py`	Deprecated `run_pipeline` function; introduced `run_code_graph_pipeline` function; updated import statements; modified `summarize_code` task signature
`cognee/infrastructure/llm/prompts/summarize_code.txt`	New file with instructions for summarizing Python code
`cognee/modules/data/extraction/extract_summary.py`	Added `extract_code_summary` async function; updated import statements
`cognee/shared/data_models.py`	Added `SummarizedFunction` class; added `SummarizedClass` class; added `SummarizedCode` class
`cognee/tasks/summarization/summarize_code.py`	Updated function signature; changed return type to async generator; modified filtering and processing logic

Suggested reviewers

lxobr
hajdul88
Vasilije1990

lxobr · 2024-12-16T18:44:01Z

cognee/modules/data/extraction/extract_summary.py

+async def extract_code_summary(content: str, response_model: Type[BaseModel]):
+    llm_client = get_llm_client()
+
+    system_prompt = read_query_prompt("summarize_code.txt")


I can't seem to find this file, can you please double check that it is committed?

ah right, file added now!

lxobr · 2024-12-16T19:04:45Z

cognee/api/v1/cognify/code_graph_pipeline.py

@@ -129,7 +136,7 @@ async def run_code_graph_pipeline(repo_path):
        Task(get_repo_file_dependencies),
        Task(enrich_dependency_graph, task_config={"batch_size": 50}),
        Task(expand_dependency_graph, task_config={"batch_size": 50}),
-        Task(summarize_code, summarization_model=SummarizedContent, task_config={"batch_size": 50}),
+        Task(summarize_code, summarization_model=SummarizedCode, task_config={"batch_size": 50}),


Should the summarize_code function have a summarization_model argument (as it seems to always be called with the SummarizedCode pydantic model)?

Fair point, I can't think of a reason to keep it, so I'll remove it.

…ove-summarization

…inting

coderabbitai

Actionable comments posted: 4

🔭 Outside diff range comments (3)

cognee/modules/data/extraction/extract_summary.py (1)

Line range hint 11-26: Add error handling for LLM client failures

Both functions should handle potential LLM client failures gracefully to prevent pipeline disruption.

 async def extract_summary(content: str, response_model: Type[BaseModel]):
     llm_client = get_llm_client()
     system_prompt = read_query_prompt("summarize_content.txt")
-    llm_output = await llm_client.acreate_structured_output(content, system_prompt, response_model)
-    return llm_output
+    try:
+        llm_output = await llm_client.acreate_structured_output(content, system_prompt, response_model)
+        return llm_output
+    except Exception as e:
+        logger.error(f"Failed to generate summary: {str(e)}")
+        raise
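
The suggested hunk references a `logger` that it does not define. A minimal, self-contained sketch of the same log-and-re-raise pattern is given below. `FailingClient`, `summarize_with_logging`, and `demo` are hypothetical names used only for illustration; they are not cognee APIs, and the real client may raise different exception types.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


class FailingClient:
    """Hypothetical stand-in for the LLM client; it always fails."""

    async def acreate_structured_output(self, content, system_prompt, response_model):
        raise RuntimeError("LLM backend unavailable")


async def summarize_with_logging(client, content: str, system_prompt: str, response_model):
    """Log the failure, then re-raise so the caller decides how to recover."""
    try:
        return await client.acreate_structured_output(content, system_prompt, response_model)
    except Exception:
        # logger.exception logs at ERROR level and appends the traceback.
        logger.exception("Failed to generate summary")
        raise


def demo() -> str:
    """Run the failing call and report the error message that propagated."""
    try:
        asyncio.run(summarize_with_logging(FailingClient(), "x = 1", "Summarize.", dict))
    except RuntimeError as error:
        return str(error)
    return "no error"
```

Because the exception is re-raised, the pipeline still stops on failure; the only gain is that the failure is recorded before it propagates.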

cognee/api/v1/cognify/code_graph_pipeline.py (2)

Line range hint 1-3: Use standard Python deprecation warning

Replace the comment-based deprecation notice with Python's built-in deprecation warning.

-# NOTICE: This module contains deprecated functions.
-# Use only the run_code_graph_pipeline function; all other functions are deprecated.
-# Related issue: COG-906
+import warnings
+
+def deprecated(func):
+    def wrapper(*args, **kwargs):
+        warnings.warn(
+            f"{func.__name__} is deprecated. Use run_code_graph_pipeline instead.",
+            DeprecationWarning,
+            stacklevel=2
+        )
+        return func(*args, **kwargs)
+    return wrapper
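
As a possible refinement of the decorator above, the sketch below adds `functools.wraps` so the wrapped function keeps its name and docstring, and takes the replacement name as a parameter. `run_pipeline_stub` is a hypothetical placeholder, not the real deprecated function.

```python
import functools
import warnings


def deprecated(replacement: str):
    """Return a decorator that emits a DeprecationWarning on each call."""

    def decorator(func):
        @functools.wraps(func)  # preserve __name__ and __doc__ of the wrapped function
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__name__} is deprecated. Use {replacement} instead.",
                DeprecationWarning,
                stacklevel=2,  # attribute the warning to the caller, not the wrapper
            )
            return func(*args, **kwargs)

        return wrapper

    return decorator


@deprecated("run_code_graph_pipeline")
def run_pipeline_stub():
    """Hypothetical placeholder for a deprecated entry point."""
    return "ran"


# Capture warnings so the example can show what callers would see.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = run_pipeline_stub()
```

For an `async def` function the wrapper returns the coroutine unchanged, so the warning is emitted when the function is called rather than when it is awaited.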

Line range hint 127-134: Avoid hardcoded paths in run_code_graph_pipeline

The function uses hardcoded paths for data and system directories. Consider making these configurable.

-    data_directory_path = str(pathlib.Path(os.path.join(file_path, ".data_storage/code_graph")).resolve())
-    cognee.config.data_root_directory(data_directory_path)
-    cognee_directory_path = str(pathlib.Path(os.path.join(file_path, ".cognee_system/code_graph")).resolve())
-    cognee.config.system_root_directory(cognee_directory_path)
+    data_directory_path = os.getenv('COGNEE_DATA_DIR', os.path.join(file_path, ".data_storage/code_graph"))
+    cognee.config.data_root_directory(str(pathlib.Path(data_directory_path).resolve()))
+    system_directory_path = os.getenv('COGNEE_SYSTEM_DIR', os.path.join(file_path, ".cognee_system/code_graph"))
+    cognee.config.system_root_directory(str(pathlib.Path(system_directory_path).resolve()))
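
The fallback logic in this suggestion can be isolated into a small helper, sketched below. `resolve_storage_dir` is a hypothetical name, and `COGNEE_DATA_DIR` is the variable name proposed in the diff above rather than a setting the project is known to read.

```python
import os
import pathlib


def resolve_storage_dir(env_var: str, base_dir: str, default_subdir: str) -> str:
    """Prefer the environment variable; otherwise use base_dir/default_subdir."""
    configured = os.getenv(env_var)
    # An empty value is treated the same as an unset variable.
    target = pathlib.Path(configured) if configured else pathlib.Path(base_dir) / default_subdir
    return str(target.resolve())


base = os.getcwd()

# Without the variable set, the default location under base is used.
os.environ.pop("COGNEE_DATA_DIR", None)
default_path = resolve_storage_dir("COGNEE_DATA_DIR", base, ".data_storage/code_graph")

# With the variable set, it takes precedence.
os.environ["COGNEE_DATA_DIR"] = os.path.join(base, "custom_data")
override_path = resolve_storage_dir("COGNEE_DATA_DIR", base, ".data_storage/code_graph")
```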

🧹 Nitpick comments (3)

cognee/tasks/summarization/summarize_code.py (1)

17-17: Add type hint for source_code attribute

Consider creating a Protocol or base class that defines the expected structure for objects with source_code.

from typing import Protocol

class HasSourceCode(Protocol):
    source_code: str

code_data_points: list[HasSourceCode] = [
    file for file in code_graph_nodes if hasattr(file, "source_code")
]
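
One variation on the snippet above, offered as a sketch: marking the Protocol as `runtime_checkable` lets `isinstance` replace the `hasattr` filter. `CodeFileStub` and `FolderStub` are hypothetical node types invented for the example.

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@runtime_checkable
class HasSourceCode(Protocol):
    source_code: str


@dataclass
class CodeFileStub:
    """Hypothetical graph node that carries source code."""

    source_code: str


@dataclass
class FolderStub:
    """Hypothetical graph node without source code."""

    name: str


code_graph_nodes = [CodeFileStub("x = 1"), FolderStub("pkg"), CodeFileStub("y = 2")]

# isinstance against a runtime-checkable Protocol only checks that the attribute
# exists (not its type), so the filter behaves like the hasattr version.
code_data_points: list[HasSourceCode] = [
    node for node in code_graph_nodes if isinstance(node, HasSourceCode)
]
```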

cognee/api/v1/cognify/code_graph_pipeline.py (1)

140-144: Consider externalizing pipeline task configuration

Task configurations are hardcoded. Consider moving them to a configuration file for better maintainability.

# config/pipeline_tasks.yaml
tasks:
  get_repo_file_dependencies:
    batch_size: null
  enrich_dependency_graph:
    batch_size: 50
  expand_dependency_graph:
    batch_size: 50
  summarize_code:
    batch_size: 50
  add_data_points:
    batch_size: 50
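
A rough sketch of how such a file could be consumed follows. The mapping mirrors the YAML layout above and stands in for whatever a YAML parser would return; `PIPELINE_TASK_CONFIG` and `task_kwargs` are hypothetical names, and the shape of `task_config` is taken from the `Task(...)` calls quoted earlier in this thread.

```python
from typing import Any, Optional

# Stand-in for the parsed YAML file; a null batch_size becomes None.
PIPELINE_TASK_CONFIG: dict[str, dict[str, Optional[int]]] = {
    "get_repo_file_dependencies": {"batch_size": None},
    "enrich_dependency_graph": {"batch_size": 50},
    "expand_dependency_graph": {"batch_size": 50},
    "summarize_code": {"batch_size": 50},
    "add_data_points": {"batch_size": 50},
}


def task_kwargs(task_name: str, config: Optional[dict] = None) -> dict[str, Any]:
    """Build keyword arguments for a task, omitting task_config when no batch size is set."""
    settings = (config if config is not None else PIPELINE_TASK_CONFIG).get(task_name, {})
    batch_size = settings.get("batch_size")
    if batch_size is None:
        return {}
    return {"task_config": {"batch_size": batch_size}}
```

With this helper, a pipeline definition could write `Task(summarize_code, **task_kwargs("summarize_code"))`, assuming `Task` keeps accepting `task_config` as a keyword argument.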

cognee/shared/data_models.py (1)

199-221: Add docstrings and field validation.

Consider the following improvements:

Add docstrings to all new models to document their purpose and usage.
Add validation for the file_name field to ensure it's a valid path.

 class SummarizedFunction(BaseModel):
+    """Represents a summarized function with its key attributes.
+    
+    Attributes:
+        name: The name of the function
+        description: A brief description of what the function does
+        inputs: List of input parameter descriptions
+        outputs: List of output descriptions
+        decorators: List of decorators applied to the function
+    """
     name: str
     description: str
     inputs: Optional[List[str]] = None
     outputs: Optional[List[str]] = None
     decorators: Optional[List[str]] = None

 class SummarizedClass(BaseModel):
+    """Represents a summarized class with its key attributes.
+    
+    Attributes:
+        name: The name of the class
+        description: A brief description of what the class represents
+        methods: List of summarized methods in the class
+        decorators: List of decorators applied to the class
+    """
     name: str
     description: str
     methods: Optional[List[SummarizedFunction]] = None
     decorators: Optional[List[str]] = None

 class SummarizedCode(BaseModel):
+    """Represents a summarized code file with its key components.
+    
+    Attributes:
+        file_name: The name/path of the file
+        high_level_summary: A high-level overview of the file's purpose
+        key_features: List of important features or aspects
+        imports: List of import statements
+        constants: List of constant definitions
+        classes: List of summarized classes
+        functions: List of summarized functions
+        workflow_description: Optional description of the code's workflow
+    """
     file_name: str = Field(..., pattern=r'^[\w\-./]+\.[a-zA-Z0-9]+$')
     high_level_summary: str
     key_features: List[str]
     imports: List[str] = []
     constants: List[str] = []
     classes: List[SummarizedClass] = []
     functions: List[SummarizedFunction] = []
     workflow_description: Optional[str] = None
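
Before adopting the `file_name` pattern, it may help to see what it accepts. The sketch below applies the same expression with Python's `re` module as a stand-in for the model validator; the candidate names are made up for illustration.

```python
import re

# Same expression as in the suggested Field(..., pattern=...) above.
FILE_NAME_PATTERN = re.compile(r"^[\w\-./]+\.[a-zA-Z0-9]+$")

candidates = [
    "cognee/shared/data_models.py",  # ordinary relative path
    "Makefile",                      # no extension
    "docs/my notes.md",              # contains a space
    ".gitignore",                    # dotfile with nothing before the dot
]

accepted = {name: bool(FILE_NAME_PATTERN.match(name)) for name in candidates}
```

Only the first candidate matches, so a repository containing extensionless files, dotfiles, or paths with spaces would fail validation under this pattern.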

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between da5e3ab and cf1d53a.

📒 Files selected for processing (5)

cognee/api/v1/cognify/code_graph_pipeline.py (2 hunks)
cognee/infrastructure/llm/prompts/summarize_code.txt (1 hunks)
cognee/modules/data/extraction/extract_summary.py (2 hunks)
cognee/shared/data_models.py (2 hunks)
cognee/tasks/summarization/summarize_code.py (1 hunks)

✅ Files skipped from review due to trivial changes (1)

cognee/infrastructure/llm/prompts/summarize_code.txt

🔇 Additional comments (1)

cognee/shared/data_models.py (1)

4-4: LGTM!

The import statement is correctly formatted and includes all necessary types.

coderabbitai · 2024-12-17T11:54:28Z

cognee/modules/data/extraction/extract_summary.py

+async def extract_code_summary(content: str):
+    llm_client = get_llm_client()
+
+    system_prompt = read_query_prompt("summarize_code.txt")
+
+    llm_output = await llm_client.acreate_structured_output(content, system_prompt, response_model=SummarizedCode)
+
+    return llm_output


🛠️ Refactor suggestion

Consider refactoring to reduce code duplication

The new extract_code_summary function shares identical logic with extract_summary, differing only in the prompt file and response model. Consider refactoring to eliminate duplication.

-async def extract_code_summary(content: str):
-    llm_client = get_llm_client()
-
-    system_prompt = read_query_prompt("summarize_code.txt")
-
-    llm_output = await llm_client.acreate_structured_output(content, system_prompt, response_model=SummarizedCode)
-
-    return llm_output
+async def extract_code_summary(content: str):
+    return await extract_summary(content, SummarizedCode)
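
Note that extract_summary, as quoted earlier in this thread, reads summarize_content.txt, so delegating to it unchanged would also switch the prompt. One way to share the body while keeping both prompts is a helper parameterized by prompt file, sketched below. `_extract_structured` and `FakeClient` are hypothetical, and `get_llm_client`, `read_query_prompt`, and `SummarizedCode` are stubbed here so the example runs on its own.

```python
import asyncio


class FakeClient:
    """Hypothetical stand-in that echoes which prompt and model it received."""

    async def acreate_structured_output(self, content, system_prompt, response_model):
        return {"prompt": system_prompt, "model": response_model.__name__, "content": content}


def get_llm_client():
    # Stub for the real client factory.
    return FakeClient()


def read_query_prompt(prompt_file_name: str) -> str:
    # Stub: the real helper loads the prompt text from a file.
    return f"<contents of {prompt_file_name}>"


class SummarizedCode:
    """Placeholder for the Pydantic model of the same name."""


async def _extract_structured(content: str, prompt_file_name: str, response_model):
    """Shared body: only the prompt file and the response model vary."""
    llm_client = get_llm_client()
    system_prompt = read_query_prompt(prompt_file_name)
    return await llm_client.acreate_structured_output(content, system_prompt, response_model)


async def extract_summary(content: str, response_model):
    return await _extract_structured(content, "summarize_content.txt", response_model)


async def extract_code_summary(content: str):
    return await _extract_structured(content, "summarize_code.txt", SummarizedCode)


code_result = asyncio.run(extract_code_summary("def f(): pass"))
text_result = asyncio.run(extract_summary("some text", dict))
```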


coderabbitai · 2024-12-17T11:54:28Z

cognee/tasks/summarization/summarize_code.py

    file_summaries_map = {
-        code_file_data_point.extracted_id: file_summary.summary
-        for code_file_data_point, file_summary in zip(code_files_data_points, file_summaries)
+        code_data_point.extracted_id: str(file_summary)
+        for code_data_point, file_summary in zip(code_data_points, file_summaries)
    }


⚠️ Potential issue

Add null check for extracted_id in file_summaries_map

The code assumes that extracted_id is always present and non-null, which might not be true.

 file_summaries_map = {
-    code_data_point.extracted_id: str(file_summary)
+    code_data_point.extracted_id: str(file_summary)
     for code_data_point, file_summary in zip(code_data_points, file_summaries)
+    if code_data_point.extracted_id is not None
 }
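
The effect of the added filter can be checked in isolation. In the sketch below, `DataPointStub` is a hypothetical stand-in for the real data point class and the summaries are plain strings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataPointStub:
    """Hypothetical data point; only extracted_id matters here."""

    extracted_id: Optional[str]


code_data_points = [DataPointStub("a"), DataPointStub(None), DataPointStub("c")]
file_summaries = ["summary A", "summary without id", "summary C"]

# zip pairs each data point with its summary before the filter runs,
# so dropping an entry does not shift the remaining pairs.
file_summaries_map = {
    code_data_point.extracted_id: str(file_summary)
    for code_data_point, file_summary in zip(code_data_points, file_summaries)
    if code_data_point.extracted_id is not None
}
```

Entries without an id are dropped silently here; whether that is preferable to raising or logging depends on how downstream tasks use the map.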

coderabbitai · 2024-12-17T11:54:28Z

cognee/tasks/summarization/summarize_code.py

    if len(code_graph_nodes) == 0:
        return



🛠️ Refactor suggestion

Use proper async generator syntax for early return

The early return for empty code_graph_nodes should use async generator syntax.

 if len(code_graph_nodes) == 0:
-    return
+    return
+    # or more explicitly:
+    yield None
+    return
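
For reference, a bare `return` is already valid inside an async generator and simply ends iteration, whereas `yield None` would hand consumers an extra `None` item. The sketch below shows the early-return behaviour; `summarize_stub` and `collect` are hypothetical and do not reproduce the real task.

```python
import asyncio


async def summarize_stub(code_graph_nodes):
    """Async generator with the same early-return shape as the snippet above."""
    if len(code_graph_nodes) == 0:
        return  # ends the generator without yielding anything
    for node in code_graph_nodes:
        yield f"summary of {node}"


async def collect(nodes):
    """Drain the async generator into a list."""
    return [item async for item in summarize_stub(nodes)]


empty_result = asyncio.run(collect([]))
full_result = asyncio.run(collect(["a.py", "b.py"]))
```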


coderabbitai · 2024-12-17T11:54:29Z

cognee/shared/data_models.py

+class SummarizedFunction(BaseModel):
+    name: str
+    description: str
+    inputs: Optional[List[str]] = None
+    outputs: Optional[str] = None
+    decorators: Optional[List[str]] = None
+


🛠️ Refactor suggestion

Update outputs field to support multiple return values.

The outputs field is currently defined as Optional[str], which doesn't support functions that return multiple values. Consider changing it to Optional[List[str]] for better flexibility.

 class SummarizedFunction(BaseModel):
     name: str
     description: str
     inputs: Optional[List[str]] = None
-    outputs: Optional[str] = None
+    outputs: Optional[List[str]] = None
     decorators: Optional[List[str]] = None

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (2)

cognee/shared/data_models.py (2)

199-221: LGTM! Consider adding docstrings for better documentation.

The new Pydantic models create a well-structured hierarchy for code summarization. Consider adding docstrings to explain the purpose of each model:

 class SummarizedFunction(BaseModel):
+    """Represents a summarized function with its key attributes and metadata."""
     name: str
     description: str
     inputs: Optional[List[str]] = None
     outputs: Optional[List[str]] = None
     decorators: Optional[List[str]] = None

 class SummarizedClass(BaseModel):
+    """Represents a summarized class with its methods and metadata."""
     name: str
     description: str
     methods: Optional[List[SummarizedFunction]] = None
     decorators: Optional[List[str]] = None

 class SummarizedCode(BaseModel):
+    """Represents a complete code file summary with its structure and components."""
     file_name: str
     high_level_summary: str
     key_features: List[str]
     imports: List[str] = []
     constants: List[str] = []
     classes: List[SummarizedClass] = []
     functions: List[SummarizedFunction] = []
     workflow_description: Optional[str] = None

199-221: Add field descriptions and validations for better API documentation.

Consider enhancing the models with field descriptions and validations:

 class SummarizedFunction(BaseModel):
     """Represents a summarized function with its key attributes and metadata."""
-    name: str
-    description: str
-    inputs: Optional[List[str]] = None
-    outputs: Optional[List[str]] = None
-    decorators: Optional[List[str]] = None
+    name: str = Field(..., description="The name of the function")
+    description: str = Field(..., description="A brief description of the function's purpose")
+    inputs: Optional[List[str]] = Field(None, description="List of function parameters")
+    outputs: Optional[List[str]] = Field(None, description="List of return values")
+    decorators: Optional[List[str]] = Field(None, description="List of function decorators")

 class SummarizedClass(BaseModel):
     """Represents a summarized class with its methods and metadata."""
-    name: str
-    description: str
-    methods: Optional[List[SummarizedFunction]] = None
-    decorators: Optional[List[str]] = None
+    name: str = Field(..., description="The name of the class")
+    description: str = Field(..., description="A brief description of the class's purpose")
+    methods: Optional[List[SummarizedFunction]] = Field(None, description="List of class methods")
+    decorators: Optional[List[str]] = Field(None, description="List of class decorators")

 class SummarizedCode(BaseModel):
     """Represents a complete code file summary with its structure and components."""
-    file_name: str
-    high_level_summary: str
-    key_features: List[str]
-    imports: List[str] = []
-    constants: List[str] = []
-    classes: List[SummarizedClass] = []
-    functions: List[SummarizedFunction] = []
-    workflow_description: Optional[str] = None
+    file_name: str = Field(..., description="The name of the file being summarized")
+    high_level_summary: str = Field(..., description="A high-level overview of the file's purpose")
+    key_features: List[str] = Field(..., description="List of key features or functionalities")
+    imports: List[str] = Field(default_factory=list, description="List of import statements")
+    constants: List[str] = Field(default_factory=list, description="List of constant definitions")
+    classes: List[SummarizedClass] = Field(default_factory=list, description="List of class summaries")
+    functions: List[SummarizedFunction] = Field(default_factory=list, description="List of function summaries")
+    workflow_description: Optional[str] = Field(None, description="Description of the code's workflow")

Consider also adding these validations:

Add regex validation for file_name to ensure it's a valid path
Add format validation for imports to ensure they follow Python import syntax

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cf1d53a and 7523b0e.

📒 Files selected for processing (2)

cognee/modules/data/extraction/extract_summary.py (2 hunks)
cognee/shared/data_models.py (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

cognee/modules/data/extraction/extract_summary.py

🔇 Additional comments (1)

cognee/shared/data_models.py (1)

4-4: LGTM!

The import statement follows Python's import style guide.

* feat: Add error handling in case user is already part of database and permission already given to group Added error handling in case permission is already given to group and user is already part of group Feature COG-656 * feat: Add user verification for accessing data Verify user has access to data before returning it Feature COG-656 * feat: Add compute search to cognee Add compute search to cognee which makes searches human readable Feature COG-656 * feat: Add simple instruction for system prompt Add simple instruction for system prompt Feature COG-656 * pass pydantic model tocognify * feat: Add unauth access error to getting data Raise unauth access error when trying to read data without access Feature COG-656 * refactor: Rename query compute to query completion Rename searching type from compute to completion Refactor COG-656 * chore: Update typo in code Update typo in string in code Chore COG-656 * Add mcp to cognee * Add simple README * Update cognee-mcp/mcpcognee/__main__.py Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * Create dockerhub.yml * Update get_cognify_router.py * fix: Resolve reflection issue when running cognee a second time after pruning data When running cognee a second time after pruning data some metadata doesn't get pruned. This makes cognee believe some tables exist that have been deleted Fix * fix: Add metadata reflection fix to sqlite as well Added fix when reflecting metadata to sqlite as well Fix * update * Revert "fix: Add metadata reflection fix to sqlite as well" This reverts commit 394a0b2. 
* COG-810 Implement a top-down dependency graph builder tool (#268) * feat: parse repo to call graph * Update/repo_processor/top_down_repo_parse.py task * fix: minor improvements * feat: file parsing jedi script optimisation --------- * Add type to DataPoint metadata (#364) * Add type to DataPoint metadata * Add missing index_fields * Use DataPoint UUID type in pgvector create_data_points * Make _metadata mandatory everywhere * Fixes * Fixes to our demo * feat: Add search by dataset for cognee Added ability to search by datasets for cognee users Feature COG-912 * feat: outsources chunking parameters to extract chunk from documents … (#289) * feat: outsources chunking parameters to extract chunk from documents task * fix: Remove backend lock from UI Removed lock that prevented using multiple datasets in cognify Fix COG-912 * COG 870 Remove duplicate edges from the code graph (#293) * feat: turn summarize_code into generator * feat: extract run_code_graph_pipeline, update the pipeline * feat: minimal code graph example * refactor: update argument * refactor: move run_code_graph_pipeline to cognify/code_graph_pipeline * refactor: indentation and whitespace nits * refactor: add deprecated use comments and warnings --------- Co-authored-by: Vasilije <[email protected]> Co-authored-by: Igor Ilic <[email protected]> Co-authored-by: Boris <[email protected]> * test: Added test for getting of documents for search Added test to verify getting documents related to datasets intended for search Test COG-912 * Structured code summarization (#375) * feat: turn summarize_code into generator * feat: extract run_code_graph_pipeline, update the pipeline * feat: minimal code graph example * refactor: update argument * refactor: move run_code_graph_pipeline to cognify/code_graph_pipeline * refactor: indentation and whitespace nits * refactor: add deprecated use comments and warnings * Structured code summarization * add missing prompt file * Remove summarization_model argument from 
summarize_code and fix typehinting * minor refactors --------- Co-authored-by: lxobr <[email protected]> Co-authored-by: Vasilije <[email protected]> Co-authored-by: Igor Ilic <[email protected]> Co-authored-by: Boris <[email protected]> * fix: Resolve issue with cognify router graph model default value Resolve issue with default value for graph model in cognify endpoint Fix * chore: Resolve typo in getting documents code Resolve typo in code chore COG-912 * Update .github/workflows/dockerhub.yml Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * Update .github/workflows/dockerhub.yml * Update .github/workflows/dockerhub.yml * Update .github/workflows/dockerhub.yml * Update get_cognify_router.py * fix: Resolve syntax issue with cognify router Resolve syntax issue with cognify router Fix * feat: Add ruff pre-commit hook for linting and formatting Added formatting and linting on pre-commit hook Feature COG-650 * chore: Update ruff lint options in pyproject file Update ruff lint options in pyproject file Chore * test: Add ruff linter github action Added linting check with ruff in github actions Test COG-650 * feat: deletes executor limit from get_repo_file_dependencies * feat: implements mock feature in LiteLLM engine * refactor: Remove changes to cognify router Remove changes to cognify router Refactor COG-650 * fix: fixing boolean env for github actions * test: Add test for ruff format for cognee code Test if code is formatted for cognee Test COG-650 * refactor: Rename ruff gh actions Rename ruff gh actions to be more understandable Refactor COG-650 * chore: Remove checking of ruff lint and format on push Remove checking of ruff lint and format on push Chore COG-650 * feat: Add deletion of local files when deleting data Delete local files when deleting data from cognee Feature COG-475 * fix: changes back the max workers to 12 * feat: Adds mock summary for codegraph pipeline * refacotr: Add current development status Save 
current development status Refactor * Fix langfuse * Fix langfuse * Fix langfuse * Add evaluation notebook * Rename eval notebook * chore: Add temporary state of development Add temp development state to branch Chore * fix: Add poetry.lock file, make langfuse mandatory Added langfuse as mandatory dependency, added poetry.lock file Fix * Fix: fixes langfuse config settings * feat: Add deletion of local files made by cognee through data endpoint Delete local files made by cognee when deleting data from database through endpoint Feature COG-475 * test: Revert changes on test_pgvector Revert changes on test_pgvector which were made to test deletion of local files Test COG-475 * chore: deletes the old test for the codegraph pipeline * test: Add test to verify deletion of local files Added test that checks local files created by cognee will be deleted and those not created by cognee won't Test COG-475 * chore: deletes unused old version of the codegraph * chore: deletes unused imports from code_graph_pipeline * Ingest non-code files * Fixing review findings * Ingest non-code files (#395) * Ingest non-code files * Fixing review findings * test: Update test regarding message Update assertion message, add veryfing of file existence * Handle retryerrors in code summary (#396) * Handle retryerrors in code summary * Log instead of print * fix: updates the acreate_structured_output * chore: Add logging to sentry when file which should exist can't be found Log to sentry that a file which should exist can't be found Chore COG-475 * Fix diagram * fix: refactor mcp * Add Smithery CLI installation instructions and badge * Move readme * Update README.md * Update README.md * Cog 813 source code chunks (#383) * fix: pass the list of all CodeFiles to enrichment task * feat: introduce SourceCodeChunk, update metadata * feat: get_source_code_chunks code graph pipeline task * feat: integrate get_source_code_chunks task, comment out summarize_code * Fix code summarization (#387) * feat: 
update data models * feat: naive parse long strings in source code * fix: get_non_py_files instead of get_non_code_files * fix: limit recursion, add comment * handle embedding empty input error (#398) * feat: robustly handle CodeFile source code * refactor: sort imports * todo: add support for other embedding models * feat: add custom logger * feat: add robustness to get_source_code_chunks Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * feat: improve embedding exceptions * refactor: format indents, rename module --------- Co-authored-by: alekszievr <[email protected]> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * Fix diagram * Fix instructions * adding and fixing files * Update README.md * ruff format * Fix linter issues * Implement PR review * Comment out profiling * fix: add allowed extensions * fix: adhere UnstructuredDocument.read() to Document * feat: time code graph run and add mock support * Fix ollama, work on visualization * fix: Fixes faulty logging format and sets up error logging in dynamic steps example * Overcome ContextWindowExceededError by checking token count while chunking (#413) * fix: Fixes duplicated edges in cognify by limiting the recursion depth in add datapoints * Adjust AudioDocument and handle None token limit * Handle azure models as well * Add clean logging to code graph example * Remove setting envvars from arg * fix: fixes create_cognee_style_network_with_logo unit test * fix: removes accidental remained print * Get embedding engine instead of passing it. Get it from vector engine instead of direct getter. * Fix visualization * Get embedding engine instead of passing it in code chunking. 
* Fix poetry issues * chore: Update version of poetry install action * chore: Update action to trigger on pull request for any branch * chore: Remove if in github action to allow triggering on push * chore: Remove if condition to allow gh actions to trigger on push to PR * chore: Update poetry version in github actions * chore: Set fixed ubuntu version to 22.04 * chore: Update py lint to use ubuntu 22.04 * chore: update ubuntu version to 22.04 * feat: implements the first version of graph based completion in search * chore: Update python 3.9 gh action to use 3.12 instead * chore: Update formatting of utils.py * Fix poetry issues * Adjust integration tests * fix: Fixes ruff formatting * Handle circular import * fix: Resolve profiler issue with partial and recursive logger imports Resolve issue for profiler with partial and recursive logger imports * fix: Remove logger from __init__.py file * test: Test profiling on HEAD branch * test: Return profiler to base branch * Set max_tokens in config * Adjust SWE-bench script to code graph pipeline call * Adjust SWE-bench script to code graph pipeline call * fix: Add fix for accessing dictionary elements that don't exits Using get for the text key instead of direct access to handle situation if the text key doesn't exist * feat: Add ability to change graph database configuration through cognee * feat: adds pydantic types to graph layer models * feat: adds basic retriever for swe bench * Match Ruff version in config to the one in github actions * feat: implements code retreiver * Fix: fixes unit test for codepart search * Format with Ruff 0.9.0 * Fix: deleting incorrect repo path * fix: resolve issue with langfuse dependency installation when integrating cognee in different packages * version: Increase version to 0.1.21 --------- Co-authored-by: Igor Ilic <[email protected]> Co-authored-by: Vasilije <[email protected]> Co-authored-by: Igor Ilic <[email protected]> Co-authored-by: Rita Aleksziev <[email protected]> 
Co-authored-by: vasilije Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-authored-by: lxobr Co-authored-by: alekszievr Co-authored-by: hajdul88 Co-authored-by: Henry Mao

* Revert "fix: Add metadata reflection fix to sqlite as well" This reverts commit 394a0b2. * COG-810 Implement a top-down dependency graph builder tool (#268) * feat: parse repo to call graph * Update/repo_processor/top_down_repo_parse.py task * fix: minor improvements * feat: file parsing jedi script optimisation --------- * Add type to DataPoint metadata (#364) * Add missing index_fields * Use DataPoint UUID type in pgvector create_data_points * Make _metadata mandatory everywhere * feat: Add search by dataset for cognee Added ability to search by datasets for cognee users Feature COG-912 * feat: outsources chunking parameters to extract chunk from documents … (#289) * feat: outsources chunking parameters to extract chunk from documents task * fix: Remove backend lock from UI Removed lock that prevented using multiple datasets in cognify Fix COG-912 * COG 870 Remove duplicate edges from the code graph (#293) * feat: turn summarize_code into generator * feat: extract run_code_graph_pipeline, update the pipeline * feat: minimal code graph example * refactor: update argument * refactor: move run_code_graph_pipeline to cognify/code_graph_pipeline * refactor: indentation and whitespace nits * refactor: add deprecated use comments and warnings --------- Co-authored-by: Vasilije <[email protected]> Co-authored-by: Igor Ilic <[email protected]> Co-authored-by: Boris <[email protected]> * test: Added test for getting of documents for search Added test to verify getting documents related to datasets intended for search Test COG-912 * Structured code summarization (#375) * feat: turn summarize_code into generator * feat: extract run_code_graph_pipeline, update the pipeline * feat: minimal code graph example * refactor: update argument * refactor: move run_code_graph_pipeline to cognify/code_graph_pipeline * refactor: indentation and whitespace nits * refactor: add deprecated use comments and warnings * Structured code summarization * add missing prompt file * Remove 
summarization_model argument from summarize_code and fix typehinting * minor refactors --------- Co-authored-by: lxobr <[email protected]> Co-authored-by: Vasilije <[email protected]> Co-authored-by: Igor Ilic <[email protected]> Co-authored-by: Boris <[email protected]> * fix: Resolve issue with cognify router graph model default value Resolve issue with default value for graph model in cognify endpoint Fix * chore: Resolve typo in getting documents code Resolve typo in code chore COG-912 * Update .github/workflows/dockerhub.yml Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * Update .github/workflows/dockerhub.yml * Update .github/workflows/dockerhub.yml * Update .github/workflows/dockerhub.yml * Update get_cognify_router.py * fix: Resolve syntax issue with cognify router Resolve syntax issue with cognify router Fix * feat: Add ruff pre-commit hook for linting and formatting Added formatting and linting on pre-commit hook Feature COG-650 * chore: Update ruff lint options in pyproject file Update ruff lint options in pyproject file Chore * test: Add ruff linter github action Added linting check with ruff in github actions Test COG-650 * feat: deletes executor limit from get_repo_file_dependencies * feat: implements mock feature in LiteLLM engine * refactor: Remove changes to cognify router Remove changes to cognify router Refactor COG-650 * fix: fixing boolean env for github actions * test: Add test for ruff format for cognee code Test if code is formatted for cognee Test COG-650 * refactor: Rename ruff gh actions Rename ruff gh actions to be more understandable Refactor COG-650 * chore: Remove checking of ruff lint and format on push Remove checking of ruff lint and format on push Chore COG-650 * feat: Add deletion of local files when deleting data Delete local files when deleting data from cognee Feature COG-475 * fix: changes back the max workers to 12 * feat: Adds mock summary for codegraph pipeline * refacotr: Add 
current development status Save current development status Refactor * Fix langfuse * Fix langfuse * Fix langfuse * Add evaluation notebook * Rename eval notebook * chore: Add temporary state of development Add temp development state to branch Chore * fix: Add poetry.lock file, make langfuse mandatory Added langfuse as mandatory dependency, added poetry.lock file Fix * Fix: fixes langfuse config settings * feat: Add deletion of local files made by cognee through data endpoint Delete local files made by cognee when deleting data from database through endpoint Feature COG-475 * test: Revert changes on test_pgvector Revert changes on test_pgvector which were made to test deletion of local files Test COG-475 * chore: deletes the old test for the codegraph pipeline * test: Add test to verify deletion of local files Added test that checks local files created by cognee will be deleted and those not created by cognee won't Test COG-475 * chore: deletes unused old version of the codegraph * chore: deletes unused imports from code_graph_pipeline * Ingest non-code files * Fixing review findings * Ingest non-code files (#395) * Ingest non-code files * Fixing review findings * test: Update test regarding message Update assertion message, add veryfing of file existence * Handle retryerrors in code summary (#396) * Handle retryerrors in code summary * Log instead of print * fix: updates the acreate_structured_output * chore: Add logging to sentry when file which should exist can't be found Log to sentry that a file which should exist can't be found Chore COG-475 * Fix diagram * fix: refactor mcp * Add Smithery CLI installation instructions and badge * Move readme * Update README.md * Update README.md * Cog 813 source code chunks (#383) * fix: pass the list of all CodeFiles to enrichment task * feat: introduce SourceCodeChunk, update metadata * feat: get_source_code_chunks code graph pipeline task * feat: integrate get_source_code_chunks task, comment out summarize_code * Fix code 
summarization (#387) * feat: update data models * feat: naive parse long strings in source code * fix: get_non_py_files instead of get_non_code_files * fix: limit recursion, add comment * handle embedding empty input error (#398) * feat: robustly handle CodeFile source code * refactor: sort imports * todo: add support for other embedding models * feat: add custom logger * feat: add robustness to get_source_code_chunks Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * feat: improve embedding exceptions * refactor: format indents, rename module --------- Co-authored-by: alekszievr <[email protected]> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * Fix diagram * Fix diagram * Fix instructions * Fix instructions * adding and fixing files * Update README.md * ruff format * Fix linter issues * Fix linter issues * Fix linter issues * Fix linter issues * Fix linter issues * Fix linter issues * Fix linter issues * Fix linter issues * Fix linter issues * Fix linter issues * Implement PR review * Comment out profiling * Comment out profiling * Comment out profiling * fix: add allowed extensions * fix: adhere UnstructuredDocument.read() to Document * feat: time code graph run and add mock support * Fix ollama, work on visualization * fix: Fixes faulty logging format and sets up error logging in dynamic steps example * Overcome ContextWindowExceededError by checking token count while chunking (#413) * fix: Fixes duplicated edges in cognify by limiting the recursion depth in add datapoints * Adjust AudioDocument and handle None token limit * Handle azure models as well * Fix visualization * Fix visualization * Fix visualization * Add clean logging to code graph example * Remove setting envvars from arg * fix: fixes create_cognee_style_network_with_logo unit test * fix: removes accidental remained print * Fix visualization * Fix visualization * Fix visualization * Get embedding engine instead of 
passing it. Get it from vector engine instead of direct getter. * Fix visualization * Fix visualization * Fix poetry issues * Get embedding engine instead of passing it in code chunking. * Fix poetry issues * Fix poetry issues * Fix poetry issues * Fix poetry issues * Fix poetry issues * Fix poetry issues * Fix poetry issues * Fix poetry issues * chore: Update version of poetry install action * chore: Update action to trigger on pull request for any branch * chore: Remove if in github action to allow triggering on push * chore: Remove if condition to allow gh actions to trigger on push to PR * chore: Update poetry version in github actions * chore: Set fixed ubuntu version to 22.04 * chore: Update py lint to use ubuntu 22.04 * chore: update ubuntu version to 22.04 * feat: implements the first version of graph based completion in search * chore: Update python 3.9 gh action to use 3.12 instead * chore: Update formatting of utils.py * Fix poetry issues * Fix poetry issues * Fix poetry issues * Fix poetry issues * Fix poetry issues * Fix poetry issues * Fix poetry issues * Adjust integration tests * fix: Fixes ruff formatting * Handle circular import * fix: Resolve profiler issue with partial and recursive logger imports Resolve issue for profiler with partial and recursive logger imports * fix: Remove logger from __init__.py file * test: Test profiling on HEAD branch * test: Return profiler to base branch * Set max_tokens in config * Adjust SWE-bench script to code graph pipeline call * Adjust SWE-bench script to code graph pipeline call * fix: Add fix for accessing dictionary elements that don't exits Using get for the text key instead of direct access to handle situation if the text key doesn't exist * feat: Add ability to change graph database configuration through cognee * feat: adds pydantic types to graph layer models * test: Test ubuntu 24.04 * test: change all actions to ubuntu-latest * feat: adds basic retriever for swe bench * Match Ruff version in config to 
the one in github actions * feat: implements code retreiver * Fix: fixes unit test for codepart search * Format with Ruff 0.9.0 * Fix: deleting incorrect repo path * docs: Add LlamaIndex Cognee integration notebook Added LlamaIndex Cognee integration notebook * test: Add github action for testing llama index cognee integration notebook * fix: resolve issue with langfuse dependency installation when integrating cognee in different packages * version: Increase version to 0.1.21 * fix: update dependencies of the mcp server * Update README.md * Fix: Fixes logging setup * feat: deletes on the fly embeddings as uses edge collections * fix: Change nbformat on llama index integration notebook * fix: Resolve api key issue with llama index integration notebook * fix: Attempt to resolve issue with Ubuntu 24.04 segmentation fault * version: Increase version to 0.1.22 --------- Co-authored-by: vasilije <[email protected]> Co-authored-by: Igor Ilic <[email protected]> Co-authored-by: Igor Ilic <[email protected]> Co-authored-by: lxobr <[email protected]> Co-authored-by: alekszievr <[email protected]> Co-authored-by: hajdul88 <[email protected]> Co-authored-by: Vasilije <[email protected]> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-authored-by: Rita Aleksziev <[email protected]> Co-authored-by: Henry Mao <[email protected]>

lxobr and others added 13 commits December 10, 2024 15:58

feat: turn summarize_code into generator

6a0c1da

feat: extract run_code_graph_pipeline, update the pipeline

f46e8aa

feat: minimal code graph example

15cf708

Merge branch 'main' into COG-870-deduplicate-code-graph-edges

d5efe03

Merge branch 'dev' into COG-870-deduplicate-code-graph-edges

313395a

Merge branch 'dev' into COG-870-deduplicate-code-graph-edges

17a116a

refactor: update argument

52a0dac

refactor: move run_code_graph_pipeline to cognify/code_graph_pipeline

b78a011

refactor: indentation and whitespace nits

4023b4d

Merge branch 'dev' into COG-870-deduplicate-code-graph-edges

c55227c

refactor: add deprecated use comments and warnings

ad1b435

Merge branch 'dev' into COG-870-deduplicate-code-graph-edges

9e6406a

Structured code summarization

51271b1

Merge branch 'dev' into COG-870-deduplicate-code-graph-edges

c299995

alekszievr requested a review from lxobr December 16, 2024 15:27

dexters1 added the run-checks label Dec 16, 2024

lxobr reviewed Dec 16, 2024

alekszievr and others added 3 commits December 17, 2024 09:37

add missing prompt file

6096485

Merge branch 'COG-870-deduplicate-code-graph-edges' into COG-820-improve-summarization

3529a70

Remove summarization_model argument from summarize_code and fix typehinting

d9374a0

alekszievr requested a review from lxobr December 17, 2024 10:10

Base automatically changed from COG-870-deduplicate-code-graph-edges to dev December 17, 2024 11:02

Merge branch 'dev' into COG-820-improve-summarization

cf1d53a

coderabbitai bot reviewed Dec 17, 2024

minor refactors

7523b0e

coderabbitai bot reviewed Dec 17, 2024

lxobr approved these changes Dec 17, 2024

alekszievr merged commit 9afd0ec into dev Dec 17, 2024
23 of 24 checks passed

alekszievr deleted the COG-820-improve-summarization branch December 17, 2024 12:05

coderabbitai bot mentioned this pull request Dec 17, 2024

feat: Add search by dataset for cognee #376

Merged
