Feature/cog 971 preparing swe bench run #424
Conversation
Walkthrough: The changes span multiple files across the Cognee project, introducing a new optional max_tokens configuration to control token limits during document processing.
Warning: Review ran into problems. Git: Failed to clone repository.
…hub.com/topoteretes/cognee into feature/cog-971-preparing-swe-bench-run
Actionable comments posted: 6
🧹 Nitpick comments (22)
cognee/modules/retrieval/description_to_codepart_search.py (2)
48-48: Remove the debugging print statement.
The print(include_docs) statement appears to be for debugging purposes. It's recommended to remove it or replace it with a proper logging statement to maintain clean code in production.
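As a rough illustration (assuming the module already uses Python's standard logging; the logger name is only an example), the debug print could become:

import logging

logger = logging.getLogger(__name__)

# Instead of: print(include_docs)
logger.debug("include_docs=%s", include_docs)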
130-132: Add separators when concatenating code pieces.
When concatenating code pieces into context, consider adding separators or newlines between them to improve readability and maintainability. Apply this change:
context = "\n\n".join(
    code_piece.get_attribute("source_code") for code_piece in code_pieces_to_return
)
cognee/modules/engine/models/EntityType.py (1)
8-8: Consider using ClassVar for the class-level attribute.
Since pydantic_type is a class-level attribute that does not represent an instance field, consider annotating it with ClassVar to clarify its intent. Apply this change:
from typing import ClassVar
# ...
pydantic_type: ClassVar[str] = "EntityType"
cognee/modules/engine/models/Entity.py (1)
10-10: Consider using ClassVar for the class-level attribute.
Since pydantic_type is a class-level attribute, annotating it with ClassVar will make its purpose clearer and align with best practices. Apply this change:
from typing import ClassVar
# ...
pydantic_type: ClassVar[str] = "Entity"
cognee/modules/data/processing/document_types/Document.py (1)
14-15: Add a docstring to document the new parameter.
The addition of the max_tokens parameter looks good. Consider adding a docstring to document its purpose and behavior:
def read(self, chunk_size: int, chunker=str, max_tokens: Optional[int] = None) -> str:
+    """Read document content in chunks.
+
+    Args:
+        chunk_size: Size of each chunk
+        chunker: Chunking strategy to use
+        max_tokens: Optional maximum number of tokens per chunk
+
+    Returns:
+        Document content as string
+    """
    pass
cognee/tasks/documents/extract_chunks_from_documents.py (1)
6-15: Add a docstring to document the new parameter.
The propagation of the max_tokens parameter to document.read() is implemented correctly. Consider adding a docstring to document the parameter's purpose and behavior:
async def extract_chunks_from_documents(
    documents: list[Document],
    chunk_size: int = 1024,
    chunker="text_chunker",
    max_tokens: Optional[int] = None,
):
+    """Extract chunks from a list of documents.
+
+    Args:
+        documents: List of documents to process
+        chunk_size: Size of each chunk
+        chunker: Chunking strategy to use
+        max_tokens: Optional maximum number of tokens per chunk
+
+    Yields:
+        Document chunks
+    """
    for document in documents:
        for document_chunk in document.read(
            chunk_size=chunk_size, chunker=chunker, max_tokens=max_tokens
        ):
            yield document_chunk
cognee/modules/data/processing/document_types/PdfDocument.py (1)
12-12: Add a docstring to document the new parameter.
The implementation looks good, but please add a docstring to document the purpose and usage of the max_tokens parameter:
def read(self, chunk_size: int, chunker: str, max_tokens: Optional[int] = None):
+    """
+    Read and chunk the PDF document.
+
+    Args:
+        chunk_size: Size of each chunk
+        chunker: Type of chunker to use
+        max_tokens: Optional maximum number of tokens to process
+    """
cognee/modules/data/processing/document_types/TextDocument.py (1)
10-10: Add a docstring to document the new parameter.
The implementation looks good, but please add a docstring to document the purpose and usage of the max_tokens parameter:
def read(self, chunk_size: int, chunker: str, max_tokens: Optional[int] = None):
+    """
+    Read and chunk the text document.
+
+    Args:
+        chunk_size: Size of each chunk
+        chunker: Type of chunker to use
+        max_tokens: Optional maximum number of tokens to process
+    """
cognee/modules/data/processing/document_types/AudioDocument.py (2)
16-16: Add a docstring to document the new parameter.
The implementation looks good, but please add a docstring to document the purpose and usage of the max_tokens parameter:
def read(self, chunk_size: int, chunker: str, max_tokens: Optional[int] = None):
+    """
+    Read and chunk the audio document after transcription.
+
+    Args:
+        chunk_size: Size of each chunk
+        chunker: Type of chunker to use
+        max_tokens: Optional maximum number of tokens to process
+    """
22-24: Simplify the lambda function.
The lambda function can be simplified by passing an iterable directly instead:
chunker = chunker_func(
-    self, chunk_size=chunk_size, get_text=lambda: [text], max_tokens=max_tokens
+    self, chunk_size=chunk_size, get_text=iter([text]), max_tokens=max_tokens
)
cognee/modules/data/processing/document_types/ImageDocument.py (2)
16-16: Add a docstring for the read method.
Please add a docstring to document the new max_tokens parameter and its purpose:
def read(self, chunk_size: int, chunker: str, max_tokens: Optional[int] = None):
+    """Read and chunk the image content.
+
+    Args:
+        chunk_size: Size of each chunk
+        chunker: Type of chunker to use
+        max_tokens: Optional maximum number of tokens per chunk
+
+    Yields:
+        Iterator of text chunks
+    """
21-23: Consider validating the max_tokens parameter.
The max_tokens parameter is passed directly to the chunker without validation. Consider adding validation to ensure it's a positive integer when provided:
+ if max_tokens is not None and max_tokens <= 0:
+     raise ValueError("max_tokens must be a positive integer")
chunker = chunker_func(
    self, chunk_size=chunk_size, get_text=lambda: [text], max_tokens=max_tokens
)
cognee/modules/data/processing/document_types/UnstructuredDocument.py (2)
13-13: Add a docstring and improve type hints.
Please add a docstring and improve type hints for better code maintainability:
- def read(self, chunk_size: int, chunker: str, max_tokens: Optional[int] = None) -> str:
+ def read(self, chunk_size: int, chunker: str, max_tokens: Optional[int] = None) -> Iterator[str]:
+     """Read and chunk the document content using the unstructured library.
+
+     Args:
+         chunk_size: Size of each chunk
+         chunker: Type of chunker to use
+         max_tokens: Optional maximum number of tokens per chunk
+
+     Yields:
+         Iterator of text chunks
+
+     Raises:
+         UnstructuredLibraryImportError: If the unstructured library is not installed
+     """
Line range hint 14-31: Add type hints to the get_text generator function.
Consider adding type hints to the inner generator function for better code maintainability:
- def get_text():
+ def get_text() -> Iterator[str]:
cognee/shared/CodeGraphEntities.py (1)
8-8: Consider validating that pydantic_type matches the class name.
The pydantic_type values match their respective class names, but there's no validation to ensure this remains true if class names change. Consider adding a validation in the base class or using a class decorator:
def validate_pydantic_type(cls):
    """Decorator to validate pydantic_type matches class name."""
    if cls.pydantic_type != cls.__name__:
        raise ValueError(f"pydantic_type must match class name: {cls.__name__}")
    return cls

@validate_pydantic_type
class Repository(DataPoint):
    # ... rest of the class definition
Also applies to: 15-15, 27-27, 36-36
examples/python/code_graph_example.py (2)
33-33: Consider making the log level configurable.
The logging level is hardcoded to ERROR. Consider making it configurable via command-line arguments:
+ parser.add_argument(
+     "--log-level",
+     type=str,
+     choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"],
+     default="ERROR",
+     help="Set the logging level",
+ )
args = parse_args()
- setup_logging(logging.ERROR)
+ setup_logging(getattr(logging, args.log_level.upper()))
43-45: Improve the timing output format.
Consider using a more structured format for timing output and adding more details:
- print("\n" + "=" * 50)
- print(f"Pipeline Execution Time: {end_time - start_time:.2f} seconds")
- print("=" * 50 + "\n")
+ execution_time = end_time - start_time
+ print("\nPipeline Execution Summary")
+ print("=" * 50)
+ print(f"Total time: {execution_time:.2f} seconds")
+ print(f"Average time per file: {execution_time/len(list(args.repo_path)):.2f} seconds")
+ print("=" * 50)
cognee/tasks/repo_processor/expand_dependency_graph.py (1)
Line range hint 13-36: LGTM! Well-structured implementation with good error handling.
The function properly handles empty code parts and includes appropriate logging.
Remove commented-out code.
The function contains commented-out code that should be removed if no longer needed:
- # part_of = code_file,
- # graph.add_node(part_node_id, source_code=code_part, node_type=part_type)
- # graph.add_edge(parent_node_id, part_node_id, relation="contains")
Document the side effect in the function docstring.
The function modifies the input code_file parameter by extending its contains list. This side effect should be documented in the function's docstring:
- """Add code part nodes and edges for a specific part type."""
+ """Add code part nodes and edges for a specific part type.
+
+ Args:
+     code_file: The CodeFile object to be modified. Its contains list will be extended.
+     part_type: The type of code parts being added.
+     code_parts: List of code parts to process.
+ """
cognee/tasks/chunks/chunk_by_paragraph.py (1)
30-32: Simplify model name extraction.
The current approach of splitting and taking the last part could be simplified:
- embedding_model = vector_engine.embedding_engine.model
- embedding_model = embedding_model.split("/")[-1]
+ embedding_model = vector_engine.embedding_engine.model.rpartition("/")[-1]
cognee/modules/chunking/TextChunker.py (1)
25-28: Consider adding a docstring to the new method.
The check_word_count_and_token_count method would benefit from documentation explaining its purpose and parameters:
def check_word_count_and_token_count(self, word_count_before, token_count_before, chunk_data):
+     """
+     Check if adding a new chunk would exceed word count or token count limits.
+
+     Args:
+         word_count_before (int): Current word count before adding the chunk
+         token_count_before (int): Current token count before adding the chunk
+         chunk_data (dict): Dictionary containing the chunk's word_count and token_count
+
+     Returns:
+         bool: True if adding the chunk won't exceed limits, False otherwise
+     """
    word_count_fits = word_count_before + chunk_data["word_count"] <= self.max_chunk_size
    token_count_fits = token_count_before + chunk_data["token_count"] <= self.max_tokens
    return word_count_fits and token_count_fits
evals/eval_swe_bench.py (1)
91-100: Consider parallel processing for better performance.
The sequential processing of predictions might be slower for large datasets. Consider using asyncio.gather to process predictions in parallel:
- preds = []
- for instance in dataset:
-     instance_id = instance["instance_id"]
-     model_patch = await pred_func(instance)
-     preds.append(
-         {
-             "instance_id": instance_id,
-             "model_patch": model_patch,
-             "model_name_or_path": model_name,
-         }
-     )
+ tasks = [pred_func(instance) for instance in dataset]
+ model_patches = await asyncio.gather(*tasks)
+ preds = [
+     {
+         "instance_id": instance["instance_id"],
+         "model_patch": model_patch,
+         "model_name_or_path": model_name,
+     }
+     for instance, model_patch in zip(dataset, model_patches)
+ ]
cognee/infrastructure/llm/prompts/patch_gen_kg_instructions.txt (1)
1-6: Enhance clarity and grammar in the instructions.
Consider these improvements:
- Add a comma before "and" in line 3: "with correct syntax, and"
- Replace "make sure" with "ensure" for stronger wording
-You are a senior software engineer. I need you to solve this issue by looking at the provided context and
-generate a single patch file that I can apply directly to this repository using git apply.
-Additionally, please make sure that you provide code only with correct syntax and
-you apply the patch on the relevant files (together with their path that you can try to find out from the github issue). Don't change the names of existing
-functions or classes, as they may be referenced from other code.
-Please respond only with a single patch file in the following format without adding any additional context or string.
+You are a senior software engineer. I need you to solve this issue by looking at the provided context and
+generate a single patch file that I can apply directly to this repository using git apply.
+Additionally, please ensure that you provide code only with correct syntax, and
+you apply the patch on the relevant files (together with their path that you can try to find out from the github issue). Don't change the names of existing
+functions or classes, as they may be referenced from other code.
+Please respond only with a single patch file in the following format without adding any additional context or string.
🧰 Tools
🪛 LanguageTool
[style] ~3-~3: Consider using a different verb to strengthen your wording.
Context: ...y using git apply. Additionally, please make sure that you provide code only with correct...(MAKE_SURE_ENSURE)
[uncategorized] ~3-~3: Use a comma before ‘and’ if it connects two independent clauses (unless they are closely connected and short).
Context: ...ou provide code only with correct syntax and you apply the patch on the relevant fil...(COMMA_COMPOUND_SENTENCE)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (29)
cognee/api/v1/cognify/code_graph_pipeline.py (2 hunks)
cognee/infrastructure/llm/prompts/patch_gen_kg_instructions.txt (1 hunks)
cognee/modules/chunking/TextChunker.py (2 hunks)
cognee/modules/chunking/models/DocumentChunk.py (1 hunks)
cognee/modules/cognify/config.py (1 hunks)
cognee/modules/data/processing/document_types/AudioDocument.py (2 hunks)
cognee/modules/data/processing/document_types/Document.py (2 hunks)
cognee/modules/data/processing/document_types/ImageDocument.py (2 hunks)
cognee/modules/data/processing/document_types/PdfDocument.py (2 hunks)
cognee/modules/data/processing/document_types/TextDocument.py (2 hunks)
cognee/modules/data/processing/document_types/UnstructuredDocument.py (2 hunks)
cognee/modules/engine/models/Entity.py (1 hunks)
cognee/modules/engine/models/EntityType.py (1 hunks)
cognee/modules/retrieval/description_to_codepart_search.py (5 hunks)
cognee/shared/CodeGraphEntities.py (3 hunks)
cognee/shared/data_models.py (3 hunks)
cognee/tasks/chunks/chunk_by_paragraph.py (5 hunks)
cognee/tasks/documents/extract_chunks_from_documents.py (1 hunks)
cognee/tasks/repo_processor/expand_dependency_graph.py (1 hunks)
cognee/tasks/repo_processor/extract_code_parts.py (0 hunks)
cognee/tasks/repo_processor/get_local_dependencies.py (0 hunks)
cognee/tasks/repo_processor/get_non_code_files.py (1 hunks)
cognee/tasks/repo_processor/get_source_code_chunks.py (4 hunks)
cognee/tasks/summarization/models.py (1 hunks)
cognee/tests/integration/documents/UnstructuredDocument_test.py (2 hunks)
cognee/tests/unit/processing/chunks/chunk_by_paragraph_2_test.py (2 hunks)
cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py (1 hunks)
evals/eval_swe_bench.py (5 hunks)
examples/python/code_graph_example.py (2 hunks)
💤 Files with no reviewable changes (2)
- cognee/tasks/repo_processor/extract_code_parts.py
- cognee/tasks/repo_processor/get_local_dependencies.py
🧰 Additional context used
📓 Learnings (2)
cognee/tests/unit/processing/chunks/chunk_by_paragraph_2_test.py (1)
Learnt from: 0xideas
PR: topoteretes/cognee#205
File: cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py:7-7
Timestamp: 2024-11-13T14:55:05.912Z
Learning: When changes are made to the chunking implementation in `cognee/tasks/chunks`, the ground truth values in the corresponding tests in `cognee/tests/unit/processing/chunks` need to be updated accordingly.
cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py (1)
Learnt from: 0xideas
PR: topoteretes/cognee#205
File: cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py:7-7
Timestamp: 2024-11-13T14:55:05.912Z
Learning: When changes are made to the chunking implementation in `cognee/tasks/chunks`, the ground truth values in the corresponding tests in `cognee/tests/unit/processing/chunks` need to be updated accordingly.
🪛 LanguageTool
cognee/infrastructure/llm/prompts/patch_gen_kg_instructions.txt
[style] ~3-~3: Consider using a different verb to strengthen your wording.
Context: ...y using git apply. Additionally, please make sure that you provide code only with correct...
(MAKE_SURE_ENSURE)
[uncategorized] ~3-~3: Use a comma before ‘and’ if it connects two independent clauses (unless they are closely connected and short).
Context: ...ou provide code only with correct syntax and you apply the patch on the relevant fil...
(COMMA_COMPOUND_SENTENCE)
🔇 Additional comments (13)
cognee/modules/chunking/models/DocumentChunk.py (1)
15-15
: LGTM!The addition of the
pydantic_type
attribute is consistent with other model classes and follows the established pattern.cognee/tasks/summarization/models.py (1)
20-20
: LGTM!The addition of the
pydantic_type
attribute is consistent with other model classes and follows the established pattern.cognee/tests/unit/processing/chunks/chunk_by_paragraph_2_test.py (1)
30-34
: LGTM! Good improvement in code readability.Converting positional arguments to keyword arguments makes the code more maintainable and self-documenting.
Also applies to: 49-51
cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py (2)
52-54
: LGTM! Consistent with other test file changes.Converting to keyword arguments improves code clarity.
Line range hint
3-41
: Verify ground truth values.Based on the retrieved learning, ground truth values need to be verified when chunking implementation changes. Please ensure the
GROUND_TRUTH
dictionary values are still valid with the updated chunking implementation.✅ Verification successful
Ground truth values are still valid
The chunking implementation maintains the same behavior regarding paragraph boundaries, sentence preservation, and cut types. The test's ground truth values correctly reflect the current implementation's output format and behavior.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Check if the chunking implementation has changed # Look for changes in the chunk_by_paragraph implementation rg -A 5 "def chunk_by_paragraph" cognee/tasks/chunks/Length of output: 498
Script:
#!/bin/bash # Get the full implementation of chunk_by_paragraph ast-grep --pattern 'def chunk_by_paragraph($$$) { $$$ }' # Look for related utility functions that might affect chunking rg -A 5 "def (get_word_count|get_cut_type|split_paragraph)" cognee/tasks/chunks/Length of output: 143
Script:
#!/bin/bash # Get the chunk_by_paragraph implementation with more context rg -A 20 "^def chunk_by_paragraph" cognee/tasks/chunks/ # Look for related imports and functions rg "from|import|def" cognee/tasks/chunks/chunk_by_paragraph.pyLength of output: 1941
Script:
#!/bin/bash # Get more context from the chunk_by_paragraph implementation rg -A 50 "^def chunk_by_paragraph" cognee/tasks/chunks/ # Look for chunk_by_sentence implementation rg -A 20 "^def chunk_by_sentence" cognee/tasks/chunks/Length of output: 5696
cognee/api/v1/cognify/code_graph_pipeline.py (2)
74-74
: LGTM! Good addition of token limit control.Adding max_tokens parameter to extract_chunks_from_documents task provides better control over chunk sizes.
63-63
: Verify impact of removing embedding_model parameter.The embedding_model parameter has been removed from the get_source_code_chunks task. Please ensure this doesn't affect the chunk generation functionality.
✅ Verification successful
Removal of embedding_model parameter is correct
The parameter was unused in the main function, and the embedding model configuration is properly handled by the helper function
get_source_code_chunks_from_code_part
with a default value of "text-embedding-3-large".🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Check if get_source_code_chunks uses embedding_model internally ast-grep --pattern 'def get_source_code_chunks($$$) { $$$ embedding_model $$$ }'Length of output: 82
Script:
#!/bin/bash # Find the implementation of get_source_code_chunks ast-grep --pattern 'def get_source_code_chunks' # Backup search using ripgrep for better context rg "def get_source_code_chunks" -A 10Length of output: 2221
cognee/tests/integration/documents/UnstructuredDocument_test.py (1)
Line range hint
71-103
: Add test cases for the new max_tokens
parameter.The test cases verify word counts and text lengths but don't cover the new
max_tokens
parameter functionality. Consider adding test cases that:
- Verify chunking behavior with different
max_tokens
values
- Test edge cases (e.g., very small token limits)
- Validate the interaction between
chunk_size
and max_tokens parameters (see the sketch below)
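One possible shape for such a test, as a sketch only (the fixture name, the read() call signature, and the token_count attribute are assumptions based on this PR's description, not verified against the test suite):

def test_read_respects_max_tokens(unstructured_document):
    # Hypothetical check: with a small token budget, every produced chunk
    # should stay at or below that budget.
    max_tokens = 16
    chunks = list(
        unstructured_document.read(
            chunk_size=1024, chunker="text_chunker", max_tokens=max_tokens
        )
    )
    assert len(chunks) > 0
    for chunk in chunks:
        assert chunk.token_count <= max_tokens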
Line range hint
130-148
: Uncomment and parameterize evaluation code.The evaluation code is commented out but seems important for the feature. Consider:
- Uncommenting the code
- Making the evaluation optional via command-line argument
- Adding error handling for the subprocess call
cognee/tasks/repo_processor/get_source_code_chunks.py (2)
130-132
: LGTM! Good refactoring to use vector_engine.The change to retrieve the embedding model from vector_engine centralizes the configuration and makes the code more maintainable.
157-157
: Verify the impact of removing the embedding_model parameter.The removal of the embedding_model parameter might affect existing callers of these functions.
Also applies to: 172-172
✅ Verification successful
The removal of embedding_model parameter appears safe
The codebase analysis shows no usage of the
embedding_model
parameter in any calls toget_source_code_chunks
. The function is only used as a Task in the pipeline system and all existing callers are compatible with the new signature.🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Search for calls to get_source_code_chunks with embedding_model parameter rg -A 2 "get_source_code_chunks\([^)]*embedding_model"Length of output: 56
Script:
#!/bin/bash # Find all occurrences of get_source_code_chunks echo "=== All usages of get_source_code_chunks ===" rg "get_source_code_chunks" -A 2 # Find function definition echo -e "\n=== Function definition ===" ast-grep --pattern 'def get_source_code_chunks($$$)' # Check test files specifically echo -e "\n=== Test file usages ===" fd "test.*\.py" --exec rg "get_source_code_chunks" -A 2 {}Length of output: 1946
cognee/shared/data_models.py (1)
234-234
: LGTM! Good addition of type identification.Adding pydantic_type to the models improves type safety during serialization/deserialization. The consistent naming pattern across all models is appreciated.
Also applies to: 243-243, 251-251, 262-262
cognee/tasks/repo_processor/get_non_code_files.py (1)
127-130
: LGTM! Clean implementation of should_process.The function effectively combines the ALLOWED_EXTENSIONS check with IGNORED_PATTERNS filtering.
system_prompt="You are a Senior Software Engineer, summarize the context from documents"
f" in a way that it is gonna be provided next to codeparts as context"
f" while trying to solve this github issue connected to the project: {query}]",
Fix the syntax error and improve the system prompt
There is an unmatched closing bracket ] at the end of the system_prompt string, which will cause a syntax error. Additionally, consider rephrasing the prompt for clarity and grammatical correctness.
Apply this diff to fix the syntax error:
- f" while trying to solve this github issue connected to the project: {query}]",
+ f" while trying to solve this GitHub issue connected to the project: {query}",
Revised system_prompt for clarity:
system_prompt = (
"You are a Senior Software Engineer. "
"Summarize the context from the documents so it can be provided "
"as context alongside code parts while trying to solve "
f"the following GitHub issue connected to the project: {query}"
)
async def code_description_to_code_part(
    query: str, user: User, top_k: int, include_docs: bool
) -> List[str]:
Ensure the return type matches the function's annotation
The function code_description_to_code_part is annotated to return List[str], but it currently returns a single string context. This mismatch can lead to errors for callers expecting a list. Please adjust the return type or modify the implementation to ensure consistency.
Modify the implementation to return a list of code pieces:
code_pieces = [code_piece.get_attribute("source_code") for code_piece in code_pieces_to_return]
if include_docs:
code_pieces.insert(0, context_from_documents)
return code_pieces
Also applies to: 130-137
cognee/modules/cognify/config.py
Outdated
class CognifyConfig(BaseSettings):
    classification_model: object = DefaultContentPrediction
    summarization_model: object = SummarizedContent

    max_tokens: Optional[int] = os.getenv("MAX_TOKENS")
Convert MAX_TOKENS environment variable to int.
The environment variable needs to be converted to an integer when it's present.
- max_tokens: Optional[int] = os.getenv("MAX_TOKENS")
+ max_tokens: Optional[int] = int(os.getenv("MAX_TOKENS")) if os.getenv("MAX_TOKENS") is not None else None
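If the config stays on pydantic-settings, a field validator is another way to express this. The following is only a sketch under that assumption (pydantic v2 style), not code from the PR:

from typing import Optional

from pydantic import field_validator
from pydantic_settings import BaseSettings


class CognifyConfig(BaseSettings):
    # pydantic-settings maps this field to the MAX_TOKENS environment variable.
    max_tokens: Optional[int] = None

    @field_validator("max_tokens", mode="before")
    @classmethod
    def _empty_env_as_none(cls, value):
        # Treat an unset or empty MAX_TOKENS as "no limit"; otherwise coerce to int.
        if value in (None, ""):
            return None
        return int(value)

Compared with calling int(os.getenv(...)) at class-definition time, this keeps the parsing inside the settings model and also tolerates an empty MAX_TOKENS value.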
tokenizer = tiktoken.encoding_for_model(embedding_model)
token_count = len(tokenizer.encode(sentence))
🛠️ Refactor suggestion
Move tokenizer initialization outside the loop for better performance.
Creating a new tokenizer instance for each sentence is inefficient. Consider initializing it once before the loop.
+tokenizer = tiktoken.encoding_for_model(embedding_model)
for paragraph_id, sentence, word_count, end_type in chunk_by_sentence(
data, maximum_length=paragraph_length
):
- tokenizer = tiktoken.encoding_for_model(embedding_model)
token_count = len(tokenizer.encode(sentence))
Committable suggestion skipped: line range outside the PR's diff.
repo_path = "/Users/laszlohajdu/Documents/GitHub/graph_rag/"
async for result in run_code_graph_pipeline(repo_path, include_docs=include_docs):
Remove hardcoded repository path.
The hardcoded repository path /Users/laszlohajdu/Documents/GitHub/graph_rag/
should be parameterized to make the code more portable and configurable.
-repo_path = "/Users/laszlohajdu/Documents/GitHub/graph_rag/"
+repo_path = os.getenv("REPO_PATH") or os.path.join(os.path.dirname(__file__), "..", "graph_rag")
Committable suggestion skipped: line range outside the PR's diff.
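Since the example already parses command-line arguments, exposing the path as a flag is another option. This is a sketch only; the flag name and default are illustrative rather than taken from the PR:

import argparse
import os


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--repo-path",
        type=str,
        default=os.getenv("REPO_PATH", "."),
        help="Path to the repository to run the code graph pipeline on",
    )
    return parser.parse_args()


# Usage sketch: args = parse_args(); run_code_graph_pipeline(args.repo_path, ...)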
ALLOWED_EXTENSIONS = {
    ".txt", ".md", ".csv", ".json", ".xml", ".yaml", ".yml", ".html", ".css",
    ".js", ".ts", ".jsx", ".tsx", ".sql", ".log", ".ini", ".toml", ".properties",
    ".sh", ".bash", ".dockerfile", ".gitignore", ".gitattributes", ".makefile",
    ".pyproject", ".requirements", ".env", ".pdf", ".doc", ".docx", ".dot", ".dotx",
    ".rtf", ".wps", ".wpd", ".odt", ".ott", ".ottx", ".txt", ".wp", ".sdw", ".sdx",
    ".docm", ".dotm",
    # Additional extensions for other programming languages
    ".java", ".c", ".cpp", ".h", ".cs", ".go", ".php", ".rb", ".swift", ".pl",
    ".lua", ".rs", ".scala", ".kt", ".sh", ".sql", ".v", ".asm", ".pas", ".d",
    ".ml", ".clj", ".cljs", ".erl", ".ex", ".exs", ".f", ".fs", ".r", ".pyi",
    ".pdb", ".ipynb", ".rmd", ".cabal", ".hs", ".nim", ".vhdl", ".verilog",
    ".svelte", ".html", ".css", ".scss", ".less", ".json5", ".yaml", ".yml",
}
🛠️ Refactor suggestion
Remove duplicate extensions and organize the list.
The ALLOWED_EXTENSIONS set contains several duplicate entries:
- .txt (lines 33, 71)
- .html (lines 40, 117)
- .css (lines 41, 118)
- .yaml (lines 38, 122)
- .yml (lines 39, 123)
- .sql (lines 46, 93)
- .sh (lines 51, 92)
Consider organizing the extensions by categories (documentation, source code, config files, etc.) using comments.
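A sketch of what that reorganization could look like (abridged; the category labels are illustrative, and the full extension list from above would be carried over, minus the duplicates):

ALLOWED_EXTENSIONS = {
    # Documentation and data files
    ".txt", ".md", ".csv", ".json", ".xml", ".pdf", ".doc", ".docx", ".rtf", ".odt",
    # Configuration files
    ".yaml", ".yml", ".ini", ".toml", ".properties", ".env", ".json5",
    # Scripts and build files
    ".sh", ".bash", ".dockerfile", ".makefile", ".gitignore", ".gitattributes",
    # Source code
    ".java", ".c", ".cpp", ".h", ".cs", ".go", ".php", ".rb", ".rs", ".swift",
    ".js", ".ts", ".jsx", ".tsx", ".html", ".css", ".scss", ".less", ".sql",
}

Keeping each extension in exactly one category makes duplicates easy to spot during review.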
Actionable comments posted: 2
🔭 Outside diff range comments (2)
cognee/modules/retrieval/description_to_codepart_search.py (2)
Line range hint 29-44: Update the docstring to match the implementation.
The docstring indicates a return type of Set[str] but the function returns a single concatenated string. Update the docstring to reflect the actual return type:
- Returns:
-     Set[str]: A set of unique code parts matching the query.
+ Returns:
+     str: A concatenated string of code parts and optional document context.
🧰 Tools
🪛 GitHub Actions: ruff format
[warning] File requires formatting with Ruff formatter
Line range hint 1-142: Fix formatting issues.
The file requires formatting with the Ruff formatter according to the pipeline failures. Run the following command to fix formatting:
ruff format cognee/modules/retrieval/description_to_codepart_search.py
🧰 Tools
🪛 GitHub Actions: ruff format
[warning] File requires formatting with Ruff formatter
♻️ Duplicate comments (1)
cognee/modules/retrieval/description_to_codepart_search.py (1)
80-82: ⚠️ Potential issue: Fix the syntax error in the system prompt.
There is an unmatched closing bracket ] at the end of the system_prompt string.
🧰 Tools
🪛 GitHub Actions: ruff format
[warning] File requires formatting with Ruff formatter
🧹 Nitpick comments (1)
cognee/modules/retrieval/description_to_codepart_search.py (1)
117-126: Refactor node traversal logic for better maintainability.
The nested conditional blocks for different pydantic_type values make the code hard to maintain. Consider extracting this logic into separate methods:
+ def _process_source_code_chunk(code_file, code_pieces):
+     for edge in code_file.get_skeleton_edges():
+         if edge.get_attribute("relationship_name") == "code_chunk_of":
+             code_pieces.add(edge.get_destination_node())
+
+ def _process_code_file(code_file, code_pieces):
+     for edge in code_file.get_skeleton_edges():
+         if edge.get_attribute("relationship_name") == "contains":
+             code_pieces.add(edge.get_destination_node())
+
for node in code_summaries:
    # ... existing code ...
    for code_file in node_to_search_from.get_skeleton_neighbours():
-        if code_file.get_attribute("pydantic_type") == "SourceCodeChunk":
-            for code_file_edge in code_file.get_skeleton_edges():
-                if code_file_edge.get_attribute("relationship_name") == "code_chunk_of":
-                    code_pieces_to_return.add(code_file_edge.get_destination_node())
-        elif code_file.get_attribute("pydantic_type") == "CodePart":
+        pydantic_type = code_file.get_attribute("pydantic_type")
+        if pydantic_type == "SourceCodeChunk":
+            _process_source_code_chunk(code_file, code_pieces_to_return)
+        elif pydantic_type == "CodePart":
            code_pieces_to_return.add(code_file)
-        elif code_file.get_attribute("pydantic_type") == "CodeFile":
-            for code_file_edge in code_file.get_skeleton_edges():
-                if code_file_edge.get_attribute("relationship_name") == "contains":
-                    code_pieces_to_return.add(code_file_edge.get_destination_node())
+        elif pydantic_type == "CodeFile":
+            _process_code_file(code_file, code_pieces_to_return)
+        else:
+            logging.warning("Unknown pydantic_type: %s", pydantic_type)
🧰 Tools
🪛 GitHub Actions: ruff format
[warning] File requires formatting with Ruff formatter
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
cognee/modules/retrieval/description_to_codepart_search.py
(4 hunks)
🧰 Additional context used
🪛 GitHub Actions: ruff format
cognee/modules/retrieval/description_to_codepart_search.py
[warning] File requires formatting with Ruff formatter
if include_docs:
    search_results = await search(SearchType.INSIGHTS, query_text=query)

    concatenated_descriptions = " ".join(
        obj["description"]
        for tpl in search_results
        for obj in tpl
        if isinstance(obj, dict) and "description" in obj
    )
🛠️ Refactor suggestion
Add error handling for search results
The search results processing lacks error handling and validation. Consider adding checks for empty results and proper error handling.
if include_docs:
search_results = await search(SearchType.INSIGHTS, query_text=query)
+ if not search_results:
+ logging.warning("No document insights found for query: '%s'", query)
+ context_from_documents = ""
+ return context_from_documents
concatenated_descriptions = " ".join(
obj["description"]
for tpl in search_results
for obj in tpl
if isinstance(obj, dict) and "description" in obj
)
Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 GitHub Actions: ruff format
[warning] File requires formatting with Ruff formatter
context = ""
for code_piece in code_pieces_to_return:
    context = context + code_piece.get_attribute("source_code")

if include_docs:
    context = context_from_documents + context

return context
🛠️ Refactor suggestion
Improve code concatenation for better memory efficiency and readability
The current string concatenation approach could be inefficient for large code bases and lacks separation between different code pieces.
- context = ""
- for code_piece in code_pieces_to_return:
- context = context + code_piece.get_attribute("source_code")
+ code_pieces_content = [
+ code_piece.get_attribute("source_code")
+ for code_piece in code_pieces_to_return
+ ]
+
+ # Add separator between code pieces for better readability
+ context = "\n\n---\n\n".join(code_pieces_content)
if include_docs:
- context = context_from_documents + context
+ context = f"{context_from_documents}\n\n---\n\n{context}"
🧰 Tools
🪛 GitHub Actions: ruff format
[warning] File requires formatting with Ruff formatter
Summary by CodeRabbit
Based on the comprehensive summary, here are the updated release notes:
Configuration: New optional max_tokens configuration to control token limits during document processing.
Document Processing
Code Retrieval
Type Identification: Added pydantic_type attributes to multiple data models for improved type tracking.
Performance
These updates provide more flexible document processing, enhanced code retrieval, and improved type management across the system.