Cog 656 deployment state #368
Conversation
Add ability to send data directories to cognee Feature COG-656
Resolve issue with UUID concat by casting to string Fix COG-656
Remove code comments that are not needed Chore COG-656
Added directory resolution as step in cognee add function Feature COG-656
Add support for text data to resolving data directory task Fix COG-656
…nto COG-656-deployment-state
Add resolving of directories as task for the add pipeline Feature COG-656
No need to handle different data types in resolving directories, focus on just handling case when it's a directory Fix COG-656
Rewrote endpoint which adds users to groups Fix COG-656
Resolve issue with adding permissions to groups Fix COG-656
… permission already given to group Added error handling in case permission is already given to group and user is already part of group Feature COG-656
Verify user has access to data before returning it Feature COG-656
Add compute search to cognee which makes searches human readable Feature COG-656
Walkthrough
The changes in this pull request involve multiple modifications across various files in the Cognee API. Key updates include the introduction of a new asynchronous task, resolve_data_directories, in the ingestion pipeline.
Add simple instruction for system prompt Feature COG-656
Actionable comments posted: 10
🧹 Outside diff range and nitpick comments (8)
cognee/tasks/compute/exceptions/__init__.py (1)
1-5: Enhance module docstring with more details
Consider expanding the docstring to include:
- List of available exceptions
- Example usage
- Common scenarios where these exceptions are raised
Example enhancement:
""" Custom exceptions for the Cognee API. This module defines a set of exceptions for handling various compute errors + + Available Exceptions: + - NoRelevantDataFound: Raised when search operations yield no results + + Example Usage: + from cognee.tasks.compute.exceptions import NoRelevantDataFound + if not results: + raise NoRelevantDataFound """cognee/tasks/compute/exceptions/exceptions.py (1)
4-11: LGTM! Consider adding type hints for status_code
The exception class is well-structured with appropriate default values. Consider adding type hints for the status_code parameter for consistency.
 def __init__(
     self,
     message: str = "Search did not find any data.",
     name: str = "NoRelevantDataFound",
-    status_code=status.HTTP_404_NOT_FOUND,
+    status_code: int = status.HTTP_404_NOT_FOUND,
 ):

cognee/tasks/compute/query_compute.py (3)
17-17: Consider making the chunk limit configurable
Hard-coding limit = 1 might be too restrictive. Consider making this a parameter with a reasonable default value to allow for more comprehensive context when needed.

- found_chunks = await vector_engine.search("document_chunk_text", query, limit = 1)
+ found_chunks = await vector_engine.search(
+     "document_chunk_text",
+     query,
+     limit=config.DEFAULT_CHUNK_LIMIT
+ )
26-27: Consider externalizing prompt template paths
The hardcoded template paths should be moved to a configuration file for better maintainability.

- user_prompt = render_prompt("context_for_question.txt", args)
- system_prompt = read_query_prompt("answer_simple_question.txt")
+ user_prompt = render_prompt(config.CONTEXT_QUESTION_TEMPLATE, args)
+ system_prompt = read_query_prompt(config.ANSWER_QUESTION_TEMPLATE)
7-7: Consider using a structured response model
Using str as the response model might be too simple. Consider defining a structured response model for better type safety and validation. Example:

from pydantic import BaseModel

class ComputeResponse(BaseModel):
    answer: str
    confidence: float
    source_chunks: list[str]

async def query_compute(query: str) -> list[ComputeResponse]:
    # ... existing code ...
    return [ComputeResponse(
        answer=computed_answer,
        confidence=0.95,  # from LLM if available
        source_chunks=[chunk.payload["text"] for chunk in found_chunks]
    )]

Also applies to: 36-36
cognee/api/v1/search/search_v2.py (1)
Line range hint 17-55: Implementation looks good with some architectural considerations
The integration of the COMPUTE search type follows the existing patterns and maintains important cross-cutting concerns:
- User permission filtering
- Query logging
- Telemetry
- Error handling
Consider documenting the following aspects:
- Expected response format from query_compute to maintain consistency
- Performance characteristics of compute operations
- Rate limiting requirements if compute operations are resource-intensive
cognee/tasks/ingestion/resolve_data_directories.py (2)
4-14: Documentation looks good, but consider adding more details
The function documentation is clear but could benefit from:
- Example usage
- Possible exceptions that might be raised
- Limitations on directory size/depth
19-37: Consider adding file filtering and progress tracking
For better control and monitoring:
- Add file extension filtering
- Add progress tracking for large directories
- Consider implementing batch processing for memory efficiency
Here's a suggested implementation:
+from typing import Optional, Set
+
 async def resolve_data_directories(
     data: Union[BinaryIO, List[BinaryIO], str, List[str]],
-    include_subdirectories: bool = True
+    include_subdirectories: bool = True,
+    allowed_extensions: Optional[Set[str]] = None,
+    batch_size: int = 1000
 ):
     # ... existing docstring ...
     resolved_data = []
+    processed_count = 0
     for item in data:
         if isinstance(item, str):  # Check if the item is a path
             if os.path.isdir(item):  # If it's a directory
                 if include_subdirectories:
                     for root, _, files in os.walk(item):
-                        resolved_data.extend([os.path.join(root, f) for f in files])
+                        for f in files:
+                            if processed_count >= batch_size:
+                                yield resolved_data
+                                resolved_data = []
+                                processed_count = 0
+
+                            file_path = os.path.join(root, f)
+                            if allowed_extensions and not any(
+                                file_path.lower().endswith(ext.lower())
+                                for ext in allowed_extensions
+                            ):
+                                continue
+
+                            resolved_data.append(file_path)
+                            processed_count += 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (14)
- cognee/api/v1/add/add_v2.py (2 hunks)
- cognee/api/v1/datasets/routers/get_datasets_router.py (3 hunks)
- cognee/api/v1/permissions/routers/get_permissions_router.py (1 hunks)
- cognee/api/v1/search/search_v2.py (2 hunks)
- cognee/infrastructure/llm/prompts/answer_simple_question.txt (1 hunks)
- cognee/modules/data/methods/get_data.py (2 hunks)
- cognee/modules/users/models/GroupPermission.py (1 hunks)
- cognee/modules/users/models/__init__.py (1 hunks)
- cognee/tasks/compute/__init__.py (1 hunks)
- cognee/tasks/compute/exceptions/__init__.py (1 hunks)
- cognee/tasks/compute/exceptions/exceptions.py (1 hunks)
- cognee/tasks/compute/query_compute.py (1 hunks)
- cognee/tasks/ingestion/__init__.py (1 hunks)
- cognee/tasks/ingestion/resolve_data_directories.py (1 hunks)
✅ Files skipped from review due to trivial changes (4)
- cognee/tasks/ingestion/__init__.py
- cognee/infrastructure/llm/prompts/answer_simple_question.txt
- cognee/tasks/compute/__init__.py
- cognee/modules/users/models/__init__.py
🧰 Additional context used
🪛 Ruff (0.8.2)
cognee/api/v1/permissions/routers/get_permissions_router.py
36-36: Local variable e is assigned to but never used. Remove assignment to unused variable e (F841)
57-57: Local variable e is assigned to but never used. Remove assignment to unused variable e (F841)
🔇 Additional comments (10)
cognee/tasks/compute/query_compute.py (1)
1-5
: Verify vector engine and LLM client imports
Let's ensure these infrastructure components are properly initialized and configured.
✅ Verification successful
Vector engine and LLM client imports are properly used
The imports are correctly used throughout the codebase:
get_vector_engine()
is properly initialized and used in multiple tasks for searching, indexing, and testing.get_llm_client()
is correctly utilized for text generation, transcription, and structured output creation.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check for proper initialization of dependencies
rg -A 5 "get_vector_engine\(\)" --type py
rg -A 5 "get_llm_client\(\)" --type py
# Check for configuration patterns
rg -l "DEFAULT_CHUNK_LIMIT|CONTEXT_QUESTION_TEMPLATE|ANSWER_QUESTION_TEMPLATE" --type py
Length of output: 17927
cognee/api/v1/search/search_v2.py (3)
23-23
: Verify existing SearchType enum usage
The addition of a new enum value could affect existing code that handles SearchType values.
✅ Verification successful
Let me analyze the search results. The code shows that:
- The SearchType enum is used in a dictionary mapping in search_v2.py, where all enum values are handled exhaustively in the search_tasks dictionary
- The new COMPUTE value is already properly mapped to the query_compute function in this dictionary
- There are no direct if/else comparisons with SearchType values in the codebase
- All usages of SearchType in the codebase are through the search function parameter, passing one of the enum values
- The implementation follows the existing pattern where each search type maps to a specific query function
The addition of COMPUTE search type is properly integrated
The new enum value is safely added with its corresponding query handler and follows the existing pattern. The code uses a dictionary-based dispatch pattern rather than if/else statements, making it resilient to enum additions.
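The dictionary-based dispatch described above can be sketched as follows. This is an illustrative stand-in, not the PR's actual code: the handler body and the search wrapper are simplified assumptions, while the enum member names mirror the review.

```python
import asyncio
from enum import Enum

# Enum members mirror the PR; handler bodies are stand-ins for illustration.
class SearchType(Enum):
    SUMMARIES = "SUMMARIES"
    INSIGHTS = "INSIGHTS"
    CHUNKS = "CHUNKS"
    COMPUTE = "COMPUTE"

async def query_compute(query: str) -> list:
    # Stand-in handler; the real one queries the vector engine and an LLM.
    return [f"answer to: {query}"]

search_tasks = {
    SearchType.COMPUTE: query_compute,
    # ...the remaining SearchType members map to their own query functions
}

async def search(search_type: SearchType, query: str) -> list:
    # Dictionary lookup replaces an if/else chain, so adding a new enum
    # member only requires registering one more handler.
    handler = search_tasks.get(search_type)
    if handler is None:
        raise ValueError(f"No handler registered for {search_type}")
    return await handler(query)

results = asyncio.run(search(SearchType.COMPUTE, "what is cognee?"))
```

Because unknown members fail fast at lookup time, this pattern stays resilient as new search types are added.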
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for exhaustive SearchType handling in switch/if statements
# Look for switch/if statements handling SearchType
rg -B 2 -A 5 "SearchType\." --type py
# Look for explicit enum value checks
ast-grep --pattern 'if $_ == SearchType.$_:'
Length of output: 15497
55-55: Verify query_compute signature and error handling
Please ensure that the query_compute function follows the same contract as other search tasks:
- It should be an async function
- It should accept a query string parameter
- It should return a list of results
- It should handle errors consistently with other search tasks
✅ Verification successful
Based on the search results, I can now generate the final response, as we have found the implementation details of query_compute and other query functions for comparison.
Function signature and error handling are correctly implemented
The query_compute function follows the same contract as other search tasks in the codebase:
- It's an async function
- It accepts a query string parameter with proper type annotation
- It returns a list as specified in the return type annotation
- It follows the same pattern as other query functions (query_summaries, query_chunks, query_graph_connections) by:
- Using the vector engine
- Having proper docstring documentation
- Following the same parameter and return type structure
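The contract verified above (async, string in, list out, raising when nothing matches) can be sketched like this. The vector search and the exception are toy stand-ins, assumed for illustration; the real task uses get_vector_engine() and get_llm_client().

```python
import asyncio

class NoRelevantDataFound(Exception):
    """Stand-in for the PR's exception; raised when no chunks match."""

async def fake_vector_search(collection: str, query: str, limit: int = 1) -> list:
    # Stand-in for vector_engine.search(); serves one canned chunk,
    # or nothing when the toy corpus has no match.
    if "cognee" not in query.lower():
        return []
    return [{"text": "cognee stores knowledge graphs"}][:limit]

async def query_compute(query: str) -> list:
    # Mirrors the verified contract: async, accepts a query string,
    # returns a list, and raises when the search comes back empty.
    found_chunks = await fake_vector_search("document_chunk_text", query, limit=1)
    if len(found_chunks) == 0:
        raise NoRelevantDataFound()
    # The real task renders prompts and calls the LLM here; we echo instead.
    return [f"answer derived from: {found_chunks[0]['text']}"]

answer = asyncio.run(query_compute("what does cognee store?"))
```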
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify query_compute implementation details
# Check function signature and error handling
ast-grep --pattern 'async def query_compute($$$) {
$$$
}'
# Look for error handling patterns
rg -B 2 -A 5 "raise.*Error" "cognee/tasks/compute"
Length of output: 112
Script:
#!/bin/bash
# Let's try a different approach to locate and analyze the query_compute function
# First, let's find where query_compute is defined
rg -l "def query_compute"
# Then get its implementation with context
rg -B 2 -A 10 "def query_compute"
# Also search for other query functions to compare signatures
rg -B 2 -A 10 "def query_\w+"
# Look for error handling patterns in the entire codebase
rg -B 2 -A 5 "raise.*Error" --type python
Length of output: 6513
17-17
: Verify the query_compute module structure
Let's ensure the compute module is properly structured and exported.
✅ Verification successful
Import of query_compute module is correctly structured
The module is properly implemented and exported:
- The query_compute function exists in cognee/tasks/compute/query_compute.py
- It's correctly exported in cognee/tasks/compute/__init__.py
- The import path used in search_v2.py matches the actual implementation
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the compute module structure and exports
# Check if the module exists and has proper exports
fd "^(query_compute|__init__)\.py$" -p "cognee/tasks/compute"
# Check the module's interface
rg -A 5 "^def query_compute" "cognee/tasks/compute"
Length of output: 113
Script:
#!/bin/bash
# Let's try a broader search to locate the query_compute module
# First, find all Python files in the tasks directory
echo "=== Python files in tasks directory ==="
fd ".py$" "cognee/tasks"
# Then, search for any mentions of query_compute
echo -e "\n=== Mentions of query_compute ==="
rg "query_compute" --type py
# Check if there's a compute directory or file anywhere in the codebase
echo -e "\n=== Files/directories with 'compute' ==="
fd "compute"
Length of output: 3296
Script:
#!/bin/bash
# Let's verify the implementation and exports of query_compute
echo "=== query_compute implementation ==="
cat cognee/tasks/compute/query_compute.py
echo -e "\n=== __init__.py exports ==="
cat cognee/tasks/compute/__init__.py
Length of output: 1392
cognee/modules/data/methods/get_data.py (1)
6-6: LGTM: Well-documented parameter addition
The addition of the user_id parameter is properly documented in the function signature and docstring.
Also applies to: 10-10
cognee/api/v1/datasets/routers/get_datasets_router.py (2)
79-79: LGTM: Proper user context propagation
The get_data call correctly includes the user context for permission checking.
Line range hint 144-171: Optimize data retrieval and standardize error handling
The current implementation has several areas for improvement:
- Multiple database calls that could be optimized
- Inconsistent error response formats (JSONResponse vs EntityNotFoundError)
- Redundant data validation
Consider this refactoring:
 @router.get("/{dataset_id}/data/{data_id}/raw", response_class=FileResponse)
 async def get_raw_data(dataset_id: str, data_id: str, user: User = Depends(get_authenticated_user)):
     from cognee.modules.data.methods import get_data
     from cognee.modules.data.methods import get_dataset, get_dataset_data

     # First verify dataset access
     dataset = await get_dataset(user.id, dataset_id)
     if dataset is None:
         raise EntityNotFoundError(message=f"Dataset ({dataset_id}) not found.")

-    dataset_data = await get_dataset_data(dataset.id)
-    if dataset_data is None:
-        raise EntityNotFoundError(message=f"No data found in dataset ({dataset_id}).")
-
-    matching_data = [data for data in dataset_data if str(data.id) == data_id]
-    if len(matching_data) == 0:
-        raise EntityNotFoundError(message=f"Data ({data_id}) not found in dataset ({dataset_id}).")

     # Get and verify data access
     data = await get_data(user.id, data_id)
     if data is None:
         raise EntityNotFoundError(message=f"Data ({data_id}) not found.")

     return data.raw_data_location
This refactoring:
- Removes redundant data validation
- Standardizes on EntityNotFoundError
- Reduces database calls
Let's verify the relationship between datasets and data:
cognee/api/v1/add/add_v2.py (2)
17-17: Consider adding error handling for the new task
The new resolve_data_directories task is added to the pipeline, but there's no explicit error handling if the resolution fails. Consider adding try-catch blocks or error handling mechanisms.
Let's check if there's any error handling in the pipeline implementation:
16-19: Verify pipeline changes impact
The addition of the resolve_data_directories task before ingest_data_with_metadata modifies the existing pipeline. Please ensure this change doesn't break existing functionality.
Let's check for existing usages:
✅ Verification successful
Let me gather more context about the pipeline and task usage patterns.
Let me check the implementation of these specific functions to understand their relationship and dependencies.
Pipeline modification is safe and logically correct
The addition of the resolve_data_directories task before ingest_data_with_metadata is a safe change because:
- ingest_data_with_metadata is only used in this specific pipeline (add_v2.py)
- resolve_data_directories prepares the input by resolving directories into their contained files, which is a logical prerequisite for data ingestion
- The function signatures show that resolve_data_directories accepts the same input types that ingest_data_with_metadata can process (files and binary streams)
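The directory-expansion behavior described in these bullets can be sketched as follows. This is an illustrative simplification, not the PR's code: it returns the full list at once (no batching) and only handles strings, to show why resolution must run before ingestion.

```python
import asyncio
import os
import tempfile
from typing import List, Union

async def resolve_data_directories(
    data: Union[str, List[str]], include_subdirectories: bool = True
) -> List[str]:
    # Expand any directory paths into the files they contain, so the
    # downstream ingestion task only ever sees concrete items.
    items = data if isinstance(data, list) else [data]
    resolved: List[str] = []
    for item in items:
        if isinstance(item, str) and os.path.isdir(item):
            if include_subdirectories:
                for root, _, files in os.walk(item):
                    resolved.extend(os.path.join(root, f) for f in sorted(files))
            else:
                resolved.extend(
                    os.path.join(item, f)
                    for f in sorted(os.listdir(item))
                    if os.path.isfile(os.path.join(item, f))
                )
        else:
            resolved.append(item)  # non-directories pass through untouched
    return resolved

# Demonstrate on a throwaway directory tree
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "a.txt"), "w").close()
    os.makedirs(os.path.join(tmp, "sub"))
    open(os.path.join(tmp, "sub", "b.txt"), "w").close()
    paths = asyncio.run(resolve_data_directories([tmp, "inline text"]))
    names = sorted(os.path.basename(p) for p in paths if p != "inline text")
```

After this step, the pipeline's next task receives only file paths and pass-through items, which matches the "logical prerequisite" argument above.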
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for other pipeline configurations using ingest_data_with_metadata
rg "Task\(ingest_data_with_metadata" --type py
Length of output: 136
Script:
#!/bin/bash
# Search for other pipeline configurations and task definitions
rg "Task\(" -A 3 --type py
# Look for the implementation of ingest_data_with_metadata
ast-grep --pattern 'def ingest_data_with_metadata($$$)'
# Look for resolve_data_directories implementation
ast-grep --pattern 'def resolve_data_directories($$$)'
Length of output: 5718
Script:
#!/bin/bash
# Search for the implementation and usages of both functions with context
rg "def (ingest_data_with_metadata|resolve_data_directories)" -A 5 --type py
# Look for any imports or references to these functions
rg "(ingest_data_with_metadata|resolve_data_directories)" --type py
Length of output: 2086
cognee/api/v1/permissions/routers/get_permissions_router.py (1)
18-18: Validate the permission parameter before database operations
Ensure that the permission string is valid and properly formatted to prevent SQL injection or unexpected behavior.
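One way to act on this suggestion is an allowlist check before the value reaches the database. The permission names below are assumptions for illustration only; the PR does not define the valid set.

```python
import re

# Hypothetical allowlist; the real permission names are not shown in the PR.
ALLOWED_PERMISSIONS = {"read", "write", "delete", "share"}

def validate_permission(permission: str) -> str:
    # Normalize, then reject anything that is not a plain lowercase word.
    normalized = permission.strip().lower()
    if not re.fullmatch(r"[a-z_]+", normalized):
        raise ValueError(f"Malformed permission string: {permission!r}")
    if normalized not in ALLOWED_PERMISSIONS:
        raise ValueError(f"Unknown permission: {permission!r}")
    return normalized
```

Validating against a closed set (rather than only escaping) keeps injection payloads and typos out of the permissions table entirely.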
cognee/api/v1/search/search_v2.py (outdated)

class SearchType(Enum):
    SUMMARIES = "SUMMARIES"
    INSIGHTS = "INSIGHTS"
    CHUNKS = "CHUNKS"
    COMPUTE = "COMPUTE"
I meant completion
like it usually exists in LLM APIs -> https://llm-engine.scale.com/guides/completions/
Will make the change
Raise unauth access error when trying to read data without access Feature COG-656
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (1)
cognee/modules/data/methods/get_data.py (1)
Line range hint 7-16: Enhance docstring with exception documentation
The docstring should document the possible exceptions that can be raised.
Apply this enhancement:
 async def get_data(user_id: UUID, data_id: UUID) -> Optional[Data]:
     """Retrieve data by ID.

     Args:
         user_id (UUID): user ID
         data_id (UUID): ID of the data to retrieve

     Returns:
         Optional[Data]: The requested data object if found, None otherwise
+
+    Raises:
+        UnauthorizedDataAccessError: If the user does not own the requested data
     """
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- cognee/modules/data/exceptions/__init__.py (1 hunks)
- cognee/modules/data/exceptions/exceptions.py (1 hunks)
- cognee/modules/data/methods/get_data.py (2 hunks)
🔇 Additional comments (3)
cognee/modules/data/exceptions/__init__.py (1)
9-9
: LGTM: Clean import addition
The import statement is properly placed and follows Python's standard practices for exception handling.
cognee/modules/data/exceptions/exceptions.py (1)
10-11
: LGTM: Proper super() initialization added
The addition of the super() call ensures proper initialization of the parent class.
cognee/modules/data/methods/get_data.py (1)
22-24
: Consider explicit handling of "not found" case
The current implementation implicitly returns None for not found cases. Consider making this more explicit for better maintainability.
Consider this approach:
- if data and data.owner_id != user_id:
+ if not data:
+ return None
+
+ if data.owner_id != user_id:
raise UnauthorizedDataAccessError(message=f"User {user_id} is not authorized to access data {data_id}")
return data
Rename searching type from compute to completion Refactor COG-656
Actionable comments posted: 3
🧹 Outside diff range and nitpick comments (3)
cognee/tasks/completion/exceptions/__init__.py (1)
7-9: Consider future-proofing the import statement
While the current import is clean, consider using parentheses for imports from the start, as it makes adding future exceptions easier without changing the syntax.
cognee/tasks/completion/exceptions/exceptions.py (1)
4-11: Consider enhancing the error message for better debugging
While the implementation is solid, the default error message could be more informative for debugging purposes.
Consider this enhancement:
 def __init__(
     self,
-    message: str = "Search did not find any data.",
+    message: str = "Search operation completed but no relevant data was found in the index.",
     name: str = "NoRelevantDataFound",
     status_code=status.HTTP_404_NOT_FOUND,
 ):
     super().__init__(message, name, status_code)

Also, consider adding a docstring to the class to document its specific use case and when it should be raised.
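Putting the suggested message and docstring together yields something like the sketch below. The base-class signature is an assumption (the PR does not show it), and stdlib HTTPStatus stands in for the starlette status constants used in the real code.

```python
from http import HTTPStatus

class CogneeApiError(Exception):
    # Assumed base class shape; the PR's actual base class is not shown here.
    def __init__(self, message: str, name: str, status_code: int):
        self.message = message
        self.name = name
        self.status_code = status_code
        super().__init__(message)

class NoRelevantDataFound(CogneeApiError):
    """Raised when a completion search runs successfully but the vector
    index contains no chunks relevant to the query."""

    def __init__(
        self,
        message: str = "Search operation completed but no relevant data was found in the index.",
        name: str = "NoRelevantDataFound",
        status_code: int = HTTPStatus.NOT_FOUND,
    ):
        super().__init__(message, name, status_code)
```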
cognee/tasks/completion/query_completion.py (1)
7-14: Function signature needs improvement
The function signature could be enhanced with:
- Type hints for exceptions that may be raised
- Additional parameters for customization (e.g., limit, threshold)
Consider updating the docstring:
 async def query_completion(query: str) -> list:
     """
+    Computes an answer to a query by searching relevant chunks and using LLM.
+
     Parameters:
     - query (str): The query string to compute.

     Returns:
     - list: Answer to the query.
+
+    Raises:
+    - NoRelevantDataFound: When no relevant chunks are found
     """
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- cognee/api/v1/search/search_v2.py (2 hunks)
- cognee/tasks/completion/__init__.py (1 hunks)
- cognee/tasks/completion/exceptions/__init__.py (1 hunks)
- cognee/tasks/completion/exceptions/exceptions.py (1 hunks)
- cognee/tasks/completion/query_completion.py (1 hunks)
🔇 Additional comments (5)
cognee/tasks/completion/__init__.py (1)
1-1
: LGTM! Clean package-level import.
The import statement follows Python conventions and provides clean package-level access to the query_completion function.
cognee/tasks/completion/exceptions/__init__.py (1)
1-5
: LGTM! Clear and concise module documentation.
The docstring effectively communicates the module's purpose and scope.
cognee/tasks/completion/exceptions/exceptions.py (1)
1-2
: LGTM! Appropriate imports.
The imports are minimal and correctly bring in the required dependencies.
cognee/tasks/completion/query_completion.py (1)
22-28
: Verify prompt template paths
The code assumes the existence of template files but doesn't handle missing templates.
Let's verify the template files exist:
✅ Verification successful
Template files are present and correctly located
Both template files exist at the expected location in cognee/infrastructure/llm/prompts/:
- context_for_question.txt
- answer_simple_question.txt
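If a guard for missing templates were added, it could look like the sketch below. The function name and the prompts directory default are assumptions based on this review comment, not the library's actual API.

```python
import os

# Assumed default; the real prompts directory is resolved by cognee itself.
PROMPTS_DIR = "cognee/infrastructure/llm/prompts"

def read_query_prompt_safe(template_name: str, prompts_dir: str = PROMPTS_DIR) -> str:
    # Fail with a clear error instead of an opaque open() failure
    # when a template file is missing.
    path = os.path.join(prompts_dir, template_name)
    if not os.path.isfile(path):
        raise FileNotFoundError(
            f"Prompt template {template_name!r} not found in {prompts_dir!r}"
        )
    with open(path, "r", encoding="utf-8") as f:
        return f.read()
```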
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check if prompt template files exist
fd -t f "context_for_question.txt|answer_simple_question.txt"
Length of output: 183
cognee/api/v1/search/search_v2.py (1)
23-23: Consider renaming COMPLETION based on previous feedback
Based on the previous review discussion about LLM API conventions, consider a more standard name.
Based on the previous review discussion about LLM API conventions, consider a more standard name.
Consider renaming to align with standard LLM API terminology:
- COMPLETION = "COMPLETION"
+ COMPLETIONS = "COMPLETIONS"
Update typo in string in code Chore COG-656
Summary by CodeRabbit

Release Notes

New Features
- COMPLETION, enhancing search functionality.
- query_completion for processing queries.
- resolve_data_directories to manage data paths.

Enhancements
- UnauthorizedDataAccessError for better security.

Documentation

Chores