Fix dlt for metadata #247
Conversation
- Resolve issue caused by dlt in the `ingest_data_with_metadata` task
- Resolve issue with llama_index type resolution
Walkthrough
The pull request introduces significant modifications to the data ingestion process in `cognee/tasks/ingestion/ingest_data_with_metadata.py`.
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (4)
cognee/tasks/ingestion/ingest_data_with_metadata.py (4)
28-29: Use asynchronous file I/O for better performance in `data_resources`.

Since `data_resources` is an asynchronous function, using asynchronous file operations can improve performance and avoid blocking the event loop. Consider using `aiofiles` for asynchronous file handling:

```python
import aiofiles

async with aiofiles.open(file_path.replace("file://", ""), mode="rb") as file:
    classified_data = ingestion.classify(file)
    # Rest of the code...
```

Ensure that `aiofiles` is added to your project's dependencies.
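As a hedged illustration of that suggestion, one way the non-blocking read could look; `classify_file_async` is a hypothetical name, and this assumes `ingestion.classify` accepts any binary file-like object:

```python
import aiofiles
from io import BytesIO

# Hypothetical helper; assumes `ingestion` (the module already used in
# ingest_data_with_metadata.py) is in scope and that `classify` accepts
# any binary file-like object, not only a real file handle.
async def classify_file_async(file_path: str):
    async with aiofiles.open(file_path.replace("file://", ""), mode="rb") as f:
        content = await f.read()  # aiofiles delegates the blocking read to a thread pool
    return ingestion.classify(BytesIO(content))
```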
Line range hint 41-99: Refactor to eliminate duplication between `data_storing` and `data_resources`.

Both functions process files and extract metadata. To adhere to the DRY (Don't Repeat Yourself) principle, consider extracting the shared logic into a separate function. For example, create a helper function:

```python
async def process_file(file_path: str):
    with open(file_path.replace("file://", ""), mode="rb") as file:
        classified_data = ingestion.classify(file)
        data_id = ingestion.identify(classified_data)
        file_metadata = classified_data.get_metadata()
        return data_id, file_metadata
```

Then update `data_storing` and `data_resources` to use this function.
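A hedged sketch of how both callers might then share the helper; the loop structure and the persistence step are assumptions based on the line ranges cited above:

```python
# Illustrative only: both coroutines delegate classification and
# metadata extraction to the shared process_file helper.
async def data_resources(file_paths):
    for file_path in file_paths:
        data_id, file_metadata = await process_file(file_path)
        yield file_metadata  # one metadata record per file

async def data_storing(file_paths):
    for file_path in file_paths:
        data_id, file_metadata = await process_file(file_path)
        # ...persist data_id and file_metadata to the relational store...
```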
Line range hint 55-99: Add error handling to the `data_storing` function.

Exceptions during file processing or database operations may cause the function to fail silently. Adding try-except blocks with proper logging will enhance robustness. Consider wrapping critical sections with exception handling:

```python
import logging

# Initialize logger
logger = logging.getLogger(__name__)

# Inside the data_storing function
try:
    ...  # existing processing logic
except Exception as e:
    logger.error(f"An error occurred while processing data: {e}")
    # Handle the exception as needed
```

Ensure that the logging configuration is set up appropriately in your application.
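A minimal configuration sketch to go with that; the level and format are illustrative, not taken from the PR:

```python
import logging

# Illustrative startup configuration; an application would typically
# integrate this with its existing logging setup instead.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s: %(message)s",
)
```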
103-125: Simplify the `pipeline.run` calls to reduce code duplication.

The two `pipeline.run` calls differ only in the `dataset_name` parameter. Streamline the code by setting `dataset_name` conditionally and making a single call. Apply this refactor:

```python
if db_engine.engine.dialect.name == "sqlite":
    dataset_name_for_run = "main"  # SQLite doesn't support schemas
else:
    dataset_name_for_run = dataset_name

run_info = pipeline.run(
    data_resources(file_paths),
    table_name="file_metadata",
    dataset_name=dataset_name_for_run,
    write_disposition="merge",
)
```
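For context, a hedged sketch of how a `dlt` pipeline of this shape is typically constructed before the `run` call; the pipeline name and destination are assumptions, not taken from the PR:

```python
import dlt

# Illustrative only: the name and destination are placeholders.
pipeline = dlt.pipeline(
    pipeline_name="file_metadata_ingestion",
    destination="postgres",
)
```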
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)

- `cognee/tasks/ingestion/ingest_data_with_metadata.py` (2 hunks)
- `cognee/tasks/ingestion/save_data_item_with_metadata_to_storage.py` (1 hunk)
```diff
-@dlt.resource(standalone = True, merge_key = "id")
-async def data_resources(data: Any, user: User):
+@dlt.resource(standalone=True, merge_key="id")
+async def data_resources(file_paths: str):
```
Correct the type annotation for `file_paths` in `data_resources`.

The `file_paths` parameter should be annotated as a list of strings (`List[str]`), not `str`, since it is iterated over in the function. Apply this diff to fix the type annotation:
Apply this diff to fix the type annotation:
-from typing import Any
+from typing import Any, List
@dlt.resource(standalone=True, merge_key="id")
-async def data_resources(file_paths: str):
+async def data_resources(file_paths: List[str]):
Committable suggestion skipped: line range outside the PR's diff.
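For context, a hedged sketch of the corrected resource signature in use; the loop body and the yielded record are illustrative placeholders, not code from the PR:

```python
from typing import List

import dlt

@dlt.resource(standalone=True, merge_key="id")
async def data_resources(file_paths: List[str]):
    for file_path in file_paths:
        # Placeholder: classify the file and build its metadata record here.
        yield {"id": file_path, "name": file_path}
```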
```diff
-# Check if data is of type Document or any of it's subclasses
-if str(type(data_item)).startswith("llama_index"):
+if "llama_index" in str(type(data_item)):
```
🛠️ Refactor suggestion
Improve type checking for `data_item` to avoid using `str(type(data_item))`.

Using `str(type(data_item))` for type checking is not robust and can lead to maintenance issues. Consider using `isinstance`, duck typing, or checking for specific attributes. Since `llama_index` is an optional dependency, you can perform type checking as follows:
```python
try:
    from llama_index import LlamaIndexType  # Replace with the actual type
except ImportError:
    LlamaIndexType = None

if LlamaIndexType and isinstance(data_item, LlamaIndexType):
    ...  # Process data_item
```
Alternatively, use attribute checks:

```python
if hasattr(data_item, "some_unique_attribute"):
    ...  # Process data_item
```
This approach is more reliable and easier to maintain.
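As a concrete variant of the `isinstance` approach, a hedged sketch using llama_index's `Document` class; the import path differs across llama_index releases, so both common locations are tried:

```python
# Guard the optional dependency and fall back gracefully if it is absent.
try:
    from llama_index.core import Document as LlamaDocument  # newer releases
except ImportError:
    try:
        from llama_index import Document as LlamaDocument  # older releases
    except ImportError:
        LlamaDocument = None

if LlamaDocument is not None and isinstance(data_item, LlamaDocument):
    ...  # handle llama_index documents here
```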
Summary by CodeRabbit

- New Features: updated data ingestion process with metadata handling.
- Bug Fixes: resolved a type resolution issue involving the `llama_index` module.
- Refactor: reworked the ingestion flow in `ingest_data_with_metadata`.