-
Notifications
You must be signed in to change notification settings - Fork 86
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cog 685 more document types #269
Conversation
Added unstructured library and handling of certain document types through their library Feature COG-685
Remove the need for libmagic so for unstructured documents by providing mime_type information Feature COG-685
Add unstructured library as docs optional extension to pyproject.toml Chore COG-685
Added exception when unstructured libary is called but not installed Feature COG-685
Added pptx example file and tested Unstructured pptx document type handling Test COG-685
Update library gh actions to install docs extra to test unstructured integration tests Chore COG-685
WalkthroughThe pull request includes modifications to workflow configurations for testing Python 3.9, 3.10, and 3.11, specifically enhancing dependency management by adding the Changes
Possibly related PRs
Suggested reviewers
Poem
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
Update unstructured so it would install support for different document types Chore COG-685
Add tests for unstructured reading for different document types Test COG-685
…cognee into COG-685-more-document-types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 5
🧹 Outside diff range and nitpick comments (6)
cognee/modules/data/processing/document_types/UnstructuredDocument.py (1)
14-16
: Handle exceptions during partitioning for robust error managementCurrently, the
try-except
block only catchesModuleNotFoundError
when importingpartition
. If thepartition
function itself raises an exception (e.g., due to an unsupported file type or a read error), it will not be handled gracefully. Consider adding exception handling for potential runtime errors during partitioning.Apply this diff to handle exceptions during partitioning:
try: from unstructured.partition.auto import partition except ModuleNotFoundError: raise UnstructuredLibraryImportError + except Exception as e: + raise UnstructuredLibraryImportError(f"An error occurred during partitioning: {e}")cognee/tests/integration/documents/PdfDocument_test.py (1)
20-21
: Enhance test coverage for mime_type handlingThe test currently uses an empty string for mime_type. Consider:
- Adding test cases with actual PDF mime types (e.g., "application/pdf")
- Testing behavior with invalid mime types
- mime_type="", + mime_type="application/pdf",Also consider adding a new test case:
def test_PdfDocument_invalid_mime_type(): with pytest.raises(ValueError): PdfDocument( id=uuid.uuid4(), name="Test document.pdf", raw_data_location=test_file_path, metadata_id=uuid.uuid4(), mime_type="invalid/type" )cognee/tests/integration/documents/TextDocument_test.py (1)
32-32
: Enhance mime_type test coverageThe test should be extended to cover mime_type handling:
- Add mime_type to the parameterized test inputs
- Test with various text mime types (e.g., "text/plain", "text/markdown")
@pytest.mark.parametrize( - "input_file,chunk_size", - [("code.txt", 256), ("Natural_language_processing.txt", 128)], + "input_file,chunk_size,mime_type", + [ + ("code.txt", 256, "text/plain"), + ("Natural_language_processing.txt", 128, "text/markdown"), + ], ) -def test_TextDocument(input_file, chunk_size): +def test_TextDocument(input_file, chunk_size, mime_type):Then update the document instantiation:
- mime_type="", + mime_type=mime_type,cognee/tests/integration/documents/AudioDocument_test.py (1)
30-30
: Consider adding MIME type validation testsWhile the empty MIME type works for the test, consider:
- Using a realistic audio MIME type (e.g., "audio/mpeg", "audio/wav")
- Adding test cases to verify MIME type handling
- id=uuid.uuid4(), name="audio-dummy-test", raw_data_location="", metadata_id=uuid.uuid4(), mime_type="", + id=uuid.uuid4(), name="audio-dummy-test", raw_data_location="", metadata_id=uuid.uuid4(), mime_type="audio/mpeg",cognee/tests/integration/documents/ImageDocument_test.py (1)
19-19
: Consider adding MIME type validation testsWhile the empty MIME type works for the test, consider:
- Using a realistic image MIME type (e.g., "image/jpeg", "image/png")
- Adding test cases to verify MIME type handling
- id=uuid.uuid4(), name="image-dummy-test", raw_data_location="", metadata_id=uuid.uuid4(), mime_type="", + id=uuid.uuid4(), name="image-dummy-test", raw_data_location="", metadata_id=uuid.uuid4(), mime_type="image/jpeg",cognee/tests/integration/documents/UnstructuredDocument_test.py (1)
8-13
: Simplify file path constructionThe repeated file path construction logic can be simplified and made more maintainable.
+def get_test_file_path(filename): + return os.path.join( + os.path.dirname(__file__), "..", "..", + "test_data", filename + ) - pptx_file_path = os.path.join( - os.sep, - *(os.path.dirname(__file__).split(os.sep)[:-2]), - "test_data", - "example.pptx", - ) + pptx_file_path = get_test_file_path("example.pptx")Also applies to: 15-20, 22-27, 29-34
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
⛔ Files ignored due to path filters (5)
cognee/tests/test_data/example.csv
is excluded by!**/*.csv
cognee/tests/test_data/example.docx
is excluded by!**/*.docx
cognee/tests/test_data/example.pptx
is excluded by!**/*.pptx
cognee/tests/test_data/example.xlsx
is excluded by!**/*.xlsx
poetry.lock
is excluded by!**/*.lock
📒 Files selected for processing (15)
.github/workflows/test_python_3_10.yml
(1 hunks).github/workflows/test_python_3_11.yml
(1 hunks).github/workflows/test_python_3_9.yml
(1 hunks)cognee/modules/data/exceptions/__init__.py
(1 hunks)cognee/modules/data/exceptions/exceptions.py
(1 hunks)cognee/modules/data/processing/document_types/Document.py
(1 hunks)cognee/modules/data/processing/document_types/UnstructuredDocument.py
(1 hunks)cognee/modules/data/processing/document_types/__init__.py
(1 hunks)cognee/tasks/documents/classify_documents.py
(2 hunks)cognee/tests/integration/documents/AudioDocument_test.py
(1 hunks)cognee/tests/integration/documents/ImageDocument_test.py
(1 hunks)cognee/tests/integration/documents/PdfDocument_test.py
(1 hunks)cognee/tests/integration/documents/TextDocument_test.py
(1 hunks)cognee/tests/integration/documents/UnstructuredDocument_test.py
(1 hunks)pyproject.toml
(2 hunks)
✅ Files skipped from review due to trivial changes (1)
- cognee/modules/data/exceptions/init.py
🔇 Additional comments (12)
.github/workflows/test_python_3_10.yml (1)
50-50
: Verify dependency compatibility across Python versions
The addition of docs dependencies should be tested across all Python versions (3.9, 3.10, 3.11) to ensure compatibility.
✅ Verification successful
Python version compatibility verified for docs dependencies
The codebase is configured to support Python versions from 3.9 to 3.11 (python = ">=3.9.0,<3.12"
), and there are no specific version constraints for documentation dependencies that would conflict with this range. The addition of -E docs
to the poetry install command in the Python 3.10 workflow is safe.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for any Python version constraints in dependencies
# Look for Python version constraints in pyproject.toml
if [ -f "pyproject.toml" ]; then
echo "Checking Python version constraints in dependencies..."
rg 'python = ".*"' pyproject.toml
rg 'python_version = ".*"' pyproject.toml
fi
Length of output: 273
Script:
#!/bin/bash
# Check the docs dependencies section in pyproject.toml
echo "Checking docs dependencies..."
rg -A 5 '\[tool\.poetry\.extras\]' pyproject.toml
# Check if there are any version constraints in docs dependencies
echo -e "\nChecking version constraints in docs dependencies..."
rg -A 10 '\[tool\.poetry\.dependencies\]' pyproject.toml | rg 'sphinx|furo|myst-parser'
Length of output: 509
.github/workflows/test_python_3_11.yml (1)
50-50
: Verify CI pipeline execution with new dependencies
The addition of docs dependencies across all workflows may impact:
- CI pipeline execution time due to additional dependency installation
- Disk space usage in the CI environment
Consider:
- Monitoring CI pipeline execution times
- Reviewing if all docs dependencies are necessary for testing
✅ Verification successful
Documentation dependencies are necessary for testing
The -E docs
flag is justified because:
- The
unstructured
package (included in docs extras) is required for core functionality, as it handles various document formats (CSV, DOC, DOCX, etc.) - The workflow runs unit tests, integration tests, and a basic pipeline test that may depend on document processing capabilities
- The docs dependencies are minimal, only including
unstructured
and its format-specific extras, not the heavier documentation tools (which are correctly placed in the[tool.poetry.group.docs.dependencies]
section)
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check size of docs dependencies
# Look for documentation-related dependencies and their sizes
if [ -f "pyproject.toml" ]; then
echo "Analyzing potential documentation dependencies..."
# Check for common doc-related packages
rg -i "sphinx|mkdocs|pdoc|docutils|jupyter|ipython" pyproject.toml
fi
Length of output: 462
Script:
#!/bin/bash
# Let's check if these dependencies are actually used in the test workflow
rg -l "pytest|test" . --type yaml
# Then examine the content of test files to see if they require notebook/jupyter features
fd -e py "test_.*" -x rg -l "jupyter|notebook|ipykernel" {}
Length of output: 120
Script:
#!/bin/bash
# Let's check the actual test dependencies and where docs dependencies are used
cat pyproject.toml
# Also check the workflow file to understand the test steps
cat .github/workflows/test_python_3_11.yml
Length of output: 5695
cognee/modules/data/processing/document_types/UnstructuredDocument.py (2)
1-32
: Overall implementation aligns with project objectives
The UnstructuredDocument
class is well-implemented to handle unstructured documents using the unstructured
library. Error handling for missing dependencies is appropriately addressed, and the read
method effectively processes and yields text chunks.
18-18
: Pass appropriate MIME type parameter to partition
function
Ensure that the parameter content_type
used in the partition
function correctly corresponds to the expected parameter for setting the MIME type. If the partition
function expects a different parameter name (e.g., mime_type
), update it accordingly.
Run the following script to verify the correct parameter name:
cognee/modules/data/processing/document_types/__init__.py (1)
6-6
: Import statement added correctly for UnstructuredDocument
The UnstructuredDocument
class has been properly imported, extending the set of document types available for processing.
cognee/modules/data/processing/document_types/Document.py (1)
9-9
: Added mime_type
attribute to Document
class
Introducing the mime_type
attribute enhances the Document
class by allowing it to store and utilize the MIME type of documents, which is essential for processing various file types correctly.
cognee/modules/data/exceptions/exceptions.py (1)
1-11
: New exception class UnstructuredLibraryImportError
added correctly
The custom exception class is properly defined to handle cases where the unstructured
library is not installed, improving error clarity and providing a specific response to import errors.
cognee/tasks/documents/classify_documents.py (2)
61-61
: Verify mime_type handling across document types
The mime_type parameter has been added to document instantiation. Ensure consistent mime_type handling across all document classes.
✅ Verification successful
The previous ast-grep pattern might have failed due to syntax. Let's try a different approach to find Document classes and mime_type usage in the codebase.
Let's examine the Document base class and its implementations to verify mime_type handling.
mime_type is properly handled across document types
The verification shows that mime_type is consistently handled:
- It's properly defined in the base Document class as a required field
- All document types (Audio, Image, PDF, Text, Unstructured) inherit from Document and thus include mime_type
- Only UnstructuredDocument actively uses mime_type in its read() method for content parsing
- The mime_type parameter is correctly passed through the data ingestion flow
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check mime_type usage in document classes
ast-grep --pattern 'class $DOCUMENT extends Document {
$$$
def __init__($ARGS) {
$$$
mime_type
$$$
}
$$$
}'
Length of output: 129
Script:
#!/bin/bash
# Search for Document class definitions and their mime_type usage
rg -U "class.*Document.*:.*?def __init__" -A 10
# Search for mime_type usage in document-related files
rg "mime_type" --type py
Length of output: 2862
Script:
#!/bin/bash
# Let's check the Document base class definition
cat cognee/modules/data/processing/document_types/Document.py
# And check document class implementations
fd "Document.py" --exec cat {} \;
Length of output: 3952
15-23
: Verify support for all mapped document formats
The UnstructuredDocument class is now handling multiple office document formats. Please ensure that:
- The Unstructured library supports all these formats
- There's proper error handling if a format is unsupported
cognee/tests/integration/documents/UnstructuredDocument_test.py (1)
1-80
: Verify test data files existence
The test assumes test files exist but doesn't verify their presence. Let's check if the required test files are available.
✅ Verification successful
All test data files are present and correctly located
All required test files (example.pptx, example.docx, example.csv, example.xlsx) are present in the ./cognee/tests/test_data
directory, which matches the path construction in the test file. The test's file path assumptions are valid.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check if test data files exist
# Find test_data directory
test_data_dir=$(find . -type d -name "test_data" | head -n 1)
if [ -z "$test_data_dir" ]; then
echo "Error: test_data directory not found"
exit 1
fi
# Check for required test files
required_files=("example.pptx" "example.docx" "example.csv" "example.xlsx")
for file in "${required_files[@]}"; do
if [ ! -f "$test_data_dir/$file" ]; then
echo "Missing test file: $file"
else
echo "Found test file: $file"
fi
done
Length of output: 844
pyproject.toml (2)
95-95
: LGTM! Good organization of optional dependencies.
The addition of the docs
extra group is well-structured and follows the project's pattern for optional features. This makes it easy for users to install document-related dependencies only when needed.
76-77
: Verify the unstructured library version and dependencies.
The addition of the unstructured
library with comprehensive document format support aligns well with the PR objective of supporting more document types. However, let's verify the version compatibility and security.
✅ Verification successful
Version 0.16.10 of unstructured is appropriate and secure
The specified version ^0.16.10 is the latest available release on PyPI, and:
- No security vulnerabilities were found for the package
- The version constraint (^0.16.10) allows for safe minor updates while preventing breaking changes
- The package version is compatible with the project's Python requirement (>=3.9.0,<3.12)
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify unstructured library version compatibility and security
# Check PyPI for latest versions and release date
curl -s https://pypi.org/pypi/unstructured/json | jq -r '.releases | keys_unsorted[-5:] as $versions | $versions[] as $v | [., (.releases[$v][0].upload_time[0:10])] | @tsv'
# Check for security advisories
gh api graphql -f query='
{
securityVulnerabilities(first: 5, ecosystem: PIP, package: "unstructured") {
nodes {
advisory {
summary
severity
publishedAt
}
vulnerableVersionRange
firstPatchedVersion {
identifier
}
}
}
}'
Length of output: 596
Script:
#!/bin/bash
# Let's try a different approach to check the version history
curl -s https://pypi.org/pypi/unstructured/json | jq -r '.releases | keys[]' | sort -V | tail -n 5
# Let's also check the dependencies of the project to ensure compatibility
cat pyproject.toml | grep -A 20 "^\[tool.poetry.dependencies\]"
Length of output: 677
# Test PPTX | ||
for paragraph_data in pptx_document.read(chunk_size=1024): | ||
assert 19 == paragraph_data.word_count, f' 19 != {paragraph_data.word_count = }' | ||
assert 104 == len(paragraph_data.text), f' 104 != {len(paragraph_data.text) = }' | ||
assert 'sentence_cut' == paragraph_data.cut_type, f' sentence_cut != {paragraph_data.cut_type = }' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Improve test structure and data organization
The test could benefit from:
- Moving expected values to constants
- Adding file existence checks
- Creating separate test functions for each file type
- Using parametrized tests
Example refactor:
import pytest
from pathlib import Path
TEST_CASES = [
{
'file': 'example.pptx',
'mime_type': 'application/vnd.openxmlformats-officedocument.presentationml.presentation',
'expected': {'word_count': 19, 'text_length': 104, 'cut_type': 'sentence_cut'}
},
# ... other test cases
]
@pytest.mark.parametrize("test_case", TEST_CASES)
def test_unstructured_document(test_case):
file_path = Path(__file__).parent.parent.parent / 'test_data' / test_case['file']
assert file_path.exists(), f"Test file {file_path} not found"
document = UnstructuredDocument(
id=uuid.uuid4(),
name=test_case['file'],
raw_data_location=str(file_path),
metadata_id=uuid.uuid4(),
mime_type=test_case['mime_type']
)
for paragraph_data in document.read(chunk_size=1024):
assert test_case['expected']['word_count'] == paragraph_data.word_count
assert test_case['expected']['text_length'] == len(paragraph_data.text)
assert test_case['expected']['cut_type'] == paragraph_data.cut_type
Also applies to: 63-67, 69-74, 76-80
cognee/tests/integration/documents/UnstructuredDocument_test.py
Outdated
Show resolved
Hide resolved
Update typo for file name in test Test COG-685
Add support for additional document types by using Unstructured python library
Summary by CodeRabbit
New Features
UnstructuredDocument
class for handling unstructured document types.mime_type
attribute added to various document classes for enhanced processing.Bug Fixes
Documentation
Tests
UnstructuredDocument
class.mime_type
parameter.