Cog 685 more document types #269

dexters1 · 2024-12-09T08:35:10Z

Add support for additional document types by using the Unstructured Python library

Summary by CodeRabbit

New Features
- Introduced the UnstructuredDocument class for handling unstructured document types.
- Added support for additional document formats in the classification process.
- New mime_type attribute added to various document classes for enhanced processing.
Bug Fixes
- Updated dependency installation to include documentation-related extras for Python workflows.
Documentation
- Added docstring to the exceptions module for clarity on custom exceptions.
Tests
- New test file created for validating the functionality of the UnstructuredDocument class.
- Updated tests for existing document classes to accommodate the new mime_type parameter.

Added unstructured library and handling of certain document types through their library Feature COG-685

Remove the need for libmagic for unstructured documents by providing mime_type information Feature COG-685

Add unstructured library as docs optional extension to pyproject.toml Chore COG-685

Added exception when unstructured library is called but not installed Feature COG-685

Added pptx example file and tested Unstructured pptx document type handling Test COG-685

Update library gh actions to install docs extra to test unstructured integration tests Chore COG-685

coderabbitai · 2024-12-09T08:35:18Z

Walkthrough

The pull request modifies the workflow configurations for testing Python 3.9, 3.10, and 3.11 by adding the -E docs option to the poetry install command. Despite its name, the docs extra installs the unstructured library and its format-specific extras, which the new integration tests need. Additionally, a new exception class UnstructuredLibraryImportError is introduced, along with the UnstructuredDocument class and its read method. The Document class gains a new mime_type attribute, and the existing document classes and their tests are updated to pass it at initialization.

Changes

File Path	Change Summary
.github/workflows/test_python_3_9.yml	Updated `poetry install` command to include `-E docs` for documentation dependencies.
.github/workflows/test_python_3_10.yml	Updated `poetry install` command to include `-E docs` for documentation dependencies.
.github/workflows/test_python_3_11.yml	Updated `poetry install` command to include `-E docs` for documentation dependencies.
cognee/modules/data/exceptions/__init__.py	Added docstring describing the module for custom exceptions.
cognee/modules/data/exceptions/exceptions.py	Introduced `UnstructuredLibraryImportError` exception class for handling specific import errors.
cognee/modules/data/processing/document_types/Document.py	Added `mime_type: str` attribute to `Document` class.
cognee/modules/data/processing/document_types/UnstructuredDocument.py	Added `UnstructuredDocument` class with `read` method for handling unstructured documents.
cognee/modules/data/processing/document_types/__init__.py	Imported `UnstructuredDocument` class to expand document types.
cognee/tasks/documents/classify_documents.py	Updated `classify_documents` function to include `mime_type` parameter for document instantiation.
cognee/tests/integration/documents/AudioDocument_test.py	Updated `AudioDocument` test to include `mime_type` parameter during instantiation.
cognee/tests/integration/documents/ImageDocument_test.py	Updated `ImageDocument` test to include `mime_type` parameter during instantiation.
cognee/tests/integration/documents/PdfDocument_test.py	Updated `PdfDocument` test to include `mime_type` parameter during instantiation.
cognee/tests/integration/documents/TextDocument_test.py	Updated `TextDocument` test to include `mime_type` parameter during instantiation.
cognee/tests/integration/documents/UnstructuredDocument_test.py	Added new tests for `UnstructuredDocument` class functionality.
pyproject.toml	Added new optional dependency `unstructured` and created `docs` extra group for documentation.

Possibly related PRs

Add manual execution for python-3.X tests #185: Modifications to the workflow configuration for testing Python 3.10, focusing on dependency management.
Cog 337 llama index support #186: Updates to the workflow for testing Python 3.11, related to dependency management.
Cog 577 add unit test task #197: Changes to the workflow for testing Python 3.9, refining the dependency installation process.
chore: Add optional dependencies #241: Introduction of optional dependencies in the pyproject.toml, complementing the main PR's focus.
Feature: Integrate Milvus as a Vector Database Provider #243: Updates related to dependency management and workflow configurations for integrating Milvus.
Milvus vector db #244: Modifications to the .env.template for Milvus, tying into dependency management.
test: Update gh actions so they can run outside of PR to main #263: Updates to GitHub Actions workflows to run on pull requests, relevant to workflow configuration changes.

Suggested reviewers

hajdul88
Vasilije1990

Poem

🐰 In the land of code where rabbits play,
New dependencies hop in, brightening the day.
With mime_type added, our documents gleam,
Unstructured wonders, a coder's sweet dream!
Let's test and ensure all functions align,
For in this code garden, our projects will shine! 🌼

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between e0a3563 and acf5952.

📒 Files selected for processing (1)

cognee/tests/integration/documents/UnstructuredDocument_test.py (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

cognee/tests/integration/documents/UnstructuredDocument_test.py

Update unstructured so it would install support for different document types Chore COG-685

Add tests for unstructured reading for different document types Test COG-685

…cognee into COG-685-more-document-types

coderabbitai

Actionable comments posted: 5

🧹 Outside diff range and nitpick comments (6)

cognee/modules/data/processing/document_types/UnstructuredDocument.py (1)
14-16: Handle exceptions during partitioning for robust error management

Currently, the try-except block only catches ModuleNotFoundError when importing partition. If the partition function itself raises an exception (e.g., due to an unsupported file type or a read error), it will not be handled gracefully. Consider adding exception handling for potential runtime errors during partitioning.

Apply this diff to handle exceptions during partitioning:
             try:
                 from unstructured.partition.auto import partition
             except ModuleNotFoundError:
                 raise UnstructuredLibraryImportError
+            except Exception as e:
+                raise UnstructuredLibraryImportError(f"An error occurred during partitioning: {e}")
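An `except` clause attached to the import only covers the import statement itself, so catching failures raised by `partition` needs a separate `try` around the call. The following is a minimal sketch of that separation, not the PR's code: `DocumentPartitionError`, `load_partition`, and `partition_file` are hypothetical names, and `UnstructuredLibraryImportError` is redefined here as a stand-in because its real definition is not shown on this page.

```python
class UnstructuredLibraryImportError(Exception):
    """Stand-in for the exception added in this PR (real definition not shown here)."""


class DocumentPartitionError(Exception):
    """Hypothetical exception for failures raised while a file is being partitioned."""


def load_partition():
    """Import `partition` lazily so that a missing library maps to one specific error."""
    try:
        from unstructured.partition.auto import partition
    except ModuleNotFoundError as error:
        raise UnstructuredLibraryImportError("unstructured is not installed") from error
    return partition


def partition_file(path, content_type, partition=None):
    """Partition `path`; `partition` can be injected, otherwise it is imported on demand."""
    if partition is None:
        partition = load_partition()
    try:
        return partition(filename=path, content_type=content_type)
    except Exception as error:
        # Runtime failures (unsupported type, unreadable file) get their own exception.
        raise DocumentPartitionError(f"Could not partition {path}: {error}") from error
```

Keeping the two failure modes in different exception types lets callers tell "install the extra" apart from "this file could not be read". The injectable `partition` argument is only there to make the sketch testable without the library installed.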
cognee/tests/integration/documents/PdfDocument_test.py (1)
20-21: Enhance test coverage for mime_type handling

The test currently uses an empty string for mime_type. Consider:

Adding test cases with actual PDF mime types (e.g., "application/pdf")

Testing behavior with invalid mime types
-        mime_type="",
+        mime_type="application/pdf",
Also consider adding a new test case:
def test_PdfDocument_invalid_mime_type():
    with pytest.raises(ValueError):
        PdfDocument(
            id=uuid.uuid4(),
            name="Test document.pdf",
            raw_data_location=test_file_path,
            metadata_id=uuid.uuid4(),
            mime_type="invalid/type"
        )
cognee/tests/integration/documents/TextDocument_test.py (1)
32-32: Enhance mime_type test coverage

The test should be extended to cover mime_type handling:

Add mime_type to the parameterized test inputs

Test with various text mime types (e.g., "text/plain", "text/markdown")
 @pytest.mark.parametrize(
-    "input_file,chunk_size",
-    [("code.txt", 256), ("Natural_language_processing.txt", 128)],
+    "input_file,chunk_size,mime_type",
+    [
+        ("code.txt", 256, "text/plain"),
+        ("Natural_language_processing.txt", 128, "text/markdown"),
+    ],
 )
-def test_TextDocument(input_file, chunk_size):
+def test_TextDocument(input_file, chunk_size, mime_type):
Then update the document instantiation:
-        mime_type="",
+        mime_type=mime_type,
cognee/tests/integration/documents/AudioDocument_test.py (1)
30-30: Consider adding MIME type validation tests

While the empty MIME type works for the test, consider:

Using a realistic audio MIME type (e.g., "audio/mpeg", "audio/wav")

Adding test cases to verify MIME type handling
-        id=uuid.uuid4(), name="audio-dummy-test", raw_data_location="", metadata_id=uuid.uuid4(), mime_type="",
+        id=uuid.uuid4(), name="audio-dummy-test", raw_data_location="", metadata_id=uuid.uuid4(), mime_type="audio/mpeg",
cognee/tests/integration/documents/ImageDocument_test.py (1)
19-19: Consider adding MIME type validation tests

While the empty MIME type works for the test, consider:

Using a realistic image MIME type (e.g., "image/jpeg", "image/png")

Adding test cases to verify MIME type handling
-        id=uuid.uuid4(), name="image-dummy-test", raw_data_location="", metadata_id=uuid.uuid4(), mime_type="",
+        id=uuid.uuid4(), name="image-dummy-test", raw_data_location="", metadata_id=uuid.uuid4(), mime_type="image/jpeg",
cognee/tests/integration/documents/UnstructuredDocument_test.py (1)
8-13: Simplify file path construction

The repeated file path construction logic can be simplified and made more maintainable.
+def get_test_file_path(filename):
+    return os.path.join(
+        os.path.dirname(__file__), "..", "..",
+        "test_data", filename
+    )

-    pptx_file_path = os.path.join(
-        os.sep,
-        *(os.path.dirname(__file__).split(os.sep)[:-2]),
-        "test_data",
-        "example.pptx",
-    )
+    pptx_file_path = get_test_file_path("example.pptx")
Also applies to: 15-20, 22-27, 29-34

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between ce96431 and e0a3563.

⛔ Files ignored due to path filters (5)

cognee/tests/test_data/example.csv is excluded by !**/*.csv
cognee/tests/test_data/example.docx is excluded by !**/*.docx
cognee/tests/test_data/example.pptx is excluded by !**/*.pptx
cognee/tests/test_data/example.xlsx is excluded by !**/*.xlsx
poetry.lock is excluded by !**/*.lock

📒 Files selected for processing (15)

.github/workflows/test_python_3_10.yml (1 hunks)
.github/workflows/test_python_3_11.yml (1 hunks)
.github/workflows/test_python_3_9.yml (1 hunks)
cognee/modules/data/exceptions/__init__.py (1 hunks)
cognee/modules/data/exceptions/exceptions.py (1 hunks)
cognee/modules/data/processing/document_types/Document.py (1 hunks)
cognee/modules/data/processing/document_types/UnstructuredDocument.py (1 hunks)
cognee/modules/data/processing/document_types/__init__.py (1 hunks)
cognee/tasks/documents/classify_documents.py (2 hunks)
cognee/tests/integration/documents/AudioDocument_test.py (1 hunks)
cognee/tests/integration/documents/ImageDocument_test.py (1 hunks)
cognee/tests/integration/documents/PdfDocument_test.py (1 hunks)
cognee/tests/integration/documents/TextDocument_test.py (1 hunks)
cognee/tests/integration/documents/UnstructuredDocument_test.py (1 hunks)
pyproject.toml (2 hunks)

✅ Files skipped from review due to trivial changes (1)

cognee/modules/data/exceptions/__init__.py

🔇 Additional comments (12)

.github/workflows/test_python_3_10.yml (1)

50-50: Verify dependency compatibility across Python versions

The addition of docs dependencies should be tested across all Python versions (3.9, 3.10, 3.11) to ensure compatibility.

✅ Verification successful

Python version compatibility verified for docs dependencies

The codebase is configured to support Python versions from 3.9 to 3.11 (python = ">=3.9.0,<3.12"), and there are no specific version constraints for documentation dependencies that would conflict with this range. The addition of -E docs to the poetry install command in the Python 3.10 workflow is safe.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for any Python version constraints in dependencies

# Look for Python version constraints in pyproject.toml
if [ -f "pyproject.toml" ]; then
    echo "Checking Python version constraints in dependencies..."
    rg 'python = ".*"' pyproject.toml
    rg 'python_version = ".*"' pyproject.toml
fi

Length of output: 273

Script:

#!/bin/bash
# Check the docs dependencies section in pyproject.toml
echo "Checking docs dependencies..."
rg -A 5 '\[tool\.poetry\.extras\]' pyproject.toml

# Check if there are any version constraints in docs dependencies
echo -e "\nChecking version constraints in docs dependencies..."
rg -A 10 '\[tool\.poetry\.dependencies\]' pyproject.toml | rg 'sphinx|furo|myst-parser'

Length of output: 509

.github/workflows/test_python_3_11.yml (1)

50-50: Verify CI pipeline execution with new dependencies

The addition of docs dependencies across all workflows may impact:

CI pipeline execution time due to additional dependency installation
Disk space usage in the CI environment

Consider:

Monitoring CI pipeline execution times
Reviewing if all docs dependencies are necessary for testing

✅ Verification successful

Documentation dependencies are necessary for testing

The -E docs flag is justified because:

The unstructured package (included in docs extras) is required for core functionality, as it handles various document formats (CSV, DOC, DOCX, etc.)
The workflow runs unit tests, integration tests, and a basic pipeline test that may depend on document processing capabilities
The docs dependencies are minimal, only including unstructured and its format-specific extras, not the heavier documentation tools (which are correctly placed in the [tool.poetry.group.docs.dependencies] section)

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check size of docs dependencies

# Look for documentation-related dependencies and their sizes
if [ -f "pyproject.toml" ]; then
    echo "Analyzing potential documentation dependencies..."
    # Check for common doc-related packages
    rg -i "sphinx|mkdocs|pdoc|docutils|jupyter|ipython" pyproject.toml
fi

Length of output: 462

Script:

#!/bin/bash
# Let's check if these dependencies are actually used in the test workflow
rg -l "pytest|test" . --type yaml

# Then examine the content of test files to see if they require notebook/jupyter features
fd -e py "test_.*" -x rg -l "jupyter|notebook|ipykernel" {}

Length of output: 120

Script:

#!/bin/bash
# Let's check the actual test dependencies and where docs dependencies are used
cat pyproject.toml

# Also check the workflow file to understand the test steps
cat .github/workflows/test_python_3_11.yml

Length of output: 5695

cognee/modules/data/processing/document_types/UnstructuredDocument.py (2)

1-32: Overall implementation aligns with project objectives

The UnstructuredDocument class is well-implemented to handle unstructured documents using the unstructured library. Error handling for missing dependencies is appropriately addressed, and the read method effectively processes and yields text chunks.

18-18: Pass appropriate MIME type parameter to partition function

Ensure that the parameter content_type used in the partition function correctly corresponds to the expected parameter for setting the MIME type. If the partition function expects a different parameter name (e.g., mime_type), update it accordingly.

Run the following script to verify the correct parameter name:
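A check of this kind can be sketched with `inspect.signature` from the standard library. The helper name `accepts_keyword` and the stand-in function `fake_partition` are assumptions made for illustration; this is not the script CodeRabbit ran.

```python
import inspect


def accepts_keyword(func, name):
    """Return True if `func` can be called with `name` as a keyword argument."""
    parameters = inspect.signature(func).parameters
    param = parameters.get(name)
    if param is not None and param.kind in (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    ):
        return True
    # A **kwargs catch-all also accepts arbitrary keyword names.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in parameters.values())


# Hypothetical stand-in that mirrors the keyword names discussed above.
def fake_partition(filename=None, content_type=None):
    return []


print(accepts_keyword(fake_partition, "content_type"))
print(accepts_keyword(fake_partition, "mime_type"))
```

With the library installed, the same helper could be pointed at `unstructured.partition.auto.partition` instead of the stand-in.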

cognee/modules/data/processing/document_types/__init__.py (1)

6-6: Import statement added correctly for UnstructuredDocument

The UnstructuredDocument class has been properly imported, extending the set of document types available for processing.

cognee/modules/data/processing/document_types/Document.py (1)

9-9: Added mime_type attribute to Document class

Introducing the mime_type attribute enhances the Document class by allowing it to store and utilize the MIME type of documents, which is essential for processing various file types correctly.

cognee/modules/data/exceptions/exceptions.py (1)

1-11: New exception class UnstructuredLibraryImportError added correctly

The custom exception class is properly defined to handle cases where the unstructured library is not installed, improving error clarity and providing a specific response to import errors.

cognee/tasks/documents/classify_documents.py (2)

61-61: Verify mime_type handling across document types

The mime_type parameter has been added to document instantiation. Ensure consistent mime_type handling across all document classes.

✅ Verification successful

The previous ast-grep pattern might have failed due to syntax. Let's try a different approach to find Document classes and mime_type usage in the codebase.

Let's examine the Document base class and its implementations to verify mime_type handling.

mime_type is properly handled across document types

The verification shows that mime_type is consistently handled:

It's properly defined in the base Document class as a required field
All document types (Audio, Image, PDF, Text, Unstructured) inherit from Document and thus include mime_type
Only UnstructuredDocument actively uses mime_type in its read() method for content parsing
The mime_type parameter is correctly passed through the data ingestion flow

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check mime_type usage in document classes
ast-grep --pattern 'class $DOCUMENT extends Document {
  $$$
  def __init__($ARGS) {
    $$$
    mime_type
    $$$
  }
  $$$
}'

Length of output: 129

Script:

#!/bin/bash
# Search for Document class definitions and their mime_type usage
rg -U "class.*Document.*:.*?def __init__" -A 10

# Search for mime_type usage in document-related files
rg "mime_type" --type py

Length of output: 2862

Script:

#!/bin/bash
# Let's check the Document base class definition
cat cognee/modules/data/processing/document_types/Document.py

# And check document class implementations
fd "Document.py" --exec cat {} \;

Length of output: 3952

15-23: Verify support for all mapped document formats

The UnstructuredDocument class is now handling multiple office document formats. Please ensure that:

The Unstructured library supports all these formats
There's proper error handling if a format is unsupported
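One way to make the unsupported-format case explicit is to look the MIME type up in a fixed table and fail loudly on a miss. This is a sketch under stated assumptions: the table, `UnsupportedDocumentTypeError`, and `mime_type_for` are hypothetical and do not reproduce the mapping in `classify_documents.py`, which is not shown here. The extensions are the four test files listed in this review.

```python
from pathlib import Path

# Hypothetical lookup table covering the four example files used by the tests.
UNSTRUCTURED_MIME_TYPES = {
    ".csv": "text/csv",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}


class UnsupportedDocumentTypeError(ValueError):
    """Hypothetical error raised when no MIME type is mapped for a file extension."""


def mime_type_for(file_name):
    """Return the mapped MIME type for `file_name`, or raise if the extension is unknown."""
    extension = Path(file_name).suffix.lower()
    try:
        return UNSTRUCTURED_MIME_TYPES[extension]
    except KeyError:
        raise UnsupportedDocumentTypeError(
            f"No MIME type mapped for extension {extension!r}"
        ) from None
```

Resolving the MIME type from a table like this is also what removes the need for content sniffing with libmagic, since the type is supplied up front.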

cognee/tests/integration/documents/UnstructuredDocument_test.py (1)

1-80: Verify test data files existence

The test assumes test files exist but doesn't verify their presence. Let's check if the required test files are available.

✅ Verification successful

All test data files are present and correctly located

All required test files (example.pptx, example.docx, example.csv, example.xlsx) are present in the ./cognee/tests/test_data directory, which matches the path construction in the test file. The test's file path assumptions are valid.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if test data files exist

# Find test_data directory
test_data_dir=$(find . -type d -name "test_data" | head -n 1)

if [ -z "$test_data_dir" ]; then
    echo "Error: test_data directory not found"
    exit 1
fi

# Check for required test files
required_files=("example.pptx" "example.docx" "example.csv" "example.xlsx")
for file in "${required_files[@]}"; do
    if [ ! -f "$test_data_dir/$file" ]; then
        echo "Missing test file: $file"
    else
        echo "Found test file: $file"
    fi
done

Length of output: 844

pyproject.toml (2)

95-95: LGTM! Good organization of optional dependencies.

The addition of the docs extra group is well-structured and follows the project's pattern for optional features. This makes it easy for users to install document-related dependencies only when needed.

76-77: Verify the unstructured library version and dependencies.

The addition of the unstructured library with comprehensive document format support aligns well with the PR objective of supporting more document types. However, let's verify the version compatibility and security.

✅ Verification successful

Version 0.16.10 of unstructured is appropriate and secure

The specified version ^0.16.10 is the latest available release on PyPI, and:

No security vulnerabilities were found for the package
The version constraint (^0.16.10) resolves to >=0.16.10,<0.17.0 under Poetry's caret rules for 0.x releases, so it allows patch updates within 0.16 and blocks the next minor release
The package version is compatible with the project's Python requirement (>=3.9.0,<3.12)
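The range a caret constraint allows can be worked out by hand: the upper bound bumps the left-most non-zero component. The function below is a small illustration of that rule for full three-part versions only; `caret_range` is a hypothetical helper, not part of Poetry, and it does not handle shortened forms such as `^0.16` or pre-release tags.

```python
def caret_range(version):
    """Return (inclusive lower bound, exclusive upper bound) for a caret constraint.

    Only full `major.minor.patch` versions are supported in this sketch.
    """
    major, minor, patch = (int(part) for part in version.split("."))
    if major > 0:
        upper = (major + 1, 0, 0)
    elif minor > 0:
        upper = (0, minor + 1, 0)
    else:
        upper = (0, 0, patch + 1)
    return version, ".".join(str(part) for part in upper)


lower, upper = caret_range("0.16.10")
print(f">={lower},<{upper}")
```

For `0.16.10` the left-most non-zero component is the minor number, so the upper bound is `0.17.0`.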

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify unstructured library version compatibility and security

# Check PyPI for latest versions and release date
curl -s https://pypi.org/pypi/unstructured/json | jq -r '.releases | keys_unsorted[-5:] as $versions | $versions[] as $v | [., (.releases[$v][0].upload_time[0:10])] | @tsv'

# Check for security advisories
gh api graphql -f query='
{
  securityVulnerabilities(first: 5, ecosystem: PIP, package: "unstructured") {
    nodes {
      advisory {
        summary
        severity
        publishedAt
      }
      vulnerableVersionRange
      firstPatchedVersion {
        identifier
      }
    }
  }
}'

Length of output: 596

Script:

#!/bin/bash
# Let's try a different approach to check the version history
curl -s https://pypi.org/pypi/unstructured/json | jq -r '.releases | keys[]' | sort -V | tail -n 5

# Let's also check the dependencies of the project to ensure compatibility
cat pyproject.toml | grep -A 20 "^\[tool.poetry.dependencies\]"

Length of output: 677

coderabbitai · 2024-12-09T14:31:46Z

cognee/tests/integration/documents/UnstructuredDocument_test.py

+    # Test PPTX
+    for paragraph_data in pptx_document.read(chunk_size=1024):
+        assert 19 == paragraph_data.word_count, f' 19 != {paragraph_data.word_count = }'
+        assert 104 == len(paragraph_data.text), f' 104 != {len(paragraph_data.text) = }'
+        assert 'sentence_cut' == paragraph_data.cut_type, f' sentence_cut != {paragraph_data.cut_type = }'


🛠️ Refactor suggestion

Improve test structure and data organization

The test could benefit from:

Moving expected values to constants

Adding file existence checks

Creating separate test functions for each file type

Using parametrized tests

Example refactor:

import uuid
from pathlib import Path

import pytest

# Import path taken from the package __init__.py listed in this PR.
from cognee.modules.data.processing.document_types import UnstructuredDocument

TEST_CASES = [
    {
        'file': 'example.pptx',
        'mime_type': 'application/vnd.openxmlformats-officedocument.presentationml.presentation',
        'expected': {'word_count': 19, 'text_length': 104, 'cut_type': 'sentence_cut'}
    },
    # ... other test cases
]

@pytest.mark.parametrize("test_case", TEST_CASES)
def test_unstructured_document(test_case):
    # tests/integration/documents/<this file> -> tests/test_data/<file>
    file_path = Path(__file__).parent.parent.parent / 'test_data' / test_case['file']
    assert file_path.exists(), f"Test file {file_path} not found"

    document = UnstructuredDocument(
        id=uuid.uuid4(),
        name=test_case['file'],
        raw_data_location=str(file_path),
        metadata_id=uuid.uuid4(),
        mime_type=test_case['mime_type']
    )

    for paragraph_data in document.read(chunk_size=1024):
        assert test_case['expected']['word_count'] == paragraph_data.word_count
        assert test_case['expected']['text_length'] == len(paragraph_data.text)
        assert test_case['expected']['cut_type'] == paragraph_data.cut_type

Also applies to: 63-67, 69-74, 76-80

cognee/tests/integration/documents/UnstructuredDocument_test.py

Update typo for file name in test Test COG-685

dexters1 added 6 commits December 6, 2024 17:50

feat: Add unstructured document handler

7821445

Added unstructured library and handling of certain document types through their library Feature COG-685

feat: Remove the need for libmagic for unstructured documents

62db3f8

Remove the need for libmagic for unstructured documents by providing mime_type information Feature COG-685

chore: Update pyproject file with unstructured library

53b7806

Add unstructured library as docs optional extension to pyproject.toml Chore COG-685

feat: Add UnstructuredLibraryImportError

07d9330

Added exception when unstructured library is called but not installed Feature COG-685

test: Add test for Unstructured pptx document type

596b3ed

Added pptx example file and tested Unstructured pptx document type handling Test COG-685

chore: Update gh actions to install docs extra

5567370

Update library gh actions to install docs extra to test unstructured integration tests Chore COG-685

dexters1 self-assigned this Dec 9, 2024

dexters1 added the run-checks label Dec 9, 2024

dexters1 and others added 4 commits December 9, 2024 09:49

chore: Update dependencies to handle different document types

df289de

Update unstructured so it would install support for different document types Chore COG-685

Merge branch 'main' into COG-685-more-document-types

344865f

test: Add tests for different document types

d7d559f

Add tests for unstructured reading for different document types Test COG-685

Merge branch 'COG-685-more-document-types' of github.com:topoteretes/…

e0a3563

…cognee into COG-685-more-document-types

dexters1 marked this pull request as ready for review December 9, 2024 14:28

dexters1 requested a review from borisarzentar December 9, 2024 14:29

coderabbitai bot reviewed Dec 9, 2024

View reviewed changes

test: Update typo in unstructured test

acf5952

Update typo for file name in test Test COG-685

Vasilije1990 self-requested a review December 9, 2024 17:02

Vasilije1990 approved these changes Dec 9, 2024

View reviewed changes

Vasilije1990 merged commit 5ffbebd into main Dec 9, 2024
40 checks passed

Vasilije1990 deleted the COG-685-more-document-types branch December 9, 2024 17:03

This was referenced Dec 11, 2024

deleted files #342

Merged

Added basic profiling #255

Merged

Cog 656 deployment state #368

Merged
