_parse_pdf_to_string: catch and ignore PDFException.
This function is a best-effort attempt to get text from a PDF file:
if the file is malformed (PDFSyntaxError), this function returns an
empty string. However, other exceptions, such as
PDFPasswordIncorrect, can occur even if the file is well-formed.

Although it would be better to handle these exceptions at a higher
level, this is a temporary fix to allow training applications
containing encrypted files to be rejected.
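
The effect of widening the except clause can be sketched without pdfminer installed. The classes below are local stand-ins, assumed to mirror the pdfminer hierarchy in which both PDFSyntaxError and PDFPasswordIncorrect derive from PDFException; parse_to_string and the three extractor functions are hypothetical names used only for illustration and are not part of console/services.py.

```python
import logging
from typing import Callable


# Stand-in exception classes, defined locally so the sketch is runnable.
# They are assumed to mirror the pdfminer hierarchy, not imported from it.
class PDFException(Exception):
    pass


class PDFSyntaxError(PDFException):
    pass


class PDFEncryptionError(PDFException):
    pass


class PDFPasswordIncorrect(PDFEncryptionError):
    pass


def parse_to_string(path: str, extract: Callable[[str], str]) -> str:
    """Best-effort extraction: return '' if extract raises any PDFException."""
    try:
        text = extract(path)
    except PDFException:
        # Catching the base class covers malformed and encrypted files alike.
        text = ''
        logging.error('Failed to extract text from %s', path)
    # Collapse all runs of whitespace into single spaces.
    return ' '.join(text.split())


# Hypothetical extractors standing in for extract_text on different inputs.
def encrypted(path: str) -> str:
    raise PDFPasswordIncorrect(path)


def malformed(path: str) -> str:
    raise PDFSyntaxError(path)


def well_formed(path: str) -> str:
    return 'Completion   Report\nRecord  ID 123'
```

With these stand-ins, both the encrypted and the malformed case yield an empty string, while an exception outside the PDFException hierarchy still propagates to the caller.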
Benjamin Moody committed Jan 19, 2024
1 parent ac23bf7 commit b21a2e1
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions physionet-django/console/services.py
@@ -4,7 +4,7 @@
 from typing import Optional
 
 from pdfminer.high_level import extract_text
-from pdfminer.pdfparser import PDFSyntaxError
+from pdfminer.pdfparser import PDFException
 from django.conf import settings
 
 from user.models import Training
@@ -29,7 +29,7 @@ def _get_regex_value_from_text(text: str, regex: str) -> Optional[str]:
 def _parse_pdf_to_string(training_path: str) -> str:
     try:
         text = extract_text(training_path)
-    except PDFSyntaxError:
+    except PDFException:
         text = ''
         logging.error(f'Failed to extract text from {training_path}')
     return ' '.join(text.split())
