_parse_pdf_to_string: catch and ignore PDFException.

This function is a best-effort attempt to extract text from a PDF file: if the file is malformed (PDFSyntaxError), it returns an empty string. However, other exceptions, such as PDFPasswordIncorrect, can occur even if the file is well-formed. Although it would be better to handle these exceptions at a higher level, this is a temporary fix so that training applications containing encrypted files can be rejected.
MIT-LCP · Jan 19, 2024 · b21a2e1
1 parent ac23bf7
Showing 1 changed file with 2 additions and 2 deletions.
diff --git a/physionet-django/console/services.py b/physionet-django/console/services.py
@@ -4,7 +4,7 @@
 from typing import Optional
 
 from pdfminer.high_level import extract_text
-from pdfminer.pdfparser import PDFSyntaxError
+from pdfminer.pdfparser import PDFException
 from django.conf import settings
 
 from user.models import Training
@@ -29,7 +29,7 @@ def _get_regex_value_from_text(text: str, regex: str) -> Optional[str]:
 def _parse_pdf_to_string(training_path: str) -> str:
     try:
         text = extract_text(training_path)
-    except PDFSyntaxError:
+    except PDFException:
         text = ''
         logging.error(f'Failed to extract text from {training_path}')
     return ' '.join(text.split())
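
For illustration only, the effect of widening the `except` clause can be sketched without pdfminer installed. The exception classes below are stand-ins that mirror the hierarchy this change relies on (assumed here: `PDFSyntaxError` and `PDFPasswordIncorrect` both derive from `PDFException`); the `extract` parameter and the `encrypted_extractor` helper are hypothetical and exist only to make the sketch self-contained. It is not the project's code.

```python
import logging
from typing import Callable


# Stand-in classes, NOT the real pdfminer ones. They mirror the hierarchy
# this commit assumes: every PDF-specific error derives from PDFException.
class PDFException(Exception):
    """Stand-in for pdfminer's base PDF exception."""


class PDFSyntaxError(PDFException):
    """Stand-in: raised for malformed files."""


class PDFEncryptionError(PDFException):
    """Stand-in: raised for encryption problems."""


class PDFPasswordIncorrect(PDFEncryptionError):
    """Stand-in: raised when an encrypted file cannot be opened."""


def parse_pdf_to_string(path: str, extract: Callable[[str], str]) -> str:
    """Return whitespace-normalized text, or '' if extraction fails.

    `extract` is a hypothetical injection point standing in for
    pdfminer's extract_text, so the sketch runs without the library.
    """
    try:
        text = extract(path)
    except PDFException:
        # Catching the base class covers syntax errors and encryption
        # errors alike; the failure is logged, not silently dropped.
        text = ''
        logging.error('Failed to extract text from %s', path)
    return ' '.join(text.split())


def encrypted_extractor(path: str) -> str:
    """Hypothetical extractor that behaves like an encrypted PDF."""
    raise PDFPasswordIncorrect(path)
```

With the narrower `except PDFSyntaxError`, the `PDFPasswordIncorrect` raised by `encrypted_extractor` would propagate to the caller; with `except PDFException`, the function returns an empty string instead.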