-
Notifications
You must be signed in to change notification settings - Fork 66
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Check all the PDF's pages for oversized page dimensions #19987
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clean solution! I like it!
Assuming that you evaluated performance of checking every page in a PDF with a very large number of pages (like 215,000 pages, to grab our largest outlier from the monthly report) and that the file can still be processed successfully (I think you may have already mentioned doing this in standup), this looks good to me!
@@ -91,7 +91,7 @@ def pdf_metadata(pdf) | |||
dimensions: { | |||
height: dimensions[:height].round(2), | |||
width: dimensions[:width].round(2), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder how useful these height
and width
values in the metadata are anymore, since they're only for the first page. 🤔
Not suggesting any specific changes, just something I'm thinking about.
@@ -18,7 +18,7 @@ def self.read(file_or_path) | |||
|
|||
def initialize(path) | |||
@stdout = [] | |||
Open3.popen2e(Settings.binaries.pdfinfo, path) do |_stdin, stdout, wait| | |||
Open3.popen2e(Settings.binaries.pdfinfo, '-l', '-1', path) do |_stdin, stdout, wait| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it's worth adding a comment here to explain what this option with the -1
magic number passed in does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It also looks like we are the only team using this library (vba_documents
and appeals_api
modules), so we won't be affecting anyone else's implementations.
Summary
Benefits Intake PDF uploads include a check for PDFs who's pages physical dimensions exceed a specific width or height.
The command line tool pdfinfo is used to fetch the PDF's dimensions. The way we were calling pdfinfo only provides the size of the fist page within the PDF, however, individual pages within the PDF can(and do) have different dimensions.
Only checking the size only the first page in the PDF will cause a downstream error with a somewhat unspecific error message(~corrupt PDF detected, guess it comes from doc conversions service) if the unchecked pages dimensions are too large. The error message returned from the downstream system does not clearly state that the PDF is oversized.
We can't really get rid of the dimension restrictions, we can't prevent the consumer from uploading oversized PDFs, but we can check ALL the pages prior to submitting downstream and provide our current pretty clear\specific oversized error message so at least the consumer knows what the issue is.
No flipper flag in play here.
The solution was to provide pdfinfo command line arguments that instructs it to check all the pages' dimensions, and we had to slightly modify the pdfinfo results parsing code.
Team Banana-peels, part of this is our code, but part of this is shared code.
For reviewers, modules/vba_documents/spec/fixtures/10x102.pdf was changed, now the oversized page is the second page in the pdf, not the first page like it as prior to this PR.
Related issue(s)
API-43483
Testing done
In addition to unit tests, manually tested locally
Screenshots
Note: Optional
What areas of the site does it impact?
Benefits Intake PDF validation
Acceptance criteria