Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check all the PDF's pages for oversized page dimensions #19987

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

michelpmcdonald
Copy link
Contributor

@michelpmcdonald michelpmcdonald commented Dec 20, 2024

Summary

Benefits Intake PDF uploads include a check for PDFs who's pages physical dimensions exceed a specific width or height.

The command line tool pdfinfo is used to fetch the PDF's dimensions. The way we were calling pdfinfo only provides the size of the fist page within the PDF, however, individual pages within the PDF can(and do) have different dimensions.

Only checking the size only the first page in the PDF will cause a downstream error with a somewhat unspecific error message(~corrupt PDF detected, guess it comes from doc conversions service) if the unchecked pages dimensions are too large. The error message returned from the downstream system does not clearly state that the PDF is oversized.

We can't really get rid of the dimension restrictions, we can't prevent the consumer from uploading oversized PDFs, but we can check ALL the pages prior to submitting downstream and provide our current pretty clear\specific oversized error message so at least the consumer knows what the issue is.

No flipper flag in play here.

The solution was to provide pdfinfo command line arguments that instructs it to check all the pages' dimensions, and we had to slightly modify the pdfinfo results parsing code.

Team Banana-peels, part of this is our code, but part of this is shared code.

For reviewers, modules/vba_documents/spec/fixtures/10x102.pdf was changed, now the oversized page is the second page in the pdf, not the first page like it as prior to this PR.

  • *This work is behind a feature toggle (flipper): NO

Related issue(s)

API-43483

Testing done

  • New code is covered by unit tests
    In addition to unit tests, manually tested locally

Screenshots

Note: Optional

What areas of the site does it impact?

Benefits Intake PDF validation

Acceptance criteria

  • I fixed|updated|added unit tests and integration tests for each feature (if applicable).
  • No error nor warning in the console.

@va-vfs-bot va-vfs-bot temporarily deployed to API-43484/main/main December 20, 2024 23:07 Inactive
@michelpmcdonald michelpmcdonald added Lighthouse lighthouse benefits-intake Lighthouse Benefits Intake API banana-peels Lighthouse Banana Peels Team labels Jan 6, 2025
@michelpmcdonald michelpmcdonald self-assigned this Jan 6, 2025
@va-vfs-bot va-vfs-bot temporarily deployed to API-43484/main/main January 6, 2025 16:10 Inactive
Copy link
Contributor

@kristen-brown kristen-brown left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean solution! I like it!

Assuming that you evaluated performance of checking every page in a PDF with a very large number of pages (like 215,000 pages, to grab our largest outlier from the monthly report) and that the file can still be processed successfully (I think you may have already mentioned doing this in standup), this looks good to me!

@@ -91,7 +91,7 @@ def pdf_metadata(pdf)
dimensions: {
height: dimensions[:height].round(2),
width: dimensions[:width].round(2),
Copy link
Contributor

@kristen-brown kristen-brown Jan 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder how useful these height and width values in the metadata are anymore, since they're only for the first page. 🤔

Not suggesting any specific changes, just something I'm thinking about.

@@ -18,7 +18,7 @@ def self.read(file_or_path)

def initialize(path)
@stdout = []
Open3.popen2e(Settings.binaries.pdfinfo, path) do |_stdin, stdout, wait|
Open3.popen2e(Settings.binaries.pdfinfo, '-l', '-1', path) do |_stdin, stdout, wait|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it's worth adding a comment here to explain what this option with the -1 magic number passed in does.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also looks like we are the only team using this library (vba_documents and appeals_api modules), so we won't be affecting anyone else's implementations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
banana-peels Lighthouse Banana Peels Team benefits-intake Lighthouse Benefits Intake API Lighthouse lighthouse require-backend-approval test-passing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants