Add an allow list for specific file types to be uploaded into the search index #3497

Closed
1 task
chouinar opened this issue Jan 13, 2025 · 3 comments · Fixed by #3544

@chouinar
Collaborator

chouinar commented Jan 13, 2025

Summary

Before we turn on the process that loads opportunities into our search index, we want to add a filter so that only certain file types get uploaded. For example, we don't want to upload an .mp4 file, since the file needs to be roughly a text file.

Filter to only files with the following suffixes (case-insensitive):

  • .txt
  • .pdf
  • .docx
  • .doc
  • .xlsx
  • .xlsm
  • .html
  • .htm
  • .pptx
  • .ppt
  • .rtf

Note that filtering with a set lookup like the following is more efficient than any sort of looping:

ALLOWED_ATTACHMENT_SUFFIXES = set(["pdf", "docx", ...])

file_suffix = my_attachment.file_name.lower().split(".")[-1]

if file_suffix in ALLOWED_ATTACHMENT_SUFFIXES:
    # do something with it
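
A minimal runnable sketch of the same idea, filled in with the suffix list from this issue. The helper name `is_allowed_attachment` is hypothetical and not taken from the codebase, so the actual change may look different.

```python
# Suffix allow list from this issue, stored without the leading dot.
ALLOWED_ATTACHMENT_SUFFIXES = frozenset(
    ["txt", "pdf", "docx", "doc", "xlsx", "xlsm", "html", "htm", "pptx", "ppt", "rtf"]
)


def is_allowed_attachment(file_name: str) -> bool:
    """Hypothetical helper: True if the file name ends in an allowed suffix."""
    # rpartition returns an empty separator when the name has no "." at all,
    # so a file literally named "pdf" is not treated as having a .pdf suffix.
    _, dot, suffix = file_name.lower().rpartition(".")
    return dot == "." and suffix in ALLOWED_ATTACHMENT_SUFFIXES


print(is_allowed_attachment("Budget.PDF"))  # True
print(is_allowed_attachment("rover.mp4"))   # False
print(is_allowed_attachment("pdf"))         # False
```

Lowercasing before the lookup gives the case-insensitive match, and the set lookup avoids looping over the allowed suffixes for every attachment.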

Acceptance criteria

  • Filter added
@chouinar
Collaborator Author

Some data for deciding the open questions.

Mime type counts by file suffix:

text/richtext 1
	.rtf: 1
/ 1
	.pdf: 1
application/vnd.fdf 1
	.fdf: 1
audio/mp3 1
	.mp3: 1
audio/x-m4a 1
	.m4a: 1
multipart/related 1
	.mht: 1
application/octet-stream; name="LM2023456789123 (1)" 1
	.LM: 1
application/octet-stream; name=IMLS_Library_4_0-V4.0_F831.xls 1
	.xls: 1
application/octet-stream; name=1234-rover_and_rocks_medium.mp4 1
	.mp4: 1
text/plain; charset=us-ascii; name=RDFile1.txt 1
	.txt: 1
application/octet-stream; name=1234-instructions5.docx 1
	.docx: 1
application/octect-stream 2
	.pdf: 2
application/vnd.ms-xpsdocument 2
	.xps: 2
application/vnd.oasis.opendocument.text 2
	.odt: 2
application/vnd.openxmlformats-officedocument.wordprocessingml.template 2
	.dotx: 2
video/mp4 2
	.mp4: 2
audio/wav 2
	.wav: 2
text/calendar 3
	.ics: 3
text/plain; charset=us-ascii 3
	.txt: 2
	.LM2023456789123456: 1
image/pjpeg 5
	.jpg: 5
application/vnd.ms-excel.sheet.binary.macroEnabled.12 5
	.xlsb: 5
application/vnd.ms-word.document.macroEnabled.12 6
	.docm: 6
audio/mpeg 7
	.mp3: 7
image/png 9
	.png: 6
	.PNG: 3
message/rfc822 23
	.mht: 23
application/vnd.ms-excel.sheet.macroEnabled.12 30
	.xlsm: 30
image/jpeg 32
	.jpg: 21
	.JPG: 11
text/plain 39
	.txt: 37
	.htm: 2
application/vnd.ms-powerpoint 66
	.ppt: 64
	.PPT: 2
application/zip 154
	.zip: 154
application/vnd.openxmlformats-officedocument.presentationml.presentation 270
	.pptx: 268
	.PPTX: 2
application/vnd.ms-excel 464
	.xls: 453
	.XLS: 7
	.csv: 4
application/x-zip-compressed 606
	.zip: 604
	.docx: 2
text/html 2428
	.html: 2385
	.htm: 43
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet 3910
	.xlsx: 3907
	.XLSX: 3
application/octet-stream 5932
	.pdf: 2102
	.doc: 1692
	.docx: 1528
	.xlsx: 277
	.pptx: 199
	.xls: 27
	.msg: 15
	.rtf: 13
	.zip: 12
	.0]: 10
	.zipx: 7
	.url: 5
	.ART Method I RFA (Attachment A) FINAL: 4
	.ART Method I Attachment B FINAL: 4
	.PPTX: 4
	.ART Method I Cover Letter'09 FINAL: 3
	.4]: 3
	.23: 3
	.htm: 2
	.DOC: 2
	.aoa cms affordable care act: 2
	.baa: 2
	.price sheet: 2
	.dwg: 2
	.docm: 1
	.ART Method II Cover letter '09 FINAL: 1
	.ART Method II  RFA '09 FINAL: 1
	.wbk: 1
	.gov: 1
	.lyr: 1
	.dotx: 1
	.far: 1
	.All Qs and As NOFO: 1
	.Budget: 1
	.PDF: 1
	.LMTOMCA1T: 1
application/msword 7531
	.doc: 7388
	.rtf: 71
	.DOC: 67
	.dot: 3
	.Doc: 2
application/vnd.openxmlformats-officedocument.wordprocessingml.document 17213
	.docx: 17188
	.DOCX: 25
application/pdf 62123
	.pdf: 61748
	.PDF: 353
	.Budget: 11
	.RFA 521-08-006 Questions and Answers: 2
	.noi: 1
	.participants list: 1
	.One WASH National Program Document: 1
	.P18AS00496: 1
	.P18AS00502 NOI: 1
	.P18AS00498 NOI: 1
	.720-668-RFI-19-SLO: 1
	.LMTOMCAT: 1
	.Key_Contacts: 1

File suffixes in prod data:

.pdf: 64207
.docx: 18744
.doc: 9151
.xlsx: 4187
.html: 2385
.zip: 770
.xls: 488
.pptx: 473
.rtf: 85
.ppt: 66
.htm: 47
.txt: 40
.jpg: 37
.xlsm: 30
.mht: 24
.msg: 15
.budget: 12
.0]: 10
.png: 9
.mp3: 8
.docm: 7
.zipx: 7
.url: 5
.xlsb: 5
.art method i rfa (attachment a) final: 4
.art method i attachment b final: 4
.csv: 4
.art method i cover letter'09 final: 3
.dot: 3
.dotx: 3
.mp4: 3
.4]: 3
.ics: 3
.23: 3
.rfa 521-08-006 questions and answers: 2
.aoa cms affordable care act: 2
.baa: 2
.price sheet: 2
.xps: 2
.odt: 2
.dwg: 2
.wav: 2
.art method ii cover letter '09 final: 1
.art method ii  rfa '09 final: 1
.wbk: 1
.gov: 1
.noi: 1
.participants list: 1
.lyr: 1
.fdf: 1
.one wash national program document: 1
.far: 1
.all qs and as nofo: 1
.p18as00496: 1
.p18as00502 noi: 1
.p18as00498 noi: 1
.m4a: 1
.720-668-rfi-19-slo: 1
.lm: 1
.lmtomcat: 1
.lm2023456789123456: 1
.lmtomca1t: 1
.key_contacts: 1
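
For reference, tallies like the two above could be produced with a pair of `collections.Counter` objects. This is only a sketch: the `(mime_type, file_name)` rows are made-up samples rather than prod data, the function name is hypothetical, and the split-on-last-dot rule is an assumption about how the suffixes were derived.

```python
from collections import Counter, defaultdict


def tally_suffixes(attachments):
    """Count suffixes per mime type (case preserved) and overall (lowercased)."""
    by_mime = defaultdict(Counter)
    overall = Counter()
    for mime_type, file_name in attachments:
        # Assumption: everything after the last "." is treated as the suffix,
        # so a name with a stray dot and no real extension yields an odd suffix.
        suffix = "." + file_name.rsplit(".", 1)[-1]
        by_mime[mime_type][suffix] += 1
        overall[suffix.lower()] += 1
    return by_mime, overall


# Made-up sample rows for illustration only.
sample = [
    ("application/pdf", "nofo.pdf"),
    ("application/pdf", "NOFO.PDF"),
    ("application/octet-stream", "nofo.pdf"),
    ("application/octet-stream", "v1.0 Budget"),
]
by_mime, overall = tally_suffixes(sample)
for suffix, count in overall.most_common():
    print(f"{suffix}: {count}")
```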

@chouinar
Collaborator Author

A lot of the weird mime types come from weird files, so I think it's best to ignore those.

Assuming we just want text-like file types, filtering on mime type wouldn't work well. application/octet-stream is too generic and covers too many different file types, and application/pdf similarly includes a few files without a .pdf suffix.

I think if we assume the suffixes on the file names are generally right, we could filter on those instead.

If so, we could limit it to the following file types:

  • .pdf
  • .docx
  • .doc
  • .xlsx
  • .html
  • .pptx (maybe?)
  • .rtf
  • .ppt (maybe?)
  • .htm
  • .xlsm

That would still cover every text file type with at least 10 records. We need to check whether any of these are prohibitively large.
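
One way to check a proposed list against the suffix counts posted earlier in this thread is a small script like the one below. It is a sketch, not part of the actual change: the counts are copied from the "File suffixes in prod data" list (only suffixes with at least 10 records), and the variable names are made up. Which of the left-out suffixes count as text-like is a judgment call the script does not make.

```python
# Counts copied from the "File suffixes in prod data" list in this thread,
# limited to suffixes with at least 10 records.
PROD_SUFFIX_COUNTS = {
    "pdf": 64207, "docx": 18744, "doc": 9151, "xlsx": 4187, "html": 2385,
    "zip": 770, "xls": 488, "pptx": 473, "rtf": 85, "ppt": 66, "htm": 47,
    "txt": 40, "jpg": 37, "xlsm": 30, "mht": 24, "msg": 15, "budget": 12,
    "0]": 10,
}

# The suffixes proposed in this comment.
PROPOSED_SUFFIXES = {
    "pdf", "docx", "doc", "xlsx", "html", "pptx", "rtf", "ppt", "htm", "xlsm",
}

# Suffixes with at least 10 records that the proposed list would skip.
left_out = {s: n for s, n in PROD_SUFFIX_COUNTS.items() if s not in PROPOSED_SUFFIXES}
# Number of records (among the suffixes above) that the proposed list keeps.
covered = sum(n for s, n in PROD_SUFFIX_COUNTS.items() if s in PROPOSED_SUFFIXES)

print("left out:", left_out)
print("records covered:", covered)
```

With these numbers, the largest suffixes left out are `.zip`, `.xls`, and `.txt`.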

@chouinar
Collaborator Author

Skimming through some of the largest PDFs (50 MB+), it seems like the issue is just poor compression/optimization? Some are clearly scans of printouts, and some are just oddly formatted, but they all seem valid?

It might be a case where we just need to test the file size. I verified the lower environment doesn't have anything too big either.

@babebe babebe self-assigned this Jan 16, 2025
@babebe babebe moved this from Todo to In Progress in Simpler.Grants.gov Product Backlog Jan 16, 2025
@babebe babebe moved this from In Progress to In Review in Simpler.Grants.gov Product Backlog Jan 16, 2025
babebe added a commit that referenced this issue Jan 17, 2025
## Summary
Fixes #3497

### Time to review: __5 mins__

## Changes proposed
Added a filter that keeps only files with the allowed suffixes (case-insensitive).
Updated the test.