Working filtering pipeline #31

eiennohito · 2023-11-24T07:41:57Z

No description provided.

The primary update in this pull request is a change to the document processing logic in `DeduplicateParagraphs.scala`. Instead of directly returning the filtered document, the `processDocumentParts` method now returns an array of `ProcessedDocument`s, each containing both the document text and the filter that was applied. The `filterDuplicateDocs` method has been updated to accommodate this change. The commit also fixes an invalid … . `DataframeUtils.scala` is renamed to `BuilderSyntax.scala`, and its methods have been updated to reflect their more general use beyond DataFrames. In `Pipeline.scala`, the `splitByFilteredParagraphs` method is introduced to split paragraphs into separate documents based on certain criteria. In the keyword filter lists, lines starting with the `#` character are treated as comments and are not used for filtering. Overall, these changes are intended to improve the accuracy and efficiency of the document processing pipeline.
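The comment handling for keyword filter lists could look roughly like the sketch below. It is not taken from the PR: `KeywordList` and `parse` are hypothetical names, and skipping blank lines and trimming whitespace before the `#` check are assumptions made for illustration.

```scala
// Hypothetical helper illustrating comment handling in keyword lists.
// The object name, method name, and signature are NOT from the PR.
object KeywordList {

  /** Returns the usable keywords from the raw lines of a keyword list.
    *
    * Lines starting with '#' are treated as comments and dropped.
    * Assumption: surrounding whitespace is trimmed first, and blank
    * lines are dropped as well.
    */
  def parse(lines: Iterable[String]): Vector[String] =
    lines.iterator
      .map(_.trim)
      .filter(line => line.nonEmpty && !line.startsWith("#"))
      .toVector
}

object KeywordListExample extends App {
  val raw = Seq(
    "# section header comment",
    "keyword1",
    "",
    "  # indented comment",
    "keyword2"
  )
  // prints "keyword1, keyword2"
  println(KeywordList.parse(raw).mkString(", "))
}
```

Returning a strict `Vector` rather than a lazy collection keeps the result safe to reuse after the underlying file handle is closed.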

eiennohito added 14 commits November 23, 2023 16:34

full filtering wip

25f68ca

clean documents while repackaging

c5928ff

fix LargeFreqParagraphs logic

834e688

use another flag for debugging filtering

c5a72ff

experimental treatment for high-freq paragraphs

3c95eda

use spark sql functionality for aggregation without instantiating RDDs

4fe0fc8

fix NoContentDOM filter

46bdfa9

clean up dom-snooping filters

cf4e38c

fix randomized hashes

728cd65

fix text-only output

6edfc3c

update testing config

304e4af

reformat code

e3fc5b2

add documentation on Zipf workaround

b2b2825

eiennohito merged commit 60783e2 into main Nov 24, 2023
2 checks passed

eiennohito deleted the feature/arseny/full-filtering branch November 24, 2023 08:19
