Ignore encryption (smalot#653)
* Add ability to ignore PDF encryption check

* Switch to ! syntax

* Update src/Smalot/PdfParser/Parser.php

* Additional changes for smalot#488

doc/Usage.md:

  - Moved description of `setIgnoreEncryption` option to doc/CustomConfig.md
  - Added brief "PDF encryption" section

doc/CustomConfig.md: added `setIgnoreEncryption` option and section to describe it.

src/Smalot/PdfParser/Config.php: Doc comment for Config::setIgnoreEncryption()

Added tests/PHPUnit/Integration/EncryptionTest.php

Added samples/not_really_encrypted.pdf (thanks to @parijke who
originally created this as test.pdf)

See smalot#653

* src/Smalot/PdfParser/Config.php: PHP-CS-Fixer issue fixed

* Update CustomConfig.md

refined texts

* Config.php: use explicit PHP doc entities

* ParserTest.php: moved tests

* removed EncryptionTest.php

---------

Co-authored-by: Jordan Hall <[email protected]>
Co-authored-by: Konrad Abicht <[email protected]>
3 people authored Dec 1, 2023
1 parent feaf39e commit 268a620
Showing 6 changed files with 83 additions and 1 deletion.
15 changes: 15 additions & 0 deletions doc/CustomConfig.md
@@ -21,6 +21,7 @@ The `Config` class has the following options:
|--------------------------|---------|-----------------|------------------------------------------------------------------------------------------------------------------------------------------------------|
| `setDecodeMemoryLimit` | Integer | `0` | If parsing fails because of memory exhaustion, you can set a lower memory limit for decoding operations. |
| `setFontSpaceLimit` | Integer | `-50` | Changing font space limit can be helpful when `Parser::getText()` returns a text with too many spaces. |
| `setIgnoreEncryption` | Boolean | `false` | Read PDFs that are not encrypted but have the encryption flag set. This is a temporary workaround, don't rely on it. |
| `setHorizontalOffset` | String | ` ` | When words are broken up or when the structure of a table is not preserved, you may get better results when adapting `setHorizontalOffset`. |
| `setPdfWhitespaces` | String | `\0\t\n\f\r ` | |
| `setPdfWhitespacesRegex` | String | `[\0\t\n\f\r ]` | |
@@ -63,3 +64,17 @@ $config->setFontSpaceLimit(-60);
$parser = new \Smalot\PdfParser\Parser([], $config);
$pdf = $parser->parseFile('document.pdf');
```

## option setIgnoreEncryption

In some cases PDF files may be internally marked as encrypted even though the content is not encrypted and can be read.
This can be caused by the PDF being created by a tool that does not properly set the encryption flag.
If you are sure that the PDF is not encrypted, you can ignore the encryption flag by setting the `ignoreEncryption` flag to `true` in a custom `Config` instance.

```php
$config = new \Smalot\PdfParser\Config();
$config->setIgnoreEncryption(true);

$parser = new \Smalot\PdfParser\Parser([], $config);
$pdf = $parser->parseFile('document.pdf');
```
11 changes: 11 additions & 0 deletions doc/Usage.md
@@ -230,3 +230,14 @@ foreach ($pages as $page) {
];
}
```

## PDF encryption

This library cannot currently read encrypted PDF files, i.e. those with
a read password. Attempting to do so produces this error:
```
Exception: Secured pdf file are currently not supported.
```

See `setIgnoreEncryption` option in [CustomConfig.md](CustomConfig.md)
for how to override the check in specific cases.
Binary file added samples/not_really_encrypted.pdf
Binary file not shown.
21 changes: 21 additions & 0 deletions src/Smalot/PdfParser/Config.php
@@ -82,6 +82,13 @@ class Config
     */
    private $dataTmFontInfoHasToBeIncluded = false;

    /**
     * Whether to attempt to read PDFs even if they are marked as encrypted.
     *
     * @var bool
     */
    private $ignoreEncryption = false;

    public function getFontSpaceLimit()
    {
        return $this->fontSpaceLimit;
@@ -151,4 +158,18 @@ public function setDataTmFontInfoHasToBeIncluded(bool $dataTmFontInfoHasToBeIncl
    {
        $this->dataTmFontInfoHasToBeIncluded = $dataTmFontInfoHasToBeIncluded;
    }

    public function getIgnoreEncryption(): bool
    {
        return $this->ignoreEncryption;
    }

    /**
     * @deprecated this is a temporary workaround, don't rely on it
     * @see https://github.com/smalot/pdfparser/pull/653
     */
    public function setIgnoreEncryption(bool $ignoreEncryption): void
    {
        $this->ignoreEncryption = $ignoreEncryption;
    }
}
2 changes: 1 addition & 1 deletion src/Smalot/PdfParser/Parser.php
@@ -102,7 +102,7 @@ public function parseContent(string $content): Document
        // Create structure from raw data.
        list($xref, $data) = $this->rawDataParser->parseData($content);

-        if (isset($xref['trailer']['encrypt'])) {
+        if (isset($xref['trailer']['encrypt']) && false === $this->config->getIgnoreEncryption()) {
            throw new \Exception('Secured pdf file are currently not supported.');
        }
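
The effect of the changed condition can be tried in isolation. The following is a minimal sketch, not the library's code: `assertReadable()` is a hypothetical stand-in for the check in `Parser::parseContent()`, and the `$xref` arrays are hand-built illustrations rather than output of the real raw data parser.

```php
<?php

/**
 * Hypothetical stand-in for the encryption check in Parser::parseContent().
 * Only the 'trailer' => 'encrypt' entry of $xref is inspected.
 *
 * @throws \Exception if the trailer has an encrypt entry and the check is not ignored
 */
function assertReadable(array $xref, bool $ignoreEncryption): void
{
    if (isset($xref['trailer']['encrypt']) && false === $ignoreEncryption) {
        throw new \Exception('Secured pdf file are currently not supported.');
    }
}

// Illustrative trailer data (assumed values, not taken from a real PDF).
$flagged = ['trailer' => ['encrypt' => '5_0']];
$plain = ['trailer' => []];

assertReadable($plain, false);  // no encrypt entry: no exception
assertReadable($flagged, true); // encrypt entry present, check ignored: no exception

try {
    assertReadable($flagged, false); // encrypt entry present, default setting: throws
} catch (\Exception $e) {
    echo $e->getMessage(), PHP_EOL;
}
```

With the default of `false` the behavior is unchanged from before this commit; only an explicit `setIgnoreEncryption(true)` skips the exception.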

35 changes: 35 additions & 0 deletions tests/PHPUnit/Integration/ParserTest.php
@@ -403,6 +403,41 @@ public function testRetainImageContentImpact(): void
        $this->assertLessThan($baselineMemory * 1.05, $usedMemory, 'Memory is '.$usedMemory);
        $this->assertTrue('' !== $document->getText());
    }

    /**
     * Tests handling of encrypted PDF.
     *
     * @see https://github.com/smalot/pdfparser/pull/653
     */
    public function testNoIgnoreEncryption(): void
    {
        $filename = $this->rootDir.'/samples/not_really_encrypted.pdf';
        $threw = false;
        try {
            (new Parser([]))->parseFile($filename);
        } catch (\Exception $e) {
            // we expect an exception to be thrown if an encrypted PDF is encountered.
            $threw = true;
        }
        $this->assertTrue($threw);
    }

    /**
     * Tests behavior if encryption is ignored.
     *
     * @see https://github.com/smalot/pdfparser/pull/653
     */
    public function testIgnoreEncryption(): void
    {
        $config = new Config();
        $config->setIgnoreEncryption(true);

        $filename = $this->rootDir.'/samples/not_really_encrypted.pdf';

        $this->assertTrue((new Parser([], $config))->parseFile($filename) instanceof Document);

        // without the configuration option set, an exception would be thrown.
    }
}

class ParserSub extends Parser