Merge pull request #77 from pymupdf/v0.0.10
Changes for v0.0.10
JorjMcKie authored Jul 21, 2024
2 parents f779e1e + 52ed7aa commit 4e0d6c6
Showing 5 changed files with 245 additions and 94 deletions.
10 changes: 8 additions & 2 deletions docs/src/api.rst
@@ -10,7 +10,7 @@ API

Prints the version of the library.

.. method:: to_markdown(doc: pymupdf.Document | str, *, pages: list | range | None = None, hdr_info: Any = None, write_images: bool = False, dpi: int = 150, margins=(0, 50, 0, 50), page_chunks: bool = False, page_width: float = 612, page_height: float = None, table_strategy="lines_strict", graphics_limit: int = None) -> str | list[dict]
.. method:: to_markdown(doc: pymupdf.Document | str, *, pages: list | range | None = None, hdr_info: Any = None, write_images: bool = False, dpi: int = 150, image_path="", image_format="png", force_text=True, margins=(0, 50, 0, 50), page_chunks: bool = False, page_width: float = 612, page_height: float = None, table_strategy="lines_strict", graphics_limit: int = None) -> str | list[dict]

Reads the pages of the file and outputs their text in |Markdown| format. A number of parameters control how this happens in detail. Note that there is **support for building page chunks** from the |Markdown| text.

@@ -20,10 +20,16 @@ API

:arg hdr_info: optional. Use this if you want to provide your own header detection logic. This may be a callable or an object having a method named `get_header_id`. It must accept a text span (a span dictionary as contained in `extractDict <https://pymupdf.readthedocs.io/en/latest/textpage.html#span-dictionary>`_) and a keyword parameter "page" (which is the owning `Page <https://pymupdf.readthedocs.io/en/latest/page.html>`_ object). It must return a string "" or up to 6 "#" characters followed by 1 space. If omitted, a full document scan will be performed to find the most popular font sizes and derive header levels based on them. To completely avoid this behavior specify `hdr_info=lambda s, page=None: ""` or `hdr_info=False`.
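As an illustration of the callable form of `hdr_info`, the following is a minimal sketch, not the library's own header logic. It assumes the span dictionary carries a `size` key (the font size), and the thresholds 18 and 14 are hypothetical values chosen only for this example.

```python
def size_based_header(span: dict, page=None) -> str:
    """Return a Markdown header prefix for a text span, or "" for body text.

    The thresholds below are hypothetical; tune them to your documents.
    """
    size = span.get("size", 0.0)  # font size of the span (assumed key)
    if size >= 18:
        return "# "   # level-1 header: one "#" followed by one space
    if size >= 14:
        return "## "  # level-2 header
    return ""         # anything smaller is treated as body text
```

Such a function could then be passed as `hdr_info=size_based_header` in the call to `to_markdown`.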

:arg bool write_images: when encountering images or vector graphics, PNG images will be created from the respective page area and stored in the folder of the document. Markdown references will be generated pointing to these images. Any text contained in these areas will not be included in the text output (but appear as part of the images). Therefore, if your document has text written on full page images, make sure to set this parameter to `False`.
:arg bool write_images: when encountering images or vector graphics, images will be created from the respective page area and stored in the specified folder. Markdown references will be generated pointing to these images. Any text contained in these areas will not be included in the text output (but appear as part of the images). Therefore, if for instance your document has text written on full page images, make sure to set this parameter to `False`.

:arg int dpi: specify the desired image resolution in dots per inch. Relevant only if `write_images=True`. Default value is 150.

:arg str image_path: store images in this folder. Relevant if `write_images=True`. Default is the path of the script directory.

:arg str image_format: specify the desired image format via its extension. Default is "png" (portable network graphics). Another popular format may be "jpg". Possible values are all `supported output formats <https://pymupdf.readthedocs.io/en/latest/pixmap.html#supported-output-image-formats>`_.

:arg bool force_text: generate text output even for text that overlaps images or graphics. This text then appears after the respective image. If `write_images=True`, this parameter may be set to `False` to suppress repetition of text on images.
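The image-related parameters above can be combined as in the following sketch. The helper names `image_export_options` and `pdf_to_markdown` are hypothetical and exist only for this example; the keyword names are the ones documented above. The package import is deferred so that the option builder can be used without the package installed.

```python
def image_export_options(folder: str, fmt: str = "png",
                         keep_text: bool = False, dpi: int = 150) -> dict:
    """Build keyword arguments for to_markdown() that enable image export."""
    return {
        "write_images": True,     # create image files and reference them in the output
        "image_path": folder,     # folder that receives the image files
        "image_format": fmt,      # image file extension, e.g. "png" or "jpg"
        "dpi": dpi,               # image resolution in dots per inch
        "force_text": keep_text,  # False suppresses repetition of text shown on images
    }


def pdf_to_markdown(pdf_path: str, **options):
    """Convert a PDF to Markdown text (hypothetical convenience wrapper)."""
    import pymupdf4llm  # imported lazily; requires the package to be installed

    return pymupdf4llm.to_markdown(pdf_path, **options)
```

A call might then look like `pdf_to_markdown("input.pdf", **image_export_options("images", fmt="jpg"))`, where the file and folder names are placeholders.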

:arg float,list margins: a float or a sequence of 2 or 4 floats specifying page borders. Only objects inside the margins will be considered for output.

* `margins=f` yields `(f, f, f, f)` for `(left, top, right, bottom)`.
22 changes: 22 additions & 0 deletions docs/src/changes.rst
@@ -4,6 +4,28 @@
Change Log
===========================================================================

Changes in version 0.0.10
--------------------------

Fixes:
~~~~~~~

* `73 <https://github.com/pymupdf/RAG/issues/73>`_ "bug in to_markdown internal function"
* `74 <https://github.com/pymupdf/RAG/issues/74>`_ "minimum area for images & vector graphics"
* `75 <https://github.com/pymupdf/RAG/issues/75>`_ "Poor Markdown Generation for Particular PDF"
* `76 <https://github.com/pymupdf/RAG/issues/76>`_ "suggestion on useful api parameters"


Improvements:
~~~~~~~~~~~~~~
* Improved recognition of "insignificant" vector graphics. Graphics like text highlights or borders will be ignored.
* The format of saved images can now be controlled via new parameter `image_format`.
* Images can be stored in a specific folder via the new parameter `image_path`.
* Images are **not stored if contained** in another image on the same page.
* Images are **not stored if too small:** if width or height is less than 5% of the corresponding page dimension.
* All text is always written. If `write_images=True`, text on images / graphics can be suppressed by setting `force_text=False`.
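The two image filtering rules listed above (containment and minimum size) can be sketched as follows. This is an illustration of the stated rules only, not the library's actual implementation. Rectangles are assumed to be plain `(x0, y0, x1, y1)` tuples, and all function names are hypothetical.

```python
def is_too_small(rect, page_width, page_height, min_fraction=0.05):
    """True if width or height is below min_fraction of the page dimension."""
    x0, y0, x1, y1 = rect
    return (x1 - x0) < min_fraction * page_width or (y1 - y0) < min_fraction * page_height


def is_contained(inner, outer):
    """True if rectangle `inner` lies entirely within rectangle `outer`."""
    return (inner[0] >= outer[0] and inner[1] >= outer[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def images_to_store(rects, page_width, page_height):
    """Keep rectangles that are large enough and not inside another rectangle."""
    kept = []
    for rect in rects:
        if is_too_small(rect, page_width, page_height):
            continue
        # Identical rectangles are not treated as containing each other here.
        if any(other != rect and is_contained(rect, other) for other in rects):
            continue
        kept.append(rect)
    return kept
```

On a 612 x 792 page, for instance, the size thresholds work out to 30.6 and 39.6 points respectively.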


Changes in version 0.0.9
--------------------------

2 changes: 1 addition & 1 deletion pymupdf4llm/pymupdf4llm/__init__.py
@@ -1,6 +1,6 @@
from .helpers.pymupdf_rag import IdentifyHeaders, to_markdown

__version__ = "0.0.9"
__version__ = "0.0.10"
version = __version__
version_tuple = tuple(map(int, version.split(".")))
