Add GOT-OCR 2.0 to Transformers #34721

yonigozlan · 2024-11-13T20:03:43Z

What does this PR do?

Add GOT-OCR 2.0 to Transformers.

Left TODOs:

Tests
Docs
Post-processing

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

HuggingFaceDocBuilderDev · 2024-11-25T20:22:13Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

molbap

Seems very clean, congrats! The modular file is a bit verbose still, I left some comments for possible leads to reduce it. Let's make sure slow tests run, otherwise LGTM, and left a couple comments :)

docs/source/en/model_doc/qwen2_vl.md

molbap · 2024-11-28T16:47:57Z

src/transformers/models/got_ocr2/convert_got_ocr2_weights_to_hf.py

+    write_tokenizer(
+        tokenizer_path="qwen.tiktoken",
+        save_dir=args.output_dir,
+        instruct=args.instruct,


Does this model have an instruct version needed?

Oh I forgot to remove that, thanks. I'd say the model is indeed instruct, but users are not really expected to have conversation with it, and the possible prompt are explicitly defined in the processor so I'm not sure in which categories this model falls and if we should add support for a chat template here.
I'd say it might make sense to add a chat template when and if we end up adding support for fine-tuning, but to me it doesn't make much sense in this state, I might be wrong though.
What do you think @molbap @Ucas-HaoranWei ?

From afar, maybe @Ucas-HaoranWei has a different opinion, I'd just remove the instruct - as you say users are not expected to converse with it. If it's instruct-tuned in a later version, it might make sense, but it's an OCR model as it stands, it sounds confusing to label it as instruct.

Yes, I agree that there should not be a chat template, as they are fixed prompts in this state. @molbap @yonigozlan. If the later versions can be fine-tuned, then the template can be maintained to ensure that users use their new prompts.

Sounds good thank you!

src/transformers/models/got_ocr2/modular_got_ocr2.py

molbap · 2024-11-28T17:57:37Z

src/transformers/models/got_ocr2/modular_got_ocr2.py

+        resized_image = image.resize((target_width, target_height))
+
+        # split the image into patches
+        processed_images = []


these are processed_patches, rather, yes?

Yes indeed, but the reason I did not rename it that is because we add to it a "thumbnail" image which is the whole image resized

molbap · 2024-11-28T19:54:25Z

src/transformers/models/got_ocr2/modular_got_ocr2.py

+        format = output_kwargs["text_kwargs"].pop("format", False)
+        num_image_tokens = output_kwargs["images_kwargs"].pop("num_image_tokens", 256)
+        box = output_kwargs["images_kwargs"].pop("box", [None])
+        color = output_kwargs["images_kwargs"].pop("color", None)
+        multi_page = output_kwargs["images_kwargs"].pop("multi_page", False)
+        crop_to_patches = output_kwargs["images_kwargs"].pop("crop_to_patches", False)


here we could use the default values that are preset above

Do you mean something like GotOcr2ProcessorKwargs._defaults["images_kwargs"].get("crop_to_patches")?

yes, for instance! what I mean is the default values in the pop method should not be specified in two places, because if they change, it's more likely to cause errors/mismatches

That makes sense. I might be missing something, but if we have default kwargs (for multi_page, crop_to_patches, min_patches, max_patches etc.), it seems to me that there shouldn't be any issues with using "pop" without default? if the kwarg is not specified by the user, pop will return the default, and it will return the user-specified value otherwise?

@molbap just pinging you on this last question

Ah, you're right - I guess it's just weird to me to see the default values written in two distinct places, does not seem necessary 🤔

src/transformers/models/got_ocr2/modular_got_ocr2.py

molbap · 2024-11-28T20:10:14Z

src/transformers/models/got_ocr2/modular_got_ocr2.py

+class GotOcr2VisionAdapter(nn.Module):
+    def __init__(self, language_hidden_size: int, vision_output_channels: int):
+        super().__init__()
+        self.conv_up1 = nn.Conv2d(


I assume up stands for upsampler like in swin2sr, but a more explicit name would be better

src/transformers/models/got_ocr2/modular_got_ocr2.py

tests/models/got_ocr2/test_modeling_got_ocr2.py

yonigozlan · 2024-11-29T16:05:40Z

Hey @ArthurZucker
This should be ready for you to review :).
There's one question left:

I'd say the model is indeed instruct, but users are not really expected to have conversation with it, and the possible prompt are explicitly defined in the processor so I'm not sure in which categories this model falls and if we should add support for a chat template here.
I'd say it might make sense to add a chat template when and if we end up adding support for fine-tuning, but to me it doesn't make much sense in this state, I might be wrong though.

piercelamb · 2024-12-18T15:31:37Z

HI all -- eager to try this model in transformers in the new year

GXKIM · 2025-01-06T06:38:16Z

HI all -- eager to try this model in transformers in the new year

pls

qubvel

Great work! Thanks for adding different examples to the documentation, this will help others explore the fantastic capabilities of this model! Added some comments regarding image processing and modeling below:

src/transformers/models/got_ocr2/modular_got_ocr2.py

docs/source/en/model_doc/got_ocr2.md

yonigozlan · 2025-01-06T19:15:33Z

Thanks @qubvel for the review! I have made the requested changes.

test3211234 · 2025-01-07T13:59:41Z

Does GOT OCR support returning the bounding boxes of detected text? Or else I have to go back to PaddleOCR which doesn't work on ZLUDA.

yonigozlan · 2025-01-07T17:01:56Z

Does GOT OCR support returning the bounding boxes of detected text? Or else I have to go back to PaddleOCR which doesn't work on ZLUDA.

No GOT OCR doesn't support returning the bounding boxes of detected text.

… conflict

ArthurZucker

Very nice leverage of modular, a few outstanding issues left IMO

ArthurZucker · 2025-01-23T11:46:35Z

docs/source/en/model_doc/got_ocr2.md

super nice !

docs/source/en/model_doc/qwen2_vl.md

src/transformers/models/got_ocr2/__init__.py

src/transformers/models/got_ocr2/convert_got_ocr2_weights_to_hf.py

src/transformers/models/got_ocr2/modular_got_ocr2.py

ArthurZucker · 2025-01-23T11:54:10Z

src/transformers/models/got_ocr2/modular_got_ocr2.py

+    def _check_call_arguments(self, images, box, color, multi_page, crop_to_patches):
+        if images is None:
+            raise ValueError("Images are required to be passed to the processor.")
+
+        if not isinstance(box, (list, tuple)):
+            raise ValueError("Box must be a list or tuple of lists in the form [x1, y1, x2, y2].")
+
+        if multi_page or crop_to_patches:
+            if multi_page and crop_to_patches:
+                raise ValueError("Cannot set both `multi_page` and `crop_to_patches` to `True`.")
+            if box[0] is not None or color is not None:
+                raise ValueError("Cannot pass `box` or `color` with multi-page inference.")
+
+        if box[0] is not None and color is not None:
+            raise ValueError("Both `box` and `color` cannot be set at the same time.")
+
+    def _make_list_of_inputs(self, images, text, box, color, multi_page):
+        if not isinstance(images, (list, tuple)):
+            if multi_page:
+                logger.warning("Multi-page inference is enabled but only one image is passed.")
+            images = [images]
+        elif isinstance(images[0], (list, tuple)) and not multi_page:
+            raise ValueError("Nested images are only supported with `multi_page` set to `True`.")
+        elif not isinstance(images[0], (list, tuple)) and multi_page:
+            images = [images]
+
+        if text is not None:
+            if not isinstance(text, (list, tuple)):
+                text = [text]
+            if len(text) != len(images):
+                raise ValueError("The number of `text` must match the number of images.")
+
+        if not isinstance(box[0], (list, tuple)):
+            # Use the same box for all images
+            box = [box for _ in range(len(images))]
+        if not isinstance(color, (list, tuple)):
+            color = [color for _ in range(len(images))]
+        if len(box) != len(images):
+            raise ValueError("The number of `box` must match the number of images.")
+        if len(color) != len(images):
+            raise ValueError("The number of `color` must match the number of images.")
+
+        return images, text, box, color


really not fan of this, it does not scale, and we do a lot of work for the user when we should be enforcing a single format IMO! See #34726 and why we don't want this

it's good to raise errors, but we need to put more thoughts into how to make our checks readable and find a fine line between enforcing a single input format but at the same time make it easy to use the model.

What are people going to use the model for and how -> most intuitive type of input etc

Yes I see, I've removed all the checks except the ones necessary to batch the inputs

src/transformers/models/got_ocr2/modular_got_ocr2.py

ArthurZucker · 2025-01-23T12:09:14Z

src/transformers/models/got_ocr2/modular_got_ocr2.py

+    def _check_call_arguments(self, images, box, color, multi_page, crop_to_patches):
+        if images is None:
+            raise ValueError("Images are required to be passed to the processor.")
+
+        if not isinstance(box, (list, tuple)):
+            raise ValueError("Box must be a list or tuple of lists in the form [x1, y1, x2, y2].")
+
+        if multi_page or crop_to_patches:
+            if multi_page and crop_to_patches:
+                raise ValueError("Cannot set both `multi_page` and `crop_to_patches` to `True`.")
+            if box[0] is not None or color is not None:
+                raise ValueError("Cannot pass `box` or `color` with multi-page inference.")
+
+        if box[0] is not None and color is not None:
+            raise ValueError("Both `box` and `color` cannot be set at the same time.")
+
+    def _make_list_of_inputs(self, images, text, box, color, multi_page):
+        if not isinstance(images, (list, tuple)):
+            if multi_page:
+                logger.warning("Multi-page inference is enabled but only one image is passed.")
+            images = [images]
+        elif isinstance(images[0], (list, tuple)) and not multi_page:
+            raise ValueError("Nested images are only supported with `multi_page` set to `True`.")
+        elif not isinstance(images[0], (list, tuple)) and multi_page:
+            images = [images]
+
+        if text is not None:
+            if not isinstance(text, (list, tuple)):
+                text = [text]
+            if len(text) != len(images):
+                raise ValueError("The number of `text` must match the number of images.")
+
+        if not isinstance(box[0], (list, tuple)):
+            # Use the same box for all images
+            box = [box for _ in range(len(images))]
+        if not isinstance(color, (list, tuple)):
+            color = [color for _ in range(len(images))]
+        if len(box) != len(images):
+            raise ValueError("The number of `box` must match the number of images.")
+        if len(color) != len(images):
+            raise ValueError("The number of `color` must match the number of images.")
+
+        return images, text, box, color


it's good to raise errors, but we need to put more thoughts into how to make our checks readable and find a fine line between enforcing a single input format but at the same time make it easy to use the model.

What are people going to use the model for and how -> most intuitive type of input etc

ArthurZucker · 2025-01-23T12:09:49Z

src/transformers/models/got_ocr2/modular_got_ocr2.py

+                prompt = (
+                    self.message_start_token
+                    + self.system_query
+                    + self.message_end_token
+                    + self.message_start_token
+                    + "user\n"
+                    + self.img_start_token
+                    + self.img_pad_token * num_image_tokens * num_images
+                    + self.img_end_token
+                    + "\n"
+                    + query
+                    + self.message_end_token
+                    + self.message_start_token
+                    + "assistant\n"
+                )
+                text.append(prompt)


is this not <=> to a chat template basically?

This is discussed here: #34721 (comment)
Indeed the model was trained as an instruct model with a chat template, but users are not really expected to converse with it, hence why there's this parametrized query instead of a full chat support.

ArthurZucker

MUCH much better 😉

src/transformers/models/got_ocr2/image_processing_got_ocr2.py

src/transformers/models/got_ocr2/modular_got_ocr2.py

ArthurZucker · 2025-01-30T10:30:52Z

src/transformers/models/got_ocr2/modular_got_ocr2.py

+
+        if pixel_values is not None:
+            image_features = self.get_image_features(pixel_values=pixel_values)
+            n_image_tokens = (input_ids == self.config.image_token_index).sum().item()


.item() is not really ideal here

Yes, we might want to remove it from llava as well?

ArthurZucker · 2025-01-30T10:31:44Z

src/transformers/models/got_ocr2/modular_got_ocr2.py

-    ) -> Union[Tuple, CausalLMOutputWithPast]:
+        cache_position: Optional[torch.LongTensor] = None,
+        logits_to_keep: Union[int, torch.Tensor] = 0,
+    ) -> Union[Tuple, LlavaCausalLMOutputWithPast]:


forward looks like it does not need to be overwritten no>

I overwrote it only because I don't need vision_feature_layer and vision_feature_select_strategy

…mers into add-got-ocr2

* init modular got_ocr2 * Get correct got_ocr architecture * add processing * run modular with processing * add working inference * apply modular * Refactor and fix style * Refactor, cleanup, fix style * fix init order * Fix docs * add base modeling tests * fix style and consistency * rename doc file * fix repo consistency * fix inference with box * add image processing and support for crop_to_multi_page * Fix batch inference * add tests * fixup * fix slow test * fix docstrings * Add model doc * update to new init * fix input autocast pixel_values dtype * update doc * move doc to multimodal * Reformat crop_image_to_patches and add docstrings * Fix example in forward docstring * Address Pablo review * [run slow] got_ocr2 * remove defaults defined twice * apply modular * add torch_device to integration tests * update modular * follow-up Pavel review * add device variable in doc * fix doc multi-page * Force eager attention for vision encoder to avoid attn implementation conflict * revert qwen2vl doc changes * use Qwen2ForCausalLM instead of Qwen2Model * make fixup * refactor gotocr2 to llava style * uniformize function names and reduce checks * final nits * fix pixel_values dtype error * change checkpoint names * fix modular

yonigozlan mentioned this pull request Nov 13, 2024

Add support for GOT-OCR2.0 #34173

Closed

2 tasks

yonigozlan force-pushed the add-got-ocr2 branch from 93b1d19 to af8035d Compare November 14, 2024 17:30

yonigozlan mentioned this pull request Nov 14, 2024

Integrating GOT-OCR2.0 in Transformers 🤗 Ucas-HaoranWei/GOT-OCR2.0#137

Open

yonigozlan added New model Multimodal labels Nov 14, 2024

Ucas-HaoranWei approved these changes Nov 15, 2024

View reviewed changes

Ucas-HaoranWei approved these changes Nov 18, 2024

View reviewed changes

yonigozlan mentioned this pull request Nov 21, 2024

Fix support for image processors modifications in modular #34866

Merged

5 tasks

yonigozlan force-pushed the add-got-ocr2 branch from f8e1ac9 to 4007fb2 Compare November 25, 2024 19:29

yonigozlan requested review from qubvel and molbap November 25, 2024 21:24

molbap approved these changes Nov 28, 2024

View reviewed changes

yonigozlan added the run-slow label Nov 29, 2024

yonigozlan requested review from ArthurZucker and molbap November 29, 2024 16:06

yonigozlan force-pushed the add-got-ocr2 branch from 7e88dbe to df94db3 Compare December 5, 2024 18:02

qubvel reviewed Jan 6, 2025

View reviewed changes

qubvel mentioned this pull request Jan 7, 2025

OverflowError: out of range integral type conversion attempted #35540

Closed

4 tasks

yonigozlan mentioned this pull request Jan 16, 2025

image_transforms preprocess quite slow when run large image with qwen2vl #34272

Closed

4 tasks

yonigozlan added 2 commits January 21, 2025 14:51

init modular got_ocr2

cede640

Get correct got_ocr architecture

72a003c

yonigozlan added 6 commits January 21, 2025 14:53

add torch_device to integration tests

3ef43ee

update modular

088672f

follow-up Pavel review

7e6bab9

add device variable in doc

1f5f054

fix doc multi-page

0ff44e8

Force eager attention for vision encoder to avoid attn implementation…

3ae43ec

… conflict

yonigozlan force-pushed the add-got-ocr2 branch from 6989c07 to 3ae43ec Compare January 21, 2025 16:20

ArthurZucker reviewed Jan 23, 2025

View reviewed changes

yonigozlan and others added 7 commits January 24, 2025 16:53

Merge remote-tracking branch 'upstream/main' into add-got-ocr2

8289e69

revert qwen2vl doc changes

9cbd804

use Qwen2ForCausalLM instead of Qwen2Model

c87fa62

make fixup

34c716e

refactor gotocr2 to llava style

0178c68

uniformize function names and reduce checks

3120412

Merge branch 'main' into add-got-ocr2

6b5169b

yonigozlan requested review from ArthurZucker and removed request for molbap January 28, 2025 00:54

Merge branch 'main' into add-got-ocr2

3b626be

ArthurZucker approved these changes Jan 30, 2025

View reviewed changes

yonigozlan and others added 7 commits January 30, 2025 16:49

final nits

b5b56d8

Merge branch 'add-got-ocr2' of https://github.com/yonigozlan/transfor…

991b3f3

…mers into add-got-ocr2

Merge branch 'main' into add-got-ocr2

9046bf5

fix pixel_values dtype error

cc861d3

change checkpoint names

506d566

Merge branch 'main' into add-got-ocr2

2dbbf24

fix modular

9fb4e90

yonigozlan merged commit 2b46943 into huggingface:main Jan 31, 2025
25 checks passed

Add GOT-OCR 2.0 to Transformers #34721

Add GOT-OCR 2.0 to Transformers #34721

Conversation

yonigozlan commented Nov 13, 2024 • edited Loading

What does this PR do?

Who can review?

HuggingFaceDocBuilderDev commented Nov 25, 2024

molbap left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yonigozlan commented Nov 29, 2024 • edited Loading

piercelamb commented Dec 18, 2024

GXKIM commented Jan 6, 2025

qubvel left a comment

Choose a reason for hiding this comment

yonigozlan commented Jan 6, 2025

test3211234 commented Jan 7, 2025

yonigozlan commented Jan 7, 2025

ArthurZucker left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ArthurZucker left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yonigozlan commented Nov 13, 2024 •

edited

Loading

yonigozlan commented Nov 29, 2024 •

edited

Loading