
Generating bounding boxes with UDOP #87

Open
AleRosae opened this issue Jul 17, 2023 · 7 comments

@AleRosae
Hi,

From reading the UDOP paper, my understanding is that during pre-training the model is taught to predict the layout of a target (textual) sequence using special layout tokens.
I was wondering whether it is possible to exploit this capability during finetuning as well, e.g. by finetuning the model with target sequences such as: <key> Name <loc_100> <loc_200> <loc_150> <loc_250> </key> <value> Jane Doe <loc_110> <loc_210> <loc_160> <loc_260> </value>

Ideally, could this approach make it possible to maintain a correspondence between the generated text (e.g. the name) and its position within the document page?
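
For illustration, here is a minimal sketch of how such a target sequence could be assembled. The helper functions are hypothetical, and the coordinates are assumed to be already quantized into the layout vocabulary (see the normalization discussion below):

def to_loc_tokens(box):
    # box = (x0, y0, x1, y1), each coordinate already an integer bin index
    return " ".join(f"<loc_{int(c)}>" for c in box)

def build_target(key, key_box, value, value_box):
    # Assemble a <key>/<value> target string in the format from the example above.
    return (f"<key> {key} {to_loc_tokens(key_box)} </key> "
            f"<value> {value} {to_loc_tokens(value_box)} </value>")

target = build_target("Name", (100, 200, 150, 250), "Jane Doe", (110, 210, 160, 260))
# -> '<key> Name <loc_100> <loc_200> <loc_150> <loc_250> </key> <value> Jane Doe <loc_110> <loc_210> <loc_160> <loc_260> </value>'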

@zinengtang
Collaborator

We have some similar objectives. For example, in question answering, the answer is followed by its bounding box. So this is indeed possible, as long as the target follows the format "[text sequence] [layout tokens]".

@AleRosae
Author

Thank you for your answer @zinengtang!
So, if I'm not mistaken, to do so we should first normalize the original bounding boxes to the range [0, 1000] based on the width and height of the original image; then rescale them to [0, 1]; and then convert them into layout tokens by multiplying them by the layout vocabulary size (500). Am I getting it right?
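
In code, my understanding of that pipeline would be something like the sketch below (my reading of the conversion, not the reference implementation; the function name and the example page size are made up):

def box_to_layout_bins(box, width, height, vocab_size=500):
    x0, y0, x1, y1 = box  # pixel coordinates from the OCR engine
    # Step 1: normalize to the [0, 1000] range used by the processor.
    norm = [1000 * x0 / width, 1000 * y0 / height,
            1000 * x1 / width, 1000 * y1 / height]
    # Steps 2-3: rescale to [0, 1], then multiply by the layout vocabulary size.
    return [int(round(c / 1000 * vocab_size)) for c in norm]

bins = box_to_layout_bins((50, 120, 300, 180), width=1654, height=2339)
loc_tokens = " ".join(f"<loc_{b}>" for b in bins)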

Btw, I'm using the (not yet merged) code from the HuggingFace PR that is porting UDOP into Transformers. It works like a charm, but there might be some differences from your code.

@sromoam

sromoam commented Jul 18, 2023

@AleRosae, can you share any snippets of your use of the PR? I got stuck at an early step.

Thanks in advance.

@AleRosae
Author

Hi @sromoam,
for inference you can use the standard generate() method:

model = UdopForConditionalGeneration.from_pretrained("udop_model")
outputs = model.generate(
    input_ids=input_ids,
    bbox=bbox,
    attention_mask=attention_mask,
    pixel_values=pixel_values,
    max_length=512,
    use_cache=False,
    num_beams=1,
    return_dict_in_generate=True,
)

You can obtain input_ids, bbox, attention_mask, and pixel_values using the UdopProcessor:

processor = UdopProcessor.from_pretrained("udop_model", apply_ocr=True)
encoding = processor(images=image, return_tensors="pt").to(device)
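
The generated sequence can then be decoded back to text through the processor (a usage sketch based on the variables above):

# Decode the generated ids back into text. Depending on how the <loc_...>
# tokens are registered, skip_special_tokens=True may strip them, so keep
# it False if you need the layout tokens in the output.
generated = processor.batch_decode(outputs.sequences, skip_special_tokens=False)
print(generated[0])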

For finetuning, you can follow the Pix2Struct tutorial. Just be sure to also include words and bboxes in your dataloader, as Pix2Struct only takes images as input.
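
For example, the per-batch encoding could look roughly like this (a sketch, assuming the UdopProcessor API from the PR with apply_ocr=False and your own words/boxes):

# Supply your own OCR results instead of relying on the built-in OCR.
processor = UdopProcessor.from_pretrained("udop_model", apply_ocr=False)
encoding = processor(
    images=image,
    text=words,    # list of OCR words
    boxes=boxes,   # one [x0, y0, x1, y1] box per word, normalized to [0, 1000]
    return_tensors="pt",
)
# Tokenize the target sequence to get the labels for teacher forcing.
labels = processor.tokenizer(target_sequence, return_tensors="pt").input_ids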

@jainamhdoshi

Hi @sromoam, for inference you can use the standard generate() method: […]

Can you please tell me which library you imported UdopForConditionalGeneration from? I am getting this error:
ImportError: cannot import name 'UdopForConditionalGeneration' from 'transformers'

@jainamhdoshi

Solved the issue: we need transformers version "4.39.0.dev0",
which can be cloned from here: https://github.com/huggingface/transformers/blob/main/src/transformers/__init__.py
(the commit on Mar 18, 2024).
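
A common way to get a dev build like 4.39.0.dev0 is to install Transformers from source (pin the specific commit if you need that exact date):

pip install git+https://github.com/huggingface/transformers.git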

@Joao-M-Silva

Joao-M-Silva commented Jun 7, 2024

@zinengtang I want to use the processor with my own OCR. What should be the format of the bounding boxes?

1. Normalized by height and width?
2. Normalized by height and width, then multiplied by 1000?
3. Another option?
