
Generating bounding boxes with UDOP #87

Open
AleRosae opened this issue Jul 17, 2023 · 7 comments

@AleRosae
Hi,

From reading the UDOP paper, my understanding is that during pre-training the model is taught to predict the layout of a target (textual) sequence using special layout tokens.
I was wondering whether it is possible to exploit this capability during finetuning as well, e.g. by finetuning the model with target sequences such as: <key> Name <loc_100> <loc_200> <loc_150> <loc_250> </key> <value> Jane Doe <loc_110> <loc_210> <loc_160> <loc_260> </value>

Ideally, could this approach make it possible to maintain a correspondence between the generated text (e.g. the name) and its position within the document page?
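
For illustration, here is a minimal sketch of how such a target sequence could be assembled. The helper functions are hypothetical, and the coordinates are assumed to be already quantized into the layout vocabulary (see the normalization discussion below):

def to_loc_tokens(box):
    # box = (x0, y0, x1, y1), each coordinate already an integer bin index
    return " ".join(f"<loc_{int(c)}>" for c in box)

def build_target(key, key_box, value, value_box):
    # Assemble a <key>/<value> target string in the format from the example above.
    return (f"<key> {key} {to_loc_tokens(key_box)} </key> "
            f"<value> {value} {to_loc_tokens(value_box)} </value>")

target = build_target("Name", (100, 200, 150, 250), "Jane Doe", (110, 210, 160, 260))
# -> '<key> Name <loc_100> <loc_200> <loc_150> <loc_250> </key> <value> Jane Doe <loc_110> <loc_210> <loc_160> <loc_260> </value>'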

@zinengtang
Collaborator

We have some similar objectives. For example, in question answering, the answer is followed by its bounding box. So this is indeed possible, as long as the target follows the format "[text sequence] [layout tokens]".

@AleRosae
Author

Thank you for your answer @zinengtang!
So, if I'm not mistaken, to do so we should first normalize the original bounding boxes to the range [0, 1000] based on the width and height of the original image; then rescale them to [0, 1]; and then convert them into layout tokens by multiplying them by the layout vocabulary size (500). Am I getting it right?
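
In code, my understanding of that pipeline would be something like the sketch below (my reading of the conversion, not the reference implementation; the function name and the example page size are made up):

def box_to_layout_bins(box, width, height, vocab_size=500):
    x0, y0, x1, y1 = box  # pixel coordinates from the OCR engine
    # Step 1: normalize to the [0, 1000] range used by the processor.
    norm = [1000 * x0 / width, 1000 * y0 / height,
            1000 * x1 / width, 1000 * y1 / height]
    # Steps 2-3: rescale to [0, 1], then multiply by the layout vocabulary size.
    return [int(round(c / 1000 * vocab_size)) for c in norm]

bins = box_to_layout_bins((50, 120, 300, 180), width=1654, height=2339)
loc_tokens = " ".join(f"<loc_{b}>" for b in bins)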

Btw, I'm using the (not yet merged) code from the HuggingFace PR that is porting UDOP into Transformers. It works like a charm, but there might be some differences from your code.

@sromoam

sromoam commented Jul 18, 2023

@AleRosae, can you share any snippets of your use of the PR? I got stuck at an early step.

Thanks in advance.

@AleRosae
Author

Hi @sromoam,
for inference you can use the standard generate() method:

model = UdopForConditionalGeneration.from_pretrained("udop_model")
outputs = model.generate(
    input_ids=input_ids,
    bbox=bbox,
    attention_mask=attention_mask,
    pixel_values=pixel_values,
    max_length=512,
    use_cache=False,
    num_beams=1,
    return_dict_in_generate=True,
)

You can obtain input_ids, bbox, attention_mask, and pixel_values using the UdopProcessor:

processor = UdopProcessor.from_pretrained("udop_model", apply_ocr=True)
encoding = processor(images=image, return_tensors="pt").to(device)
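
The generated sequence can then be decoded back to text through the processor (a usage sketch based on the variables above):

# Decode the generated ids back into text. Depending on how the <loc_...>
# tokens are registered, skip_special_tokens=True may strip them, so keep
# it False if you need the layout tokens in the output.
generated = processor.batch_decode(outputs.sequences, skip_special_tokens=False)
print(generated[0])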

For finetuning, you can follow the Pix2Struct tutorial. Just be sure to also include words and bboxes in your dataloader, as Pix2Struct only takes images as input.
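
For example, the per-batch encoding could look roughly like this (a sketch, assuming the UdopProcessor API from the PR with apply_ocr=False and your own words/boxes):

# Supply your own OCR results instead of relying on the built-in OCR.
processor = UdopProcessor.from_pretrained("udop_model", apply_ocr=False)
encoding = processor(
    images=image,
    text=words,    # list of OCR words
    boxes=boxes,   # one [x0, y0, x1, y1] box per word, normalized to [0, 1000]
    return_tensors="pt",
)
# Tokenize the target sequence to get the labels for teacher forcing.
labels = processor.tokenizer(target_sequence, return_tensors="pt").input_ids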

@jainamhdoshi

Hi @sromoam, for inference you can use the standard generate() method: […]

Can you please tell me which library you imported UdopForConditionalGeneration from? I am getting this error:
ImportError: cannot import name 'UdopForConditionalGeneration' from 'transformers'

@jainamhdoshi

Solved the issue: we need transformers version "4.39.0.dev0",
which can be cloned from here: https://github.com/huggingface/transformers/blob/main/src/transformers/__init__.py
(the commit on Mar 18, 2024).
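
A common way to get a dev build like 4.39.0.dev0 is to install Transformers from source (pin the specific commit if you need that exact date):

pip install git+https://github.com/huggingface/transformers.git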

@Joao-M-Silva

Joao-M-Silva commented Jun 7, 2024

@zinengtang I want to use the processor with my own OCR. What should be the format of the bounding boxes?

1. Normalized by height and width?
2. Normalized by height and width, then multiplied by 1000?
3. Another option?
