Generating bounding boxes with UDOP #87
We have some similar objectives. For example, in question answering, the answer will be followed by its bounding box. So this is indeed possible, as long as the format follows "[text sequence] " |
Thank you for your answer @zinengtang! By the way, I'm using the (not yet merged) code from the HuggingFace PR that is porting UDOP into Transformers. It works like a charm, but there might be some differences with your code. |
@AleRosae can you share any snippets of your use of the PR? I got stuck on an early step. Thanks in advance. |
Hi @sromoam,
You can obtain input_ids, bbox, attention_mask and pixel_values using the UdopProcessor:
For finetuning, you can follow the Pix2Struct tutorial. Just be sure to also include words and bboxes in your dataloader, as Pix2Struct only takes images as input. |
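The snippet referenced above did not survive the page export. A minimal sketch of what the processor call and dataloader preparation could look like, assuming Transformers >= 4.39 and the `microsoft/udop-large` checkpoint (the file name and OCR words/boxes below are placeholders, and the exact call signature may have differed in the unmerged PR):

```python
# Sketch: preparing UDOP inputs with UdopProcessor (Transformers >= 4.39).
# The words/boxes come from your own OCR, with boxes on the 0-1000 scale.

def collate_batch(examples):
    """Group per-example fields into lists ready to pass to UdopProcessor.
    Each example is a dict with 'image', 'words', 'boxes' and 'target' keys
    (key names are illustrative, not a fixed API)."""
    return {
        "images": [ex["image"] for ex in examples],
        "words": [ex["words"] for ex in examples],
        "boxes": [ex["boxes"] for ex in examples],
        "targets": [ex["target"] for ex in examples],
    }

if __name__ == "__main__":
    from PIL import Image
    from transformers import UdopProcessor, UdopForConditionalGeneration

    processor = UdopProcessor.from_pretrained("microsoft/udop-large")
    model = UdopForConditionalGeneration.from_pretrained("microsoft/udop-large")

    image = Image.open("page.png").convert("RGB")  # hypothetical file
    words = ["Name:", "Jane", "Doe"]               # your OCR words
    boxes = [[100, 200, 150, 250], [160, 200, 210, 250], [220, 200, 270, 250]]

    encoding = processor(images=image, text=words, boxes=boxes,
                         return_tensors="pt")
    # encoding holds input_ids, attention_mask, bbox and pixel_values
    outputs = model.generate(**encoding, max_new_tokens=50)
    print(processor.batch_decode(outputs, skip_special_tokens=True))
```

For finetuning, `collate_batch` is where you would make sure words and boxes travel with each batch, per the Pix2Struct note above.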
Can you please share which libraries you imported for UdopForConditionalGeneration? I am getting an error like this |
Solved the issue: we need transformers version "4.39.0.dev0" |
@zinengtang I want to use the processor with my own OCR. What should be the format of the bounding boxes? 1. Normalized with height and width? 2. Normalized with height and width, then * 1000? 3. Another option?
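This question goes unanswered in the thread. For what it's worth, processors in the LayoutLM family, which UdopProcessor follows, expect boxes as integers on a 0-1000 scale, i.e. option 2 (treat this as a convention to verify against the docs, not a confirmed answer from the maintainers). A small sketch:

```python
def normalize_bbox(bbox, width, height):
    """Convert an (x0, y0, x1, y1) box in pixels to the 0-1000 integer
    scale used by LayoutLM-family processors (assumption: UDOP follows
    the same convention)."""
    x0, y0, x1, y1 = bbox
    return [int(1000 * x0 / width), int(1000 * y0 / height),
            int(1000 * x1 / width), int(1000 * y1 / height)]

# e.g. a box on an 800x600 page
print(normalize_bbox((40, 30, 400, 60), 800, 600))  # → [50, 50, 500, 100]
```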
Hi,
By reading the UDOP paper, my understanding is that during pre-training the model is taught to predict the layout of a target (textual) sequence using special layout tokens.
I was wondering whether it is possible to exploit this capability during finetuning as well, e.g. to finetune the model using target sequences such as:
<key> Name <loc_100> <loc_200> <loc_150> <loc_250> </key> <value> Jane Doe <loc_110> <loc_210> <loc_160> <loc_260> </value>
Ideally, could this approach make it possible to establish a correspondence between the generated text (e.g. the name) and its position within the document page?
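If finetuning on such targets works, recovering that correspondence would come down to parsing the generated string. A sketch, assuming the `<key>`/`<value>`/`<loc_N>` format proposed above (the tag names are the asker's illustration, not an official UDOP vocabulary):

```python
import re

# Match "<key> text <loc_a> <loc_b> <loc_c> <loc_d> </key>" spans
# (and the same for <value>), assuming four <loc_N> tokens per span.
SPAN = re.compile(r"<(key|value)>\s*(.*?)\s*((?:<loc_\d+>\s*){4})</\1>")

def parse_spans(sequence):
    """Return (role, text, [x0, y0, x1, y1]) triples from a generated
    sequence in the format sketched above."""
    spans = []
    for role, text, locs in SPAN.findall(sequence):
        box = [int(n) for n in re.findall(r"<loc_(\d+)>", locs)]
        spans.append((role, text, box))
    return spans

seq = ("<key> Name <loc_100> <loc_200> <loc_150> <loc_250> </key> "
       "<value> Jane Doe <loc_110> <loc_210> <loc_160> <loc_260> </value>")
print(parse_spans(seq))
```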