LLM Spatial Layout

Using LLM structured output capabilities to generate reliable spatial layouts from image descriptions.

An extension of the GPT-4-based box layout generation from the amazing Grounded Text-to-Image Synthesis with Attention Refocusing paper.

Approach & Motivation

In the original Attention Refocusing paper, layouts in the form of bounding boxes were generated by prompting GPT-4. This was implemented by providing in-context examples and asking the model to generate box coordinates for each object in a description of an image.

However, as shown in the paper's results, while GPT-4 was able to generate valid formats quite consistently, it was still not 100% reliable (98.5%, as mentioned in the LLM evaluation section).

To mitigate this issue, I re-implemented the layout generation scripts using both OpenAI's structured outputs beta and Ollama's structured outputs, the latter enabling the use of open-source models.
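
The two scripts differ in their details, but the core pattern is roughly the sketch below: define a Pydantic schema for the layout and pass it to the API, which constrains the model's output to that schema. The Object and Layout field names mirror the raw outputs shown in the Results section; the prompt text, canvas size, and model names are illustrative placeholders rather than the exact ones used in the scripts.

    # Sketch of the structured-output pattern (illustrative, not the repo's exact code).
    from pydantic import BaseModel
    from openai import OpenAI
    from ollama import chat

    class Object(BaseModel):
        name: str
        x0: int
        y0: int
        x1: int
        y1: int

    class Layout(BaseModel):
        objects: list[Object]

    description = "Three colorful parrots perching on cherry blossom tree branch"
    prompt = f"Generate bounding boxes on a 512x512 canvas for each object in: {description}"

    # OpenAI: the beta parse endpoint accepts the Pydantic model directly as response_format.
    client = OpenAI()
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": prompt}],
        response_format=Layout,
    )
    openai_layout = completion.choices[0].message.parsed

    # Ollama: pass the JSON schema via the format parameter, then validate the reply.
    response = chat(
        model="qwen2.5:7b",
        messages=[{"role": "user", "content": prompt}],
        format=Layout.model_json_schema(),
    )
    ollama_layout = Layout.model_validate_json(response.message.content)

    print(openai_layout)  # e.g. objects=[Object(name='Parrot 1', x0=70, y0=100, ...), ...]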

Improvements

  • Almost always ensures a consistent output format, allowing for reliable layout generation
  • Simplifies prompting and code in general, reducing the need for extensive in-context examples to enforce output structure
  • Use of open-source models makes the attention refocusing method more accessible (free and doesn't require an API subscription), allowing more users to experiment locally

Running the code

Setup

Create and activate a conda environment:

conda create -n "llm-layout" python=3.13
conda activate llm-layout

Install the required packages:

pip install -r requirements.txt

If using Ollama, first check that your Ollama version is >= 0.5.1, since structured outputs are only available in newer versions:

ollama --version

If your version is older, upgrade as described in the Ollama docs. I had to manually uninstall and reinstall a newer version (0.5.4) from the Ollama website.

Then, be sure to pull the models you want to use before running the scripts:

ollama pull [model name]

I used the following for my short experiment:

  • llama2:13b
  • llama3.1:8b
  • qwen2.5:7b

For OpenAI, first set your OpenAI API key:

    export OPENAI_API_KEY="your_api_key_here"

For Windows:

    setx OPENAI_API_KEY "your_api_key_here"

Generating layouts

Commands

Both scripts (OpenAI and Ollama) run the same way: simply specify the model as a command-line argument. The example below uses qwen2.5:7b through Ollama.

    python3 generate_layout_ollama.py --model qwen2.5:7b

When using the OpenAI script, please use only the following models, as specified in the OpenAI API docs:

  • o1-2024-12-17 and later
  • gpt-4o-mini-2024-07-18 and later
  • gpt-4o-2024-08-06 and later
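
For example, an invocation of the OpenAI script might look like the following (the script name here assumes it mirrors the Ollama one; check the repo for the exact filename):

    python3 generate_layout_openai.py --model gpt-4o-2024-08-06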

Usage

Once the script is run, you will first be prompted for a description of your desired image. Example: "Three colorful parrots perching on cherry blossom tree branch"

Then, enter a name for the file you want the result to be saved as (you do not need to specify an extension, just the name). Example: "threeparrots"

Finally, the raw structured output will be printed to the terminal for reference, and the image with the drawn and labeled bounding boxes will be saved in the ./outputs folder.
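
Rendering the boxes is straightforward with Pillow. The following is only a rough sketch, assuming a blank 512x512 canvas (the coordinate range seen in the raw outputs below) and the Layout/Object schema sketched earlier; the actual script may differ in styling and details.

    # Illustrative sketch: draw a parsed layout onto a blank 512x512 canvas and save it.
    # Assumes `layout` is a Layout instance as in the earlier structured-output sketch.
    from PIL import Image, ImageDraw

    def save_layout_image(layout, name):
        img = Image.new("RGB", (512, 512), "white")
        draw = ImageDraw.Draw(img)
        for obj in layout.objects:
            # Outline each bounding box and label it with the object name.
            draw.rectangle((obj.x0, obj.y0, obj.x1, obj.y1), outline="red", width=2)
            draw.text((obj.x0 + 4, obj.y0 + 4), obj.name, fill="red")
        img.save(f"./outputs/{name}.png")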

Results

Here are several example results using different models and prompts. Some of the prompts are taken from the paper's results and are denoted with an asterisk (*). All results shown here were generated with gpt-4o, but results from other models are already included in the outputs folder!

Several of the examples I created myself test whether the models understand object relationships and interactions. As mentioned in my review of this paper, a potential drawback of box-based attention refocusing that I think could be investigated further is how it affects occlusions and object interactions.

*Prompt 1: "Three colorful parrots perching on cherry blossom tree branch"

Raw output:

    -----Image Description-----

    Three colorful parrots perching on cherry blossom tree branch
    
    -----Model Output-----
    
    objects=[Object(name='Parrot 1', x0=70, y0=100, x1=170, y1=200), Object(name='Parrot 2', x0=200, y0=100, x1=300, y1=200), Object(name='Parrot 3', x0=330, y0=100, x1=430, y1=200), Object(name='Cherry Blossom Branch', x0=50, y0=250, x1=462, y1=290)]

threeparrots

*Prompt 2: "a horse below a car."

horsecar

Prompt 3: "two rabbits enjoying a birthday cake on a hill at sunset"

Raw output:

    -----Image Description-----

    two rabbits enjoying a birthday cake on a hill at sunset
    
    -----Model Output-----
    
    objects=[Object(name='Rabbit 1', x0=50, y0=256, x1=150, y1=356), Object(name='Rabbit 2', x0=190, y0=256, x1=290, y1=356), Object(name='Birthday Cake', x0=110, y0=306, x1=230, y1=356), Object(name='Sunset', x0=0, y0=0, x1=512, y1=150), Object(name='Hill', x0=0, y0=206, x1=512, y1=512)]

rabbitcake

Prompt 4: "mother pushing a stroller with a baby inside"

momstroller

Evaluation

In section "4.5. Large language model evaluation", the authors evaluate how several LLMs perform at the layout generation task. They evaluate the following metrics: format (correctness of the response format), valid (validity of the bounding boxes), and correctness (whether the output matches the text prompt). They find that GPT-4 performs best, with scores of 98.5%, 98.5%, and 88.5% on each metric, respectively.

To replicate these results and test the improvements from using structured outputs in my scripts, I implemented an evaluation script. I adhere as closely as possible to the evaluation method described in the paper, randomly sampling 200 prompts from the four HRS categories provided in the paper's repo. Then, following the metric descriptions, I evaluate several models.
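
As a rough illustration of the two automatic metrics (not necessarily the exact logic in evaluation.py), format can be treated as a successful schema parse and valid as a bounds check on each box, reusing the Layout schema from the earlier sketch and assuming a 512x512 canvas:

    # Hypothetical sketch of the format and valid checks; evaluation.py's exact criteria may differ.
    from pydantic import ValidationError

    def check_format(raw_json: str) -> bool:
        """Format: the raw response parses into the Layout schema."""
        try:
            Layout.model_validate_json(raw_json)
            return True
        except ValidationError:
            return False

    def check_valid(layout, size: int = 512) -> bool:
        """Valid: every box lies inside the canvas with positive width and height."""
        return all(
            0 <= o.x0 < o.x1 <= size and 0 <= o.y0 < o.y1 <= size
            for o in layout.objects
        )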

To run the evaluation yourself, use the following command:

    python3 evaluation.py --model model_name --type model_type
  • the model_type argument can be either "openai" or "ollama", depending on the model service used
  • again, for Ollama models you must first use ollama pull to install a model, then specify the model_name argument exactly as the name is written in Ollama (e.g. "llama2:13b")

Results

Below are the results of my benchmark evaluation, with parentheses showing the improvement over the paper's reported results for Llama 2 13B.

Model          Format (%)    Valid (%)    Runtime (s)
qwen2.5:7b     100           100          3197.89
gpt-4o         100           99           528.12
llama3.1:8b    100           88           2561.33
llama2:13b     100 (+1.5)    87 (+3)      4613.76

NOTE: the runtimes are included for reference, but the long times for the Ollama models are likely due to the limitations of my personal laptop (M2 MacBook Air, 24 GB) and will vary depending on your machine. Similarly, I was unable to test any models larger than the 8-13 billion parameter range; running one model takes up the majority of my RAM as wired memory, as seen here:

ollama_ram

Conclusions & Further Work

In conclusion, I demonstrate how leveraging the new structured output capabilities of LLMs can improve the layout generation portion of the Attention Refocusing method. By leveling the playing field in terms of format consistency, we now have many more options for which models to use, whereas previously only the best or largest models could achieve good performance at this task. More work and investigation can be done to further test the capability of this approach, and as mentioned earlier I have yet to implement actual image generation based on these layouts. By plugging this layout generation tool into the diffusion models used in the paper, we can test whether the smaller models can actually produce better results in terms of adherence to text prompts, in addition to producing valid layouts.
