Commit

updates.
juexZZ committed Jun 25, 2024
1 parent 3b6833c commit 9d471dd
Showing 3 changed files with 28 additions and 2 deletions.
8 changes: 6 additions & 2 deletions index.html
@@ -37,11 +37,15 @@
flex-direction: column;
}
.stage {
margin-bottom: 20px;
+max-width: 1000px;
+margin-top: 20px;
+margin-left: auto;
+margin-right: auto;
}
.stage h3 {
text-align: left;
-margin-left: 400px;
+margin-left: 20px;
}
.result {
text-align: center;
@@ -129,7 +133,7 @@ <h1 class="title is-2 publication-title">Tell Me Where You Are: Multimodal LLMs

<div class="is-size-5 publication-authors">
<span class="author-block">
<a href="zonglinl.github.io">Zonglin Lyu</a>,</span>
<a href="https://zonglinl.github.io">Zonglin Lyu</a>,</span>

<span class="author-block">
<a href="https://juexzz.github.io">Juexiao Zhang</a>,</span>
Binary file added misc/images/LLM-VPR.jpg
22 changes: 22 additions & 0 deletions readme.md
@@ -1,6 +1,28 @@
## *Tell Me Where You Are*: Multimodal LLMs Meet Place Recognition
[Zonglin Lyu](https://zonglinl.github.io/), [Juexiao Zhang](https://juexzz.github.io/), [Mingxuan Lu](https://scholar.google.com/citations?user=m4ChlREAAAAJ&hl=en), [Yiming Li](https://yimingli-page.github.io/), [Chen Feng](https://ai4ce.github.io/)

![image](./misc/images/Teaser.jpg)

### Abstract

Large language models (LLMs) exhibit a variety of promising capabilities in robotics,
including long-horizon planning and commonsense reasoning.
However, their performance in place recognition is still underexplored.
In this work, we introduce multimodal LLMs (MLLMs) to visual place recognition (VPR),
where a robot must localize itself using visual observations.
Our key design is to use *vision-based retrieval* to propose several candidates and then leverage *language-based reasoning*
to carefully inspect each candidate for a final decision.
Specifically, we leverage the robust visual features produced by off-the-shelf vision foundation models (VFMs) to obtain several candidate locations.
We then prompt an MLLM to describe the differences between the current observation and each candidate in a pairwise manner,
and reason about the best candidate based on these descriptions. Our method is termed **LLM-VPR**.
Results on three datasets demonstrate that integrating the *general-purpose visual features* from VFMs with the *reasoning capabilities* of MLLMs
already provides an effective place recognition solution, *without any VPR-specific supervised training*.
We believe LLM-VPR can inspire new possibilities for applying and designing foundation models, i.e., VFMs, LLMs, and MLLMs,
to enhance the localization and navigation of mobile robots.
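
To make the two-stage pipeline concrete, below is a minimal sketch in Python. It assumes a hypothetical `embed_image` wrapper around an off-the-shelf VFM encoder and a hypothetical `query_mllm` interface to a multimodal LLM (the module names `vfm_wrapper` and `mllm_wrapper`, the prompts, and the default `k=3` are placeholders, not this repository's API); the actual LLM-VPR prompts and models may differ.

```python
import numpy as np

# Hypothetical wrappers (not part of this repo):
#   embed_image(img)            -> 1-D feature vector from an off-the-shelf VFM encoder
#   query_mllm(prompt, images)  -> text response from a multimodal LLM
from vfm_wrapper import embed_image
from mllm_wrapper import query_mllm


def retrieve_candidates(query_img, database, k=3):
    """Stage 1: vision-based retrieval with VFM features (cosine similarity)."""
    q = embed_image(query_img)
    q = q / np.linalg.norm(q)
    feats = np.stack([embed_image(img) for img, _ in database])
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    top_k = np.argsort(-(feats @ q))[:k]
    return [database[i] for i in top_k]  # list of (image, place_id) pairs


def rerank_with_mllm(query_img, candidates):
    """Stage 2: language-based reasoning over pairwise descriptions."""
    descriptions = []
    for cand_img, place_id in candidates:
        # Describe each candidate against the current observation, pairwise.
        text = query_mllm(
            "Describe the differences between these two images of places.",
            images=[query_img, cand_img],
        )
        descriptions.append(f"Candidate {place_id}: {text}")
    # Reason over all descriptions to pick the best candidate.
    decision = query_mllm(
        "Given the current observation and the candidate descriptions below, "
        "which candidate shows the same place? Answer with the candidate id.\n"
        + "\n".join(descriptions),
        images=[query_img],
    )
    return decision
```

The retrieval stage is what keeps the number of MLLM calls small: only the top-k candidates are described and compared before the final decision.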

![image](./misc/images/LLM-VPR.jpg)

**🔍 Please check out the [project website](https://ai4ce.github.io/LLM4VPR/) for more details.**

### Datasets

