Commit

updates.
juexZZ committed Jun 25, 2024
1 parent 3b6833c commit 9d471dd
Showing 3 changed files with 28 additions and 2 deletions.
8 changes: 6 additions & 2 deletions index.html
@@ -37,11 +37,15 @@
flex-direction: column;
}
.stage {
margin-bottom: 20px;
+max-width: 1000px;
+margin-top: 20px;
+margin-left: auto;
+margin-right: auto;
}
.stage h3 {
text-align: left;
-margin-left: 400px;
+margin-left: 20px;
}
.result {
text-align: center;
@@ -129,7 +133,7 @@ <h1 class="title is-2 publication-title">Tell Me Where You Are: Multimodal LLMs

<div class="is-size-5 publication-authors">
<span class="author-block">
<a href="zonglinl.github.io">Zonglin Lyu</a>,</span>
<a href="https://zonglinl.github.io">Zonglin Lyu</a>,</span>

<span class="author-block">
<a href="https://juexzz.github.io">Juexiao Zhang</a>,</span>
Binary file added misc/images/LLM-VPR.jpg
22 changes: 22 additions & 0 deletions readme.md
@@ -1,6 +1,28 @@
## *Tell Me Where You Are*: Multimodal LLMs Meet Place Recognition
[Zonglin Lyu](https://zonglinl.github.io/), [Juexiao Zhang](https://juexzz.github.io/), [Mingxuan Lu](https://scholar.google.com/citations?user=m4ChlREAAAAJ&hl=en), [Yiming Li](https://yimingli-page.github.io/), [Chen Feng](https://ai4ce.github.io/)

![image](./misc/images/Teaser.jpg)

### Abstract

Large language models (LLMs) exhibit a variety of promising capabilities in robotics,
including long-horizon planning and commonsense reasoning.
However, their performance in place recognition is still underexplored.
In this work, we introduce multimodal LLMs (MLLMs) to visual place recognition (VPR),
where a robot must localize itself using visual observations.
Our key design is to use *vision-based retrieval* to propose several candidates and then leverage *language-based reasoning*
to carefully inspect each candidate for a final decision.
Specifically, we leverage the robust visual features produced by off-the-shelf vision foundation models (VFMs) to obtain several candidate locations.
We then prompt an MLLM to describe the differences between the current observation and each candidate in a pairwise manner,
and reason about the best candidate based on these descriptions. Our method is termed **LLM-VPR**.
Results on three datasets demonstrate that integrating the *general-purpose visual features* from VFMs with the *reasoning capabilities* of MLLMs
already provides an effective place recognition solution, *without any VPR-specific supervised training*.
We believe LLM-VPR can inspire new possibilities for applying and designing foundation models, i.e., VFMs, LLMs, and MLLMs,
to enhance the localization and navigation of mobile robots.
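
To make the two-stage pipeline concrete, below is a minimal sketch in Python. It assumes a hypothetical `embed_image` wrapper around an off-the-shelf VFM encoder and a hypothetical `query_mllm` interface to a multimodal LLM (the module names `vfm_wrapper` and `mllm_wrapper`, the prompts, and the default `k=3` are placeholders, not this repository's API); the actual LLM-VPR prompts and models may differ.

```python
import numpy as np

# Hypothetical wrappers (not part of this repo):
#   embed_image(img)            -> 1-D feature vector from an off-the-shelf VFM encoder
#   query_mllm(prompt, images)  -> text response from a multimodal LLM
from vfm_wrapper import embed_image
from mllm_wrapper import query_mllm


def retrieve_candidates(query_img, database, k=3):
    """Stage 1: vision-based retrieval with VFM features (cosine similarity)."""
    q = embed_image(query_img)
    q = q / np.linalg.norm(q)
    feats = np.stack([embed_image(img) for img, _ in database])
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    top_k = np.argsort(-(feats @ q))[:k]
    return [database[i] for i in top_k]  # list of (image, place_id) pairs


def rerank_with_mllm(query_img, candidates):
    """Stage 2: language-based reasoning over pairwise descriptions."""
    descriptions = []
    for cand_img, place_id in candidates:
        # Describe each candidate against the current observation, pairwise.
        text = query_mllm(
            "Describe the differences between these two images of places.",
            images=[query_img, cand_img],
        )
        descriptions.append(f"Candidate {place_id}: {text}")
    # Reason over all descriptions to pick the best candidate.
    decision = query_mllm(
        "Given the current observation and the candidate descriptions below, "
        "which candidate shows the same place? Answer with the candidate id.\n"
        + "\n".join(descriptions),
        images=[query_img],
    )
    return decision
```

The retrieval stage is what keeps the number of MLLM calls small: only the top-k candidates are described and compared before the final decision.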

![image](./misc/images/LLM-VPR.jpg)

**🔍 Please check out the [project website](https://ai4ce.github.io/LLM4VPR/) for more details.**

### Datasets

