<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta http-equiv="x-ua-compatible" content="ie=edge">
<title>Language-Conditioned Robotic Manipulation with Fast and Slow Thinking</title>
<meta name="description" content="">
<meta name="viewport" content="width=device-width, initial-scale=1">
<!-- <base href="/"> -->
<!--FACEBOOK-->
<meta name="og:image" content="https://innermonologue.github.io/img/teaser.png" />
<meta property="og:image" content="https://innermonologue.github.io/img/teaser.png" />
<meta property="og:image:type" content="image/png">
<meta property="og:image:width" content="2000">
<meta property="og:image:height" content="900">
<meta property="og:type" content="website" />
<meta property="og:url" content="https://inner-monologue.github.io/"/>
<meta property="og:title" content="Inner Monologue: Embodied Reasoning through Planning with Language Models." />
<meta property="og:description" content="Project page for Inner Monologue: Embodied Reasoning through Planning with Language Models" />
<!--TWITTER-->
<meta name="twitter:card" content="summary_large_image" />
<meta name="twitter:title" content="Inner Monologue: Embodied Reasoning through Planning with Language Models." />
<meta name="twitter:description" content="Project page for Project page for Inner Monologue: Embodied Reasoning through Planning with Language Models" />
<meta name="twitter:image" content="https://innermonologue.github.io/img/teaser.png" />
<!-- <link rel="apple-touch-icon" href="apple-touch-icon.png"> -->
<!-- <link rel="icon" type="image/png" href="img/seal_icon.png"> -->
<!-- Place favicon.ico in the root directory -->
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap.min.css">
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.4.0/css/font-awesome.min.css">
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/codemirror/5.8.0/codemirror.min.css">
<link rel="stylesheet" href="css/app.css">
<link rel="stylesheet" href="css/bootstrap.min.css">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.3/jquery.min.js"></script>
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/js/bootstrap.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/codemirror/5.8.0/codemirror.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/clipboard.js/1.5.3/clipboard.min.js"></script>
<script src="js/app.js"></script>
</head>
<body>
<div class="container" id="main">
<div class="row">
<h2 class="col-md-12 text-center">
<b>Language-Conditioned Robotic Manipulation<br>with Fast and Slow Thinking</b>
<!--<small>
</small>-->
</h2>
</div>
<div class="row">
<div class="col-md-12 text-center">
<ul class="list-inline">
<br>
<li>Minjie Zhu<sup>1,*</sup></li> <li>Yichen Zhu<sup>2,*</sup></li> <li>Jinming Li<sup>3</sup></li> <li>Junjie Wen<sup>1</sup></li> <li>Zhiyuan Xu<sup>2</sup></li> <li>Zhengping Che<sup>2</sup></li>
<br><br><li>Chaomin Shen<sup>1</sup></li> <li>Yaxin Peng<sup>3</sup></li> <li>Dong Liu<sup>2</sup></li> <li>Feifei Feng<sup>2</sup></li> <li>Jian Tang<sup>2</sup></li>
<br><br>
<!-- <a href="http://g.co/robotics"> -->
<!-- <img src="img/robotics-at-google.png" height="40px"> Robotics at Google</a> <br> -->
<h5><sup>1</sup> School of Computer Science, East China Normal University, China</h5>
<h5><sup>2</sup> Midea Group, China</h5>
<h5><sup>3</sup> Department of Mathematics, School of Science, Shanghai University, China</h5>
<h5> * Equal contribution. This work was done during the internships of Minjie Zhu, Jinming Li, and Junjie Wen at Midea Group.</h5>
</ul>
</div>
</div>
<div class="row">
<div class="col-md-4 col-md-offset-4 text-center">
<ul class="nav nav-pills nav-justified">
<li>
<a href="https://arxiv.org/abs/2401.04181">
<image src="./img/paper.png" height="60px">
<h4><strong>Paper</strong></h4>
</a>
</li>
<li>
<a href="https://youtu.be/qe-z8Llmdt8">
<image src="img/youtube_icon.png" height="60px">
<h4><strong>Video</strong></h4>
</a>
</li>
</ul>
</div>
</div>
<div class="row">
<div class="col-md-8 col-md-offset-2">
<h3>
Abstract
</h3>
<p class="text-justify">
Language-conditioned robotic manipulation aims to translate natural language instructions into executable actions, ranging from simple pick-and-place operations to tasks requiring intent recognition and visual reasoning.
Inspired by the dual-process theory in cognitive science, which posits two parallel systems of fast and slow thinking in human decision-making, we introduce Robotics with Fast and Slow Thinking (RFST), a framework that mimics the human cognitive architecture to classify tasks and route decisions to one of two systems based on the instruction type.
RFST consists of two key components: 1) an instruction discriminator that determines which system should be activated for the current user instruction, and 2) a slow-thinking system comprising a fine-tuned vision-language model aligned with the policy networks, which allows the robot to recognize the user's intention or perform reasoning tasks.
To assess our methodology, we built a dataset of real-world trajectories capturing actions that range from spontaneous impulses to tasks requiring deliberate contemplation. Our results, in both simulation and real-world scenarios, confirm that our approach adeptly handles intricate tasks demanding intent recognition and reasoning.
</p>
<div class="text-center">
<video id="v0" width="90%" playsinline loop controls autoplay muted>
<source src="img/rfst_demo_v3.mp4" type="video/mp4">
</video>
</div>
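<p class="text-justify">
To make the two-system design concrete, the sketch below outlines one possible control flow: an instruction discriminator selects a system, the fast-thinking path calls a reactive policy directly, and the slow-thinking path first asks the vision-language model to reason about the instruction before handing a grounded sub-goal to the policy. All class, function, and interface names in this sketch are illustrative placeholders, not the released RFST implementation.
</p>
<pre>
# Illustrative control flow for the two-system design; names are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class Observation:
    image: bytes          # current camera frame (stand-in type)
    instruction: str      # natural-language command from the user

class FastPolicy:
    def act(self, obs: Observation) -> str:
        # Reactive path: map (image, instruction) directly to a motor command.
        return f"execute: {obs.instruction}"

class SlowSystem:
    def __init__(self, vlm, policy: FastPolicy):
        self.vlm = vlm        # fine-tuned vision-language model (placeholder interface)
        self.policy = policy

    def act(self, obs: Observation) -> str:
        # Deliberate path: the VLM first turns an ambiguous or multi-step
        # instruction into an explicit sub-goal the policy can execute.
        subgoal = self.vlm.reason(obs.image, obs.instruction)
        return self.policy.act(Observation(obs.image, subgoal))

def rfst_step(obs: Observation, discriminator, fast: FastPolicy, slow: SlowSystem) -> str:
    # The instruction discriminator decides which system handles this request.
    system = discriminator(obs.instruction)      # returns "fast" or "slow"
    return fast.act(obs) if system == "fast" else slow.act(obs)
</pre>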
</div>
</div>
<div class="row">
<div class="col-md-8 col-md-offset-2">
<br>
<h3>
Framework
</h3>
<p class="text-justify">
Dual-process model research indicates that individuals engage with decisions in two primary ways: a rapid, instinctive, subconscious manner (referred to as “System 1 or Fast-thinking”)
and a measured, deliberate, conscious manner (“System 2 or Slow-thinking”). Based on this theory, we propose a framework that mimics human cognitive architecture to classify tasks and makes decisions on two systems based on instruction types.
Upon receiving an instruction, the robot processes it through DistilRoBERTa to obtain an embedding.
Leveraging embedding similarity search, we classified the instruction into either a fast-thinking system or a slow-thinking system.
The framework is shown in below.
<div class="text-center">
<image src="img/fst.png" width="80%">
</div>
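<p class="text-justify">
As a rough illustration of the routing step, the sketch below embeds an instruction with a DistilRoBERTa-based sentence encoder and assigns it to the fast- or slow-thinking system by cosine similarity to a small set of exemplar instructions. The checkpoint name, exemplar phrases, and decision rule are assumptions for illustration, not the exact setup used in the paper.
</p>
<pre>
# Hypothetical sketch of embedding-similarity instruction routing (not the authors' code).
from sentence_transformers import SentenceTransformer, util

# A DistilRoBERTa-based sentence encoder; the exact checkpoint used by RFST may differ.
encoder = SentenceTransformer("all-distilroberta-v1")

# Placeholder exemplar instructions for each system.
FAST_EXEMPLARS = ["pick up the red block", "put the banana in the bowl"]
SLOW_EXEMPLARS = ["I am thirsty, bring me something to drink",
                  "place as many blocks as the result of 2 + 3"]

fast_emb = encoder.encode(FAST_EXEMPLARS, convert_to_tensor=True)
slow_emb = encoder.encode(SLOW_EXEMPLARS, convert_to_tensor=True)

def route(instruction: str) -> str:
    """Return 'fast' or 'slow' depending on which exemplar set is closer."""
    q = encoder.encode(instruction, convert_to_tensor=True)
    fast_score = util.cos_sim(q, fast_emb).max().item()
    slow_score = util.cos_sim(q, slow_emb).max().item()
    return "fast" if fast_score >= slow_score else "slow"

print(route("stack the green block on the blue block"))   # likely 'fast'
print(route("I want a fruit that monkeys love"))          # likely 'slow'
</pre>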
</div>
</div>
<div class="row">
<div class="col-md-8 col-md-offset-2">
<h3>
Experiments
</h3>
<p class="text-justify">
We empirically assess the broad applicability of RFST across diverse tasks in both simulated and real-world settings.
</p>
<h4>
Experiments in simulation
</h4>
<p class="text-justify">
Success rates on VIMA-Bench over six tasks. The Tasks 1 and 2 belong to fast-thinking system, and Task 3-6 belong to slow-thinking system.
Our proposed RFST significantly outperforms other methods in accomplishing slow-thinking tasks, achieving notably higher success rates.
</p>
<div class="text-center">
<image src="img/vima_bench_exp.png" width="80%">
</div>
<h4>
Experiments in the real world
</h4>
<p class="text-justify">
The experiments on the real robot. Orange Bars: Slow-thinking tasks. Blue Bars: Fast-thinking tasks.
RFST empowers real robots to execute complex tasks such as mathematical reasoning and intent recognition,
which were traditionally beyond the scope of conventional robotic manipulation techniques.
</p>
<div class="text-center">
<image src="img/real_exp.png" width="80%">
</div>
</div>
</div>
<div class="col-md-8 col-md-offset-2">
<h3>
Citation
</h3>
<div class="form-group col-md-10 col-md-offset-1">
<textarea id="bibtex" class="form-control" readonly="" style="display: none;">@article{zhu2024language,
title={Language-Conditioned Robotic Manipulation with Fast and Slow Thinking},
author={Zhu, Minjie and Zhu, Yichen and Li, Jinming and Wen, Junjie and Xu, Zhiyuan and Che, Zhengping and Shen, Chaomin and Peng, Yaxin and Liu, Dong and Feng, Feifei and others},
journal={arXiv preprint arXiv:2401.04181},
year={2024}
}</textarea>
</div>
</div>
</div>
</body>
</html>