diff --git a/arxiv_trends_year.png b/arxiv_trends_year.png
index c98ae16..53e1b57 100644
Binary files a/arxiv_trends_year.png and b/arxiv_trends_year.png differ
diff --git a/arxiv_visual_reasoning.jsonl b/arxiv_visual_reasoning.jsonl
index a2fc6e9..c697b29 100644
--- a/arxiv_visual_reasoning.jsonl
+++ b/arxiv_visual_reasoning.jsonl
@@ -1,3 +1,29 @@
+{"entry_id": "2407.21438", "title": "A Plug-and-Play Method for Rare Human-Object Interactions Detection by Bridging Domain Gap", "authors": ["Lijun Zhang", "Wei Suo", "Peng Wang", "Yanning Zhang"], "published": "2024-07-31 08:42:48", "updated": "2024-07-31 08:42:48", "summary": "Human-object interactions (HOI) detection aims at capturing human-object\npairs in images and corresponding actions. It is an important step toward\nhigh-level visual reasoning and scene understanding. However, due to the\nnatural bias from the real world, existing methods mostly struggle with rare\nhuman-object pairs and lead to sub-optimal results. Recently, with the\ndevelopment of the generative model, a straightforward approach is to construct\na more balanced dataset based on a group of supplementary samples.\nUnfortunately, there is a significant domain gap between the generated data and\nthe original data, and simply merging the generated images into the original\ndataset cannot significantly boost the performance. To alleviate the above\nproblem, we present a novel model-agnostic framework called\n\\textbf{C}ontext-\\textbf{E}nhanced \\textbf{F}eature \\textbf{A}lignment (CEFA)\nmodule, which can effectively align the generated data with the original data\nat the feature level and bridge the domain gap. Specifically, CEFA consists of\na feature alignment module and a context enhancement module. On one hand,\nconsidering the crucial role of human-object pairs information in HOI tasks,\nthe feature alignment module aligns the human-object pairs by aggregating\ninstance information. On the other hand, to mitigate the issue of losing\nimportant context information caused by the traditional discriminator-style\nalignment method, we employ a context-enhanced image reconstruction module to\nimprove the model's learning ability of contextual cues. Extensive experiments\nhave shown that our method can serve as a plug-and-play module to improve the\ndetection performance of HOI models on rare\ncategories\\footnote{https://github.com/LijunZhang01/CEFA}.", "comment": null, "links": []}
+{"entry_id": "2407.21333", "title": "Chat2Layout: Interactive 3D Furniture Layout with a Multimodal LLM", "authors": ["Can Wang", "Hongliang Zhong", "Menglei Chai", "Mingming He", "Dongdong Chen", "Jing Liao"], "published": "2024-07-31 04:49:46", "updated": "2024-07-31 04:49:46", "summary": "Automatic furniture layout is long desired for convenient interior design.\nLeveraging the remarkable visual reasoning capabilities of multimodal large\nlanguage models (MLLMs), recent methods address layout generation in a static\nmanner, lacking the feedback-driven refinement essential for interactive user\nengagement. We introduce Chat2Layout, a novel interactive furniture layout\ngeneration system that extends the functionality of MLLMs into the realm of\ninteractive layout design. To achieve this, we establish a unified\nvision-question paradigm for in-context learning, enabling seamless\ncommunication with MLLMs to steer their behavior without altering model\nweights. Within this framework, we present a novel training-free visual\nprompting mechanism. This involves a visual-text prompting technique that\nassist MLLMs in reasoning about plausible layout plans, followed by an\nOffline-to-Online search (O2O-Search) method, which automatically identifies\nthe minimal set of informative references to provide exemplars for visual-text\nprompting. By employing an agent system with MLLMs as the core controller, we\nenable bidirectional interaction. The agent not only comprehends the 3D\nenvironment and user requirements through linguistic and visual perception but\nalso plans tasks and reasons about actions to generate and arrange furniture\nwithin the virtual space. Furthermore, the agent iteratively updates based on\nvisual feedback from execution results. Experimental results demonstrate that\nour approach facilitates language-interactive generation and arrangement for\ndiverse and complex 3D furniture.", "comment": "Main paper with supplemental materials", "links": []}
+{"entry_id": "2407.20563", "title": "Pyramid Coder: Hierarchical Code Generator for Compositional Visual Question Answering", "authors": ["Ruoyue Shen", "Nakamasa Inoue", "Koichi Shinoda"], "published": "2024-07-30 05:36:43", "updated": "2024-07-30 05:36:43", "summary": "Visual question answering (VQA) is the task of providing accurate answers to\nnatural language questions based on visual input. Programmatic VQA (PVQA)\nmodels have been gaining attention recently. These use large language models\n(LLMs) to formulate executable programs that address questions requiring\ncomplex visual reasoning. However, there are challenges in enabling LLMs to\ncomprehend the usage of image processing modules and generate relevant code. To\novercome these challenges, this paper introduces PyramidCoder, a novel\nprompting framework for PVQA models. PyramidCoder consists of three\nhierarchical levels, each serving a distinct purpose: query rephrasing, code\ngeneration, and answer aggregation. Notably, PyramidCoder utilizes a single\nfrozen LLM and pre-defined prompts at each level, eliminating the need for\nadditional training and ensuring flexibility across various LLM architectures.\nCompared to the state-of-the-art PVQA model, our approach improves accuracy by\nat least 0.5% on the GQA dataset, 1.4% on the VQAv2 dataset, and 2.9% on the\nNLVR2 dataset.", "comment": "Accepted to the IEEE International Conference on Image Processing\n (IEEE ICIP) 2024", "links": []}
+{"entry_id": "2407.19666", "title": "Take A Step Back: Rethinking the Two Stages in Visual Reasoning", "authors": ["Mingyu Zhang", "Jiting Cai", "Mingyu Liu", "Yue Xu", "Cewu Lu", "Yong-Lu Li"], "published": "2024-07-29 02:56:19", "updated": "2024-07-29 02:56:19", "summary": "Visual reasoning, as a prominent research area, plays a crucial role in AI by\nfacilitating concept formation and interaction with the world. However, current\nworks are usually carried out separately on small datasets thus lacking\ngeneralization ability. Through rigorous evaluation of diverse benchmarks, we\ndemonstrate the shortcomings of existing ad-hoc methods in achieving\ncross-domain reasoning and their tendency to data bias fitting. In this paper,\nwe revisit visual reasoning with a two-stage perspective: (1) symbolization and\n(2) logical reasoning given symbols or their representations. We find that the\nreasoning stage is better at generalization than symbolization. Thus, it is\nmore efficient to implement symbolization via separated encoders for different\ndata domains while using a shared reasoner. Given our findings, we establish\ndesign principles for visual reasoning frameworks following the separated\nsymbolization and shared reasoning. The proposed two-stage framework achieves\nimpressive generalization ability on various visual reasoning tasks, including\npuzzles, physical prediction, and visual question answering (VQA), encompassing\nboth 2D and 3D modalities. We believe our insights will pave the way for\ngeneralizable visual reasoning.", "comment": "ECCV 2024, Project page:\n https://mybearyzhang.github.io/projects/TwoStageReason/", "links": []}
+{"entry_id": "2407.19094", "title": "Solving Robotics Problems in Zero-Shot with Vision-Language Models", "authors": ["Zidan Wang", "Rui Shen", "Bradly Stadie"], "published": "2024-07-26 21:18:57", "updated": "2024-07-26 21:18:57", "summary": "We introduce Wonderful Team, a multi-agent visual LLM (VLLM) framework for\nsolving robotics problems in the zero-shot regime. By zero-shot we mean that,\nfor a novel environment, we feed a VLLM an image of the robot's environment and\na description of the task, and have the VLLM output the sequence of actions\nnecessary for the robot to complete the task. Prior work on VLLMs in robotics\nhas largely focused on settings where some part of the pipeline is fine-tuned,\nsuch as tuning an LLM on robot data or training a separate vision encoder for\nperception and action generation. Surprisingly, due to recent advances in the\ncapabilities of VLLMs, this type of fine-tuning may no longer be necessary for\nmany tasks. In this work, we show that with careful engineering, we can prompt\na single off-the-shelf VLLM to handle all aspects of a robotics task, from\nhigh-level planning to low-level location-extraction and action-execution.\nWonderful Team builds on recent advances in multi-agent LLMs to partition tasks\nacross an agent hierarchy, making it self-corrective and able to effectively\npartition and solve even long-horizon tasks. Extensive experiments on VIMABench\nand real-world robotic environments demonstrate the system's capability to\nhandle a variety of robotic tasks, including manipulation, visual\ngoal-reaching, and visual reasoning, all in a zero-shot manner. These results\nunderscore a key point: vision-language models have progressed rapidly in the\npast year, and should strongly be considered as a backbone for robotics\nproblems going forward.", "comment": "aka Wonderful Team", "links": []}
+{"entry_id": "2407.17791", "title": "Investigating learning-independent abstract reasoning in artificial neural networks", "authors": ["Tomer Barak", "Yonatan Loewenstein"], "published": "2024-07-25 05:58:58", "updated": "2024-07-25 05:58:58", "summary": "Humans are capable of solving complex abstract reasoning tests. Whether this\nability reflects a learning-independent inference mechanism applicable to any\nnovel unlearned problem or whether it is a manifestation of extensive training\nthroughout life is an open question. Addressing this question in humans is\nchallenging because it is impossible to control their prior training. However,\nassuming a similarity between the cognitive processing of Artificial Neural\nNetworks (ANNs) and humans, the extent to which training is required for ANNs'\nabstract reasoning is informative about this question in humans. Previous\nstudies demonstrated that ANNs can solve abstract reasoning tests. However,\nthis success required extensive training. In this study, we examined the\nlearning-independent abstract reasoning of ANNs. Specifically, we evaluated\ntheir performance without any pretraining, with the ANNs' weights being\nrandomly-initialized, and only change in the process of problem solving. We\nfound that naive ANN models can solve non-trivial visual reasoning tests,\nsimilar to those used to evaluate human learning-independent reasoning. We\nfurther studied the mechanisms that support this ability. Our results suggest\nthe possibility of learning-independent abstract reasoning that does not\nrequire extensive training.", "comment": null, "links": []}
+{"entry_id": "2407.17773", "title": "KiVA: Kid-inspired Visual Analogies for Testing Large Multimodal Models", "authors": ["Eunice Yiu", "Maan Qraitem", "Charlie Wong", "Anisa Noor Majhi", "Yutong Bai", "Shiry Ginosar", "Alison Gopnik", "Kate Saenko"], "published": "2024-07-25 05:02:39", "updated": "2024-07-25 05:02:39", "summary": "This paper investigates visual analogical reasoning in large multimodal\nmodels (LMMs) compared to human adults and children. A \"visual analogy\" is an\nabstract rule inferred from one image and applied to another. While benchmarks\nexist for testing visual reasoning in LMMs, they require advanced skills and\nomit basic visual analogies that even young children can make. Inspired by\ndevelopmental psychology, we propose a new benchmark of 1,400 visual\ntransformations of everyday objects to test LMMs on visual analogical reasoning\nand compare them to children and adults. We structure the evaluation into three\nstages: identifying what changed (e.g., color, number, etc.), how it changed\n(e.g., added one object), and applying the rule to new scenarios. Our findings\nshow that while models like GPT-4V, LLaVA-1.5, and MANTIS identify the \"what\"\neffectively, they struggle with quantifying the \"how\" and extrapolating this\nrule to new objects. In contrast, children and adults exhibit much stronger\nanalogical reasoning at all three stages. Additionally, the strongest tested\nmodel, GPT-4V, performs better in tasks involving simple visual attributes like\ncolor and size, correlating with quicker human adult response times.\nConversely, more complex tasks such as number, rotation, and reflection, which\nnecessitate extensive cognitive processing and understanding of the 3D physical\nworld, present more significant challenges. Altogether, these findings\nhighlight the limitations of training models on data that primarily consists of\n2D images and text.", "comment": "9 pages. For the KiVA benchmark, see https://github.com/ey242/KiVA", "links": []}
+{"entry_id": "2407.07053", "title": "Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model", "authors": ["Wenqi Zhang", "Zhenglin Cheng", "Yuanyu He", "Mengna Wang", "Yongliang Shen", "Zeqi Tan", "Guiyang Hou", "Mingqian He", "Yanna Ma", "Weiming Lu", "Yueting Zhuang"], "published": "2024-07-09 17:18:27", "updated": "2024-07-23 17:12:12", "summary": "Although most current large multimodal models (LMMs) can already understand\nphotos of natural scenes and portraits, their understanding of abstract images,\ne.g., charts, maps, or layouts, and visual reasoning capabilities remains quite\nrudimentary. They often struggle with simple daily tasks, such as reading time\nfrom a clock, understanding a flowchart, or planning a route using a road map.\nIn light of this, we design a multi-modal self-instruct, utilizing large\nlanguage models and their code capabilities to synthesize massive abstract\nimages and visual reasoning instructions across daily scenarios. Our strategy\neffortlessly creates a multimodal benchmark with 11,193 instructions for eight\nvisual scenarios: charts, tables, simulated maps, dashboards, flowcharts,\nrelation graphs, floor plans, and visual puzzles. \\textbf{This benchmark,\nconstructed with simple lines and geometric elements, exposes the shortcomings\nof most advanced LMMs} like Claude-3.5-Sonnet and GPT-4o in abstract image\nunderstanding, spatial relations reasoning, and visual element induction.\nBesides, to verify the quality of our synthetic data, we fine-tune an LMM using\n62,476 synthetic chart, table and road map instructions. The results\ndemonstrate improved chart understanding and map navigation performance, and\nalso demonstrate potential benefits for other visual reasoning tasks. Our code\nis available at: \\url{https://github.com/zwq2018/Multi-modal-Self-instruct}.", "comment": "code: https://github.com/zwq2018/Multi-modal-Self-instruct dataset:\n https://huggingface.co/datasets/zwq2018/Multi-modal-Self-instruct\n Leaderboard: https://multi-modal-self-instruct.github.io/", "links": []}
+{"entry_id": "2403.16921", "title": "PropTest: Automatic Property Testing for Improved Visual Programming", "authors": ["Jaywon Koo", "Ziyan Yang", "Paola Cascante-Bonilla", "Baishakhi Ray", "Vicente Ordonez"], "published": "2024-03-25 16:39:15", "updated": "2024-07-22 23:21:33", "summary": "Visual Programming has recently emerged as an alternative to end-to-end\nblack-box visual reasoning models. This type of method leverages Large Language\nModels (LLMs) to generate the source code for an executable computer program\nthat solves a given problem. This strategy has the advantage of offering an\ninterpretable reasoning path and does not require finetuning a model with\ntask-specific data. We propose PropTest, a general strategy that improves\nvisual programming by further using an LLM to generate code that tests for\nvisual properties in an initial round of proposed solutions. Our method\ngenerates tests for data-type consistency, output syntax, and semantic\nproperties. PropTest achieves comparable results to state-of-the-art methods\nwhile using publicly available LLMs. This is demonstrated across different\nbenchmarks on visual question answering and referring expression comprehension.\nParticularly, PropTest improves ViperGPT by obtaining 46.1\\% accuracy (+6.0\\%)\non GQA using Llama3-8B and 59.5\\% (+8.1\\%) on RefCOCO+ using CodeLlama-34B.", "comment": "Project Page: https://jaywonkoo17.github.io/PropTest/", "links": []}
+{"entry_id": "2407.02392", "title": "TokenPacker: Efficient Visual Projector for Multimodal LLM", "authors": ["Wentong Li", "Yuqian Yuan", "Jian Liu", "Dongqi Tang", "Song Wang", "Jianke Zhu", "Lei Zhang"], "published": "2024-07-02 16:10:55", "updated": "2024-07-22 12:55:46", "summary": "The visual projector serves as an essential bridge between the visual encoder\nand the Large Language Model (LLM) in a Multimodal LLM (MLLM). Typically, MLLMs\nadopt a simple MLP to preserve all visual contexts via one-to-one\ntransformation. However, the visual tokens are redundant and can be\nconsiderably increased when dealing with high-resolution images, impairing the\nefficiency of MLLMs significantly. Some recent works have introduced resampler\nor abstractor to reduce the number of resulting visual tokens. Unfortunately,\nthey fail to capture finer details and undermine the visual reasoning\ncapabilities of MLLMs. In this work, we propose a novel visual projector, which\nadopts a coarse-to-fine scheme to inject the enriched characteristics to\ngenerate the condensed visual tokens. In specific, we first interpolate the\nvisual features as a low-resolution point query, providing the overall visual\nrepresentation as the foundation. Then, we introduce a region-to-point\ninjection module that utilizes high-resolution, multi-level region-based cues\nas fine-grained reference keys and values, allowing them to be fully absorbed\nwithin the corresponding local context region. This step effectively updates\nthe coarse point query, transforming it into an enriched one for the subsequent\nLLM reasoning. Extensive experiments demonstrate that our approach compresses\nthe visual tokens by 75%~89%, while achieves comparable or even better\nperformance across diverse benchmarks with significantly higher efficiency. The\nsource codes can be found at https://github.com/CircleRadon/TokenPacker.", "comment": "16 pages, Codes:https://github.com/CircleRadon/TokenPacker", "links": []}
+{"entry_id": "2403.12884", "title": "HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning", "authors": ["Fucai Ke", "Zhixi Cai", "Simindokht Jahangard", "Weiqing Wang", "Pari Delir Haghighi", "Hamid Rezatofighi"], "published": "2024-03-19 16:31:30", "updated": "2024-07-21 08:48:55", "summary": "Recent advances in visual reasoning (VR), particularly with the aid of Large\nVision-Language Models (VLMs), show promise but require access to large-scale\ndatasets and face challenges such as high computational costs and limited\ngeneralization capabilities. Compositional visual reasoning approaches have\nemerged as effective strategies; however, they heavily rely on the commonsense\nknowledge encoded in Large Language Models (LLMs) to perform planning,\nreasoning, or both, without considering the effect of their decisions on the\nvisual reasoning process, which can lead to errors or failed procedures. To\naddress these challenges, we introduce HYDRA, a multi-stage dynamic\ncompositional visual reasoning framework designed for reliable and\nincrementally progressive general reasoning. HYDRA integrates three essential\nmodules: a planner, a Reinforcement Learning (RL) agent serving as a cognitive\ncontroller, and a reasoner. The planner and reasoner modules utilize an LLM to\ngenerate instruction samples and executable code from the selected instruction,\nrespectively, while the RL agent dynamically interacts with these modules,\nmaking high-level decisions on selection of the best instruction sample given\ninformation from the historical state stored through a feedback loop. This\nadaptable design enables HYDRA to adjust its actions based on previous feedback\nreceived during the reasoning process, leading to more reliable reasoning\noutputs and ultimately enhancing its overall effectiveness. Our framework\ndemonstrates state-of-the-art performance in various VR tasks on four different\nwidely-used datasets.", "comment": "Accepted by ECCV2024. Project page: https://hydra-vl4ai.github.io/", "links": []}
+{"entry_id": "2407.14834", "title": "Can VLMs be used on videos for action recognition? LLMs are Visual Reasoning Coordinators", "authors": ["Harsh Lunia"], "published": "2024-07-20 10:26:28", "updated": "2024-07-20 10:26:28", "summary": "Recent advancements have introduced multiple vision-language models (VLMs)\ndemonstrating impressive commonsense reasoning across various domains. Despite\ntheir individual capabilities, the potential of synergizing these complementary\nVLMs remains underexplored. The Cola Framework addresses this by showcasing how\na large language model (LLM) can efficiently coordinate multiple VLMs through\nnatural language communication, leveraging their distinct strengths. We have\nverified this claim on the challenging A-OKVQA dataset, confirming the\neffectiveness of such coordination. Building on this, our study investigates\nwhether the same methodology can be applied to surveillance videos for action\nrecognition. Specifically, we explore if leveraging the combined knowledge base\nof VLMs and LLM can effectively deduce actions from a video when presented with\nonly a few selectively important frames and minimal temporal information. Our\nexperiments demonstrate that LLM, when coordinating different VLMs, can\nsuccessfully recognize patterns and deduce actions in various scenarios despite\nthe weak temporal signals. However, our findings suggest that to enhance this\napproach as a viable alternative solution, integrating a stronger temporal\nsignal and exposing the models to slightly more frames would be beneficial.", "comment": "LLMs, VLMs, Action Recognition", "links": []}
+{"entry_id": "2407.14133", "title": "I Know About \"Up\"! Enhancing Spatial Reasoning in Visual Language Models Through 3D Reconstruction", "authors": ["Zaiqiao Meng", "Hao Zhou", "Yifang Chen"], "published": "2024-07-19 09:03:30", "updated": "2024-07-19 09:03:30", "summary": "Visual Language Models (VLMs) are essential for various tasks, particularly\nvisual reasoning tasks, due to their robust multi-modal information\nintegration, visual reasoning capabilities, and contextual awareness. However,\nexisting \\VLMs{}' visual spatial reasoning capabilities are often inadequate,\nstruggling even with basic tasks such as distinguishing left from right. To\naddress this, we propose the \\ours{} model, designed to enhance the visual\nspatial reasoning abilities of VLMS. ZeroVLM employs Zero-1-to-3, a 3D\nreconstruction model for obtaining different views of the input images and\nincorporates a prompting mechanism to further improve visual spatial reasoning.\nExperimental results on four visual spatial reasoning datasets show that our\n\\ours{} achieves up to 19.48% accuracy improvement, which indicates the\neffectiveness of the 3D reconstruction and prompting mechanisms of our ZeroVLM.", "comment": null, "links": []}
+{"entry_id": "2303.10428", "title": "RCA: Region Conditioned Adaptation for Visual Abductive Reasoning", "authors": ["Hao Zhang", "Yeo Keat Ee", "Basura Fernando"], "published": "2023-03-18 14:46:44", "updated": "2024-07-19 04:52:07", "summary": "Visual abductive reasoning aims to make likely explanations for visual\nobservations. We propose a simple yet effective Region Conditioned Adaptation,\na hybrid parameter-efficient fine-tuning method that equips the frozen CLIP\nwith the ability to infer explanations from local visual cues. We encode\n``local hints'' and ``global contexts'' into visual prompts of the CLIP model\nseparately at fine and coarse-grained levels. Adapters are used for fine-tuning\nCLIP models for downstream tasks and we design a new attention adapter, that\ndirectly steers the focus of the attention map with trainable query and key\nprojections of a frozen CLIP model. Finally, we train our new model with a\nmodified contrastive loss to regress the visual feature simultaneously toward\nfeatures of literal description and plausible explanations. The loss enables\nCLIP to maintain both perception and reasoning abilities. Experiments on the\nSherlock visual abductive reasoning benchmark show that the RCA significantly\noutstands previous SOTAs, ranking the \\nth{1} on the leaderboards (e.g., Human\nAcc: RCA 31.74 \\textit{vs} CPT-CLIP 29.58, higher =better). We also validate\nthe RCA is generalizable to local perception benchmarks like RefCOCO. We\nopen-source our project at\n\\textit{\\color{magenta}{\\url{https://github.com/LUNAProject22/RPA}}}.", "comment": "13 pages, 11 figures, ACM Multimedia 2024", "links": []}
+{"entry_id": "2407.13851", "title": "X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs", "authors": ["Sirnam Swetha", "Jinyu Yang", "Tal Neiman", "Mamshad Nayeem Rizve", "Son Tran", "Benjamin Yao", "Trishul Chilimbi", "Mubarak Shah"], "published": "2024-07-18 18:39:54", "updated": "2024-07-18 18:39:54", "summary": "Recent advancements in Multimodal Large Language Models (MLLMs) have\nrevolutionized the field of vision-language understanding by integrating visual\nperception capabilities into Large Language Models (LLMs). The prevailing trend\nin this field involves the utilization of a vision encoder derived from\nvision-language contrastive learning (CL), showing expertise in capturing\noverall representations while facing difficulties in capturing detailed local\npatterns. In this work, we focus on enhancing the visual representations for\nMLLMs by combining high-frequency and detailed visual representations, obtained\nthrough masked image modeling (MIM), with semantically-enriched low-frequency\nrepresentations captured by CL. To achieve this goal, we introduce X-Former\nwhich is a lightweight transformer module designed to exploit the complementary\nstrengths of CL and MIM through an innovative interaction mechanism.\nSpecifically, X-Former first bootstraps vision-language representation learning\nand multimodal-to-multimodal generative learning from two frozen vision\nencoders, i.e., CLIP-ViT (CL-based) and MAE-ViT (MIM-based). It further\nbootstraps vision-to-language generative learning from a frozen LLM to ensure\nvisual features from X-Former can be interpreted by the LLM. To demonstrate the\neffectiveness of our approach, we assess its performance on tasks demanding\ndetailed visual understanding. Extensive evaluations indicate that X-Former\nexcels in visual reasoning tasks involving both structural and semantic\ncategories in the GQA dataset. Assessment on fine-grained visual perception\nbenchmark further confirms its superior capabilities in visual understanding.", "comment": "Accepted at ECCV2024", "links": []}
+{"entry_id": "2407.13382", "title": "Open-World Visual Reasoning by a Neuro-Symbolic Program of Zero-Shot Symbols", "authors": ["Gertjan Burghouts", "Fieke Hillerström", "Erwin Walraven", "Michael van Bekkum", "Frank Ruis", "Joris Sijs", "Jelle van Mil", "Judith Dijk"], "published": "2024-07-18 10:40:22", "updated": "2024-07-18 10:40:22", "summary": "We consider the problem of finding spatial configurations of multiple objects\nin images, e.g., a mobile inspection robot is tasked to localize abandoned\ntools on the floor. We define the spatial configuration of objects by\nfirst-order logic in terms of relations and attributes. A neuro-symbolic\nprogram matches the logic formulas to probabilistic object proposals for the\ngiven image, provided by language-vision models by querying them for the\nsymbols. This work is the first to combine neuro-symbolic programming\n(reasoning) and language-vision models (learning) to find spatial\nconfigurations of objects in images in an open world setting. We show the\neffectiveness by finding abandoned tools on floors and leaking pipes. We find\nthat most prediction errors are due to biases in the language-vision model.", "comment": "12 pages", "links": []}
+{"entry_id": "2401.13311", "title": "ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models", "authors": ["Rohan Wadhawan", "Hritik Bansal", "Kai-Wei Chang", "Nanyun Peng"], "published": "2024-01-24 09:07:11", "updated": "2024-07-16 03:36:29", "summary": "Many real-world tasks require an agent to reason jointly over text and visual\nobjects, (e.g., navigating in public spaces), which we refer to as\ncontext-sensitive text-rich visual reasoning. Specifically, these tasks require\nan understanding of the context in which the text interacts with visual\nelements within an image. However, there is a lack of existing datasets to\nbenchmark the state-of-the-art multimodal models' capability on\ncontext-sensitive text-rich visual reasoning. In this paper, we introduce\nConTextual, a novel dataset featuring human-crafted instructions that require\ncontext-sensitive reasoning for text-rich images. We conduct experiments to\nassess the performance of 14 foundation models (GPT-4V, Gemini-Pro-Vision,\nLLaVA-Next) and establish a human performance baseline. Further, we perform\nhuman evaluations of the model responses and observe a significant performance\ngap of 30.8% between GPT-4V (the current best-performing Large Multimodal\nModel) and human performance. Our fine-grained analysis reveals that GPT-4V\nencounters difficulties interpreting time-related data and infographics.\nHowever, it demonstrates proficiency in comprehending abstract visual contexts\nsuch as memes and quotes. Finally, our qualitative analysis uncovers various\nfactors contributing to poor performance including lack of precise visual\nperception and hallucinations. Our dataset, code, and leaderboard can be found\non the project page https://con-textual.github.io/", "comment": null, "links": []}
+{"entry_id": "2407.10380", "title": "NTSEBENCH: Cognitive Reasoning Benchmark for Vision Language Models", "authors": ["Pranshu Pandya", "Agney S Talwarr", "Vatsal Gupta", "Tushar Kataria", "Vivek Gupta", "Dan Roth"], "published": "2024-07-15 01:21:56", "updated": "2024-07-15 01:21:56", "summary": "Cognitive textual and visual reasoning tasks, such as puzzles, series, and\nanalogies, demand the ability to quickly reason, decipher, and evaluate\npatterns both textually and spatially. While LLMs and VLMs, through extensive\ntraining on large amounts of human-curated data, have attained a high level of\npseudo-human intelligence in some common sense reasoning tasks, they still\nstruggle with more complex reasoning tasks that require cognitive\nunderstanding. In this work, we introduce a new dataset, NTSEBench, designed to\nevaluate the cognitive multi-modal reasoning and problem-solving skills of\nlarge models. The dataset comprises 2,728 multiple-choice questions comprising\nof a total of 4,642 images across 26 categories sampled from the NTSE\nexamination conducted nationwide in India, featuring both visual and textual\ngeneral aptitude questions that do not rely on rote learning. We establish\nbaselines on the dataset using state-of-the-art LLMs and VLMs. To facilitate a\ncomparison between open source and propriety models, we propose four distinct\nmodeling strategies to handle different modalities (text and images) in the\ndataset instances.", "comment": "15 pages, 2 figures, 5 tables", "links": []}
+{"entry_id": "2407.10341", "title": "Affordance-Guided Reinforcement Learning via Visual Prompting", "authors": ["Olivia Y. Lee", "Annie Xie", "Kuan Fang", "Karl Pertsch", "Chelsea Finn"], "published": "2024-07-14 21:41:29", "updated": "2024-07-14 21:41:29", "summary": "Robots equipped with reinforcement learning (RL) have the potential to learn\na wide range of skills solely from a reward signal. However, obtaining a robust\nand dense reward signal for general manipulation tasks remains a challenge.\nExisting learning-based approaches require significant data, such as\ndemonstrations or examples of success and failure, to learn task-specific\nreward functions. Recently, there is also a growing adoption of large\nmulti-modal foundation models for robotics. These models can perform visual\nreasoning in physical contexts and generate coarse robot motions for various\nmanipulation tasks. Motivated by this range of capability, in this work, we\npropose and study rewards shaped by vision-language models (VLMs).\nState-of-the-art VLMs have demonstrated an impressive ability to reason about\naffordances through keypoints in zero-shot, and we leverage this to define\ndense rewards for robotic learning. On a real-world manipulation task specified\nby natural language description, we find that these rewards improve the sample\nefficiency of autonomous RL and enable successful completion of the task in 20K\nonline finetuning steps. Additionally, we demonstrate the robustness of the\napproach to reductions in the number of in-domain demonstrations used for\npretraining, reaching comparable performance in 35K online finetuning steps.", "comment": "15 pages, 9 figures. Robotics: Science and Systems (RSS) 2024, Task\n Specification for General-Purpose Intelligent Robots & Lifelong Robot\n Learning Workshops", "links": []}
+{"entry_id": "2306.06094", "title": "Leveraging Large Language Models for Scalable Vector Graphics-Driven Image Understanding", "authors": ["Mu Cai", "Zeyi Huang", "Yuheng Li", "Utkarsh Ojha", "Haohan Wang", "Yong Jae Lee"], "published": "2023-06-09 17:57:01", "updated": "2024-07-11 17:59:53", "summary": "Large language models (LLMs) have made significant advancements in natural\nlanguage understanding. However, through that enormous semantic representation\nthat the LLM has learnt, is it somehow possible for it to understand images as\nwell? This work investigates this question. To enable the LLM to process\nimages, we convert them into a representation given by Scalable Vector Graphics\n(SVG). To study what the LLM can do with this XML-based textual description of\nimages, we test the LLM on three broad computer vision tasks: (i) visual\nreasoning and question answering, (ii) image classification under distribution\nshift, few-shot learning, and (iii) generating new images using visual\nprompting. Even though we do not naturally associate LLMs with any visual\nunderstanding capabilities, our results indicate that the LLM can often do a\ndecent job in many of these tasks, potentially opening new avenues for research\ninto LLMs' ability to understand image data. Our code, data, and models can be\nfound here https://github.com/mu-cai/svg-llm.", "comment": null, "links": []}
+{"entry_id": "2407.08672", "title": "NODE-Adapter: Neural Ordinary Differential Equations for Better Vision-Language Reasoning", "authors": ["Yi Zhang", "Chun-Wun Cheng", "Ke Yu", "Zhihai He", "Carola-Bibiane Schönlieb", "Angelica I. Aviles-Rivero"], "published": "2024-07-11 17:04:19", "updated": "2024-07-11 17:04:19", "summary": "In this paper, we consider the problem of prototype-based vision-language\nreasoning problem. We observe that existing methods encounter three major\nchallenges: 1) escalating resource demands and prolonging training times, 2)\ncontending with excessive learnable parameters, and 3) fine-tuning based only\non a single modality. These challenges will hinder their capability to adapt\nVision-Language Models (VLMs) to downstream tasks. Motivated by this critical\nobservation, we propose a novel method called NODE-Adapter, which utilizes\nNeural Ordinary Differential Equations for better vision-language reasoning. To\nfully leverage both visual and textual modalities and estimate class prototypes\nmore effectively and accurately, we divide our method into two stages:\ncross-modal prototype construction and cross-modal prototype optimization using\nneural ordinary differential equations. Specifically, we exploit VLM to encode\nhand-crafted prompts into textual features and few-shot support images into\nvisual features. Then, we estimate the textual prototype and visual prototype\nby averaging the textual features and visual features, respectively, and\nadaptively combine the textual prototype and visual prototype to construct the\ncross-modal prototype. To alleviate the prototype bias, we then model the\nprototype optimization process as an initial value problem with Neural ODEs to\nestimate the continuous gradient flow. Our extensive experimental results,\nwhich cover few-shot classification, domain generalization, and visual\nreasoning on human-object interaction, demonstrate that the proposed method\nsignificantly outperforms existing state-of-the-art approaches.", "comment": null, "links": []}
+{"entry_id": "2406.09403", "title": "Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models", "authors": ["Yushi Hu", "Weijia Shi", "Xingyu Fu", "Dan Roth", "Mari Ostendorf", "Luke Zettlemoyer", "Noah A Smith", "Ranjay Krishna"], "published": "2024-06-13 17:59:31", "updated": "2024-07-10 18:09:56", "summary": "Humans draw to facilitate reasoning: we draw auxiliary lines when solving\ngeometry problems; we mark and circle when reasoning on maps; we use sketches\nto amplify our ideas and relieve our limited-capacity working memory. However,\nsuch actions are missing in current multimodal language models (LMs). Current\nchain-of-thought and tool-use paradigms only use text as intermediate reasoning\nsteps. In this work, we introduce Sketchpad, a framework that gives multimodal\nLMs a visual sketchpad and tools to draw on the sketchpad. The LM conducts\nplanning and reasoning according to the visual artifacts it has drawn.\nDifferent from prior work, which uses text-to-image models to enable LMs to\ndraw, Sketchpad enables LMs to draw with lines, boxes, marks, etc., which is\ncloser to human sketching and better facilitates reasoning. Sketchpad can also\nuse specialist vision models during the sketching process (e.g., draw bounding\nboxes with object detection models, draw masks with segmentation models), to\nfurther enhance visual perception and reasoning. We experiment with a wide\nrange of math tasks (including geometry, functions, graphs, and chess) and\ncomplex visual reasoning tasks. Sketchpad substantially improves performance on\nall tasks over strong base models with no sketching, yielding an average gain\nof 12.7% on math tasks, and 8.6% on vision tasks. GPT-4o with Sketchpad sets a\nnew state of the art on all tasks, including V*Bench (80.3%), BLINK spatial\nreasoning (83.9%), and visual correspondence (80.8%). All codes and data are in\nhttps://visualsketchpad.github.io/.", "comment": "Project and codes url: https://visualsketchpad.github.io/", "links": []}
+{"entry_id": "2407.02688", "title": "Funny-Valen-Tine: Planning Solution Distribution Enhances Machine Abstract Reasoning Ability", "authors": ["Ruizhuo Song", "Beiming Yuan"], "published": "2024-07-02 22:04:20", "updated": "2024-07-07 12:25:33", "summary": "Visual abstract reasoning problems hold immense importance in the field of\nimage processing. Both Bongard-Logo and Raven's Progressive Matrices (RPM)\nbelong to this domain, with Bongard-Logo categorized as image clustering\nreasoning and RPM involving image progression pattern reasoning. This paper\nintroduces Valen, a novel baseline model under probabilistic highlighting\nmodels. Valen exhibits remarkable performance in solving both RPM and\nBongard-Logo problems, offering a versatile solution. Our investigation delves\ninto the underlying mechanisms of probability-highlighting solvers, realizing\nthey approximate solutions to reasoning problem instances as distributions\ndelineated by primary and auxiliary samples. We propose that the learning\nobjective is not the distribution of correct solutions but one defined by both\nprimary and auxiliary samples. To bridge discrepancies, we introduced the Tine\nmethod, an adversarial learning-based approach to assist Valen in estimating a\nsolution distribution closer to the correct one, albeit with issues like\nunstable training. Reflecting on Tine, we propose modeling the sample\ndistribution of reasoning problems as a mixture of Gaussian distributions,\nleading to the Funny method. This effectively enables Valen to capture the true\nform of the correct solution distribution. Furthermore, we designed the SBR\nmethod to model the distribution of progressive patterns representation\nsimilarly. Overall, the Funny, Tine, and SBR methods significantly improve\nValen's performance, providing new ideas and methods for studying visual\nabstract reasoning problems.", "comment": "14 pages, 20 figures, 3 tables", "links": []}
+{"entry_id": "2407.01284", "title": "We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?", "authors": ["Runqi Qiao", "Qiuna Tan", "Guanting Dong", "Minhui Wu", "Chong Sun", "Xiaoshuai Song", "Zhuoma GongQue", "Shanglin Lei", "Zhe Wei", "Miaoxuan Zhang", "Runfeng Qiao", "Yifan Zhang", "Xiao Zong", "Yida Xu", "Muxi Diao", "Zhimin Bao", "Chen Li", "Honggang Zhang"], "published": "2024-07-01 13:39:08", "updated": "2024-07-01 13:39:08", "summary": "Visual mathematical reasoning, as a fundamental visual reasoning ability, has\nreceived widespread attention from the Large Multimodal Models (LMMs)\ncommunity. Existing benchmarks, such as MathVista and MathVerse, focus more on\nthe result-oriented performance but neglect the underlying principles in\nknowledge acquisition and generalization. Inspired by human-like mathematical\nreasoning, we introduce WE-MATH, the first benchmark specifically designed to\nexplore the problem-solving principles beyond end-to-end performance. We\nmeticulously collect and categorize 6.5K visual math problems, spanning 67\nhierarchical knowledge concepts and five layers of knowledge granularity. We\ndecompose composite problems into sub-problems according to the required\nknowledge concepts and introduce a novel four-dimensional metric, namely\nInsufficient Knowledge (IK), Inadequate Generalization (IG), Complete Mastery\n(CM), and Rote Memorization (RM), to hierarchically assess inherent issues in\nLMMs' reasoning process. With WE-MATH, we conduct a thorough evaluation of\nexisting LMMs in visual mathematical reasoning and reveal a negative\ncorrelation between solving steps and problem-specific performance. We confirm\nthe IK issue of LMMs can be effectively improved via knowledge augmentation\nstrategies. More notably, the primary challenge of GPT-4o has significantly\ntransitioned from IK to IG, establishing it as the first LMM advancing towards\nthe knowledge generalization stage. In contrast, other LMMs exhibit a marked\ninclination towards Rote Memorization - they correctly solve composite problems\ninvolving multiple knowledge concepts yet fail to answer sub-problems. We\nanticipate that WE-MATH will open new pathways for advancements in visual\nmathematical reasoning for LMMs. The WE-MATH data and evaluation code are\navailable at https://github.com/We-Math/We-Math.", "comment": "Work in progress", "links": []}
+{"entry_id": "2310.04671", "title": "Exploring the Potential of Multi-Modal AI for Driving Hazard Prediction", "authors": ["Korawat Charoenpitaks", "Van-Quang Nguyen", "Masanori Suganuma", "Masahiro Takahashi", "Ryoma Niihara", "Takayuki Okatani"], "published": "2023-10-07 03:16:30", "updated": "2024-07-01 09:29:39", "summary": "This paper addresses the problem of predicting hazards that drivers may\nencounter while driving a car. We formulate it as a task of anticipating\nimpending accidents using a single input image captured by car dashcams. Unlike\nexisting approaches to driving hazard prediction that rely on computational\nsimulations or anomaly detection from videos, this study focuses on high-level\ninference from static images. The problem needs predicting and reasoning about\nfuture events based on uncertain observations, which falls under visual\nabductive reasoning. To enable research in this understudied area, a new\ndataset named the DHPR (Driving Hazard Prediction and Reasoning) dataset is\ncreated. The dataset consists of 15K dashcam images of street scenes, and each\nimage is associated with a tuple containing car speed, a hypothesized hazard\ndescription, and visual entities present in the scene. These are annotated by\nhuman annotators, who identify risky scenes and provide descriptions of\npotential accidents that could occur a few seconds later. We present several\nbaseline methods and evaluate their performance on our dataset, identifying\nremaining issues and discussing future directions. This study contributes to\nthe field by introducing a novel problem formulation and dataset, enabling\nresearchers to explore the potential of multi-modal AI for driving hazard\nprediction.", "comment": "Main Paper: 11 pages, Supplementary Materials: 25 pages", "links": []}
+{"entry_id": "2406.12272", "title": "Slot State Space Models", "authors": ["Jindong Jiang", "Fei Deng", "Gautam Singh", "Minseung Lee", "Sungjin Ahn"], "published": "2024-06-18 04:59:14", "updated": "2024-06-30 22:25:01", "summary": "Recent State Space Models (SSMs) such as S4, S5, and Mamba have shown\nremarkable computational benefits in long-range temporal dependency modeling.\nHowever, in many sequence modeling problems, the underlying process is\ninherently modular and it is of interest to have inductive biases that mimic\nthis modular structure. In this paper, we introduce SlotSSMs, a novel framework\nfor incorporating independent mechanisms into SSMs to preserve or encourage\nseparation of information. Unlike conventional SSMs that maintain a monolithic\nstate vector, SlotSSMs maintains the state as a collection of multiple vectors\ncalled slots. Crucially, the state transitions are performed independently per\nslot with sparse interactions across slots implemented via the bottleneck of\nself-attention. In experiments, we evaluate our model in object-centric video\nunderstanding, 3D visual reasoning, and video prediction tasks, which involve\nmodeling multiple objects and their long-range temporal dependencies. We find\nthat our proposed design offers substantial performance gains over existing\nsequence modeling methods.", "comment": null, "links": []}
{"entry_id": "2406.19934", "title": "From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis", "authors": ["Chuanqi Cheng", "Jian Guan", "Wei Wu", "Rui Yan"], "published": "2024-06-28 14:04:10", "updated": "2024-06-28 14:04:10", "summary": "We explore multi-step reasoning in vision-language models (VLMs). The problem\nis challenging, as reasoning data consisting of multiple steps of visual and\nlanguage processing are barely available. To overcome the challenge, we first\nintroduce a least-to-most visual reasoning paradigm, which interleaves steps of\ndecomposing a question into sub-questions and invoking external tools for\nresolving sub-questions. Based on the paradigm, we further propose a novel data\nsynthesis approach that can automatically create questions and multi-step\nreasoning paths for an image in a bottom-up manner. Our approach divides the\ncomplex synthesis task into a few simple sub-tasks, and (almost entirely)\nrelies on open-sourced models to accomplish the sub-tasks. Therefore, the\nentire synthesis process is reproducible and cost-efficient, and the\nsynthesized data is quality guaranteed. With the approach, we construct $50$k\nvisual reasoning examples. Then, we develop a visual reasoner through\nsupervised fine-tuning, which is capable of generally enhancing the reasoning\nabilities of a wide range of existing VLMs in a plug-and-play fashion.\nExtensive experiments indicate that the visual reasoner can consistently and\nsignificantly improve four VLMs on four VQA benchmarks. Our code and dataset\nare available at https://github.com/steven-ccq/VisualReasoner.", "comment": null, "links": []}
{"entry_id": "2406.19693", "title": "MMRo: Are Multimodal LLMs Eligible as the Brain for In-Home Robotics?", "authors": ["Jinming Li", "Yichen Zhu", "Zhiyuan Xu", "Jindong Gu", "Minjie Zhu", "Xin Liu", "Ning Liu", "Yaxin Peng", "Feifei Feng", "Jian Tang"], "published": "2024-06-28 07:09:06", "updated": "2024-06-28 07:09:06", "summary": "It is fundamentally challenging for robots to serve as useful assistants in\nhuman environments because this requires addressing a spectrum of sub-problems\nacross robotics, including perception, language understanding, reasoning, and\nplanning. The recent advancements in Multimodal Large Language Models (MLLMs)\nhave demonstrated their exceptional abilities in solving complex mathematical\nproblems, mastering commonsense and abstract reasoning. This has led to the\nrecent utilization of MLLMs as the brain in robotic systems, enabling these\nmodels to conduct high-level planning prior to triggering low-level control\nactions for task execution. However, it remains uncertain whether existing\nMLLMs are reliable in serving the brain role of robots. In this study, we\nintroduce the first benchmark for evaluating Multimodal LLM for Robotic (MMRo)\nbenchmark, which tests the capability of MLLMs for robot applications.\nSpecifically, we identify four essential capabilities perception, task\nplanning, visual reasoning, and safety measurement that MLLMs must possess to\nqualify as the robot's central processing unit. We have developed several\nscenarios for each capability, resulting in a total of 14 metrics for\nevaluation. We present experimental results for various MLLMs, including both\ncommercial and open-source models, to assess the performance of existing\nsystems. Our findings indicate that no single model excels in all areas,\nsuggesting that current MLLMs are not yet trustworthy enough to serve as the\ncognitive core for robots. Our data can be found in\nhttps://mm-robobench.github.io/.", "comment": null, "links": []}
{"entry_id": "2406.13444", "title": "VDebugger: Harnessing Execution Feedback for Debugging Visual Programs", "authors": ["Xueqing Wu", "Zongyu Lin", "Songyan Zhao", "Te-Lin Wu", "Pan Lu", "Nanyun Peng", "Kai-Wei Chang"], "published": "2024-06-19 11:09:16", "updated": "2024-06-27 17:09:24", "summary": "Visual programs are executable code generated by large language models to\naddress visual reasoning problems. They decompose complex questions into\nmultiple reasoning steps and invoke specialized models for each step to solve\nthe problems. However, these programs are prone to logic errors, with our\npreliminary evaluation showing that 58% of the total errors are caused by\nprogram logic errors. Debugging complex visual programs remains a major\nbottleneck for visual reasoning. To address this, we introduce VDebugger, a\nnovel critic-refiner framework trained to localize and debug visual programs by\ntracking execution step by step. VDebugger identifies and corrects program\nerrors leveraging detailed execution feedback, improving interpretability and\naccuracy. The training data is generated through an automated pipeline that\ninjects errors into correct visual programs using a novel mask-best decoding\ntechnique. Evaluations on six datasets demonstrate VDebugger's effectiveness,\nshowing performance improvements of up to 3.2% in downstream task accuracy.\nFurther studies show VDebugger's ability to generalize to unseen tasks,\nbringing a notable improvement of 2.3% on the unseen COVR task. Code, data and\nmodels are made publicly available at https://github.com/shirley-wu/vdebugger/", "comment": "update reference", "links": []}
@@ -5,7 +31,7 @@
{"entry_id": "2406.19217", "title": "Think Step by Step: Chain-of-Gesture Prompting for Error Detection in Robotic Surgical Videos", "authors": ["Zhimin Shao", "Jialang Xu", "Danail Stoyanov", "Evangelos B. Mazomenos", "Yueming Jin"], "published": "2024-06-27 14:43:50", "updated": "2024-06-27 14:43:50", "summary": "Despite significant advancements in robotic systems and surgical data\nscience, ensuring safe and optimal execution in robot-assisted minimally\ninvasive surgery (RMIS) remains a complex challenge. Current surgical error\ndetection methods involve two parts: identifying surgical gestures and then\ndetecting errors within each gesture clip. These methods seldom consider the\nrich contextual and semantic information inherent in surgical videos, limiting\ntheir performance due to reliance on accurate gesture identification. Motivated\nby the chain-of-thought prompting in natural language processing, this letter\npresents a novel and real-time end-to-end error detection framework,\nChain-of-Thought (COG) prompting, leveraging contextual information from\nsurgical videos. This encompasses two reasoning modules designed to mimic the\ndecision-making processes of expert surgeons. Concretely, we first design a\nGestural-Visual Reasoning module, which utilizes transformer and attention\narchitectures for gesture prompting, while the second, a Multi-Scale Temporal\nReasoning module, employs a multi-stage temporal convolutional network with\nboth slow and fast paths for temporal information extraction. We extensively\nvalidate our method on the public benchmark RMIS dataset JIGSAWS. Our method\nencapsulates the reasoning processes inherent to surgical activities enabling\nit to outperform the state-of-the-art by 4.6% in F1 score, 4.6% in Accuracy,\nand 5.9% in Jaccard index while processing each frame in 6.69 milliseconds on\naverage, demonstrating the great potential of our approach in enhancing the\nsafety and efficacy of RMIS procedures and surgical education. The code will be\navailable.", "comment": "8 pages, 4 figures", "links": []}
{"entry_id": "2406.18925", "title": "Selective Vision is the Challenge for Visual Reasoning: A Benchmark for Visual Argument Understanding", "authors": ["Jiwan Chung", "Sungjae Lee", "Minseo Kim", "Seungju Han", "Ashkan Yousefpour", "Jack Hessel", "Youngjae Yu"], "published": "2024-06-27 06:32:56", "updated": "2024-06-27 06:32:56", "summary": "Visual arguments, often used in advertising or social causes, rely on images\nto persuade viewers to do or believe something. Understanding these arguments\nrequires selective vision: only specific visual stimuli within an image are\nrelevant to the argument, and relevance can only be understood within the\ncontext of a broader argumentative structure. While visual arguments are\nreadily appreciated by human audiences, we ask: are today's AI capable of\nsimilar understanding?\n We collect and release VisArgs, an annotated corpus designed to make explicit\nthe (usually implicit) structures underlying visual arguments. VisArgs includes\n1,611 images accompanied by three types of textual annotations: 5,112 visual\npremises (with region annotations), 5,574 commonsense premises, and reasoning\ntrees connecting them to a broader argument. We propose three tasks over\nVisArgs to probe machine capacity for visual argument understanding:\nlocalization of premises, identification of premises, and deduction of\nconclusions. Experiments demonstrate that 1) machines cannot fully identify the\nrelevant visual cues. The top-performing model, GPT-4-O, achieved an accuracy\nof only 78.5%, whereas humans reached 98.0%. All models showed a performance\ndrop, with an average decrease in accuracy of 19.5%, when the comparison set\nwas changed from objects outside the image to irrelevant objects within the\nimage. Furthermore, 2) this limitation is the greatest factor impacting their\nperformance in understanding visual arguments. Most models improved the most\nwhen given relevant visual premises as additional inputs, compared to other\ninputs, for deducing the conclusion of the visual argument.", "comment": "12 pages, 5 figures", "links": []}
{"entry_id": "2406.18839", "title": "Disentangling Knowledge-based and Visual Reasoning by Question Decomposition in KB-VQA", "authors": ["Elham J. Barezi", "Parisa Kordjamshidi"], "published": "2024-06-27 02:19:38", "updated": "2024-06-27 02:19:38", "summary": "We study the Knowledge-Based visual question-answering problem, for which\ngiven a question, the models need to ground it into the visual modality to find\nthe answer. Although many recent works use question-dependent captioners to\nverbalize the given image and use Large Language Models to solve the VQA\nproblem, the research results show they are not reasonably performing for\nmulti-hop questions. Our study shows that replacing a complex question with\nseveral simpler questions helps to extract more relevant information from the\nimage and provide a stronger comprehension of it. Moreover, we analyze the\ndecomposed questions to find out the modality of the information that is\nrequired to answer them and use a captioner for the visual questions and LLMs\nas a general knowledge source for the non-visual KB-based questions. Our\nresults demonstrate the positive impact of using simple questions before\nretrieving visual or non-visual information. We have provided results and\nanalysis on three well-known VQA datasets including OKVQA, A-OKVQA, and KRVQA,\nand achieved up to 2% improvement in accuracy.", "comment": null, "links": []}
-{"entry_id": "2406.12272", "title": "Slot State Space Models", "authors": ["Jindong Jiang", "Fei Deng", "Gautam Singh", "Minseung Lee", "Sungjin Ahn"], "published": "2024-06-18 04:59:14", "updated": "2024-06-26 03:04:04", "summary": "Recent State Space Models (SSMs) such as S4, S5, and Mamba have shown\nremarkable computational benefits in long-range temporal dependency modeling.\nHowever, in many sequence modeling problems, the underlying process is\ninherently modular and it is of interest to have inductive biases that mimic\nthis modular structure. In this paper, we introduce SlotSSMs, a novel framework\nfor incorporating independent mechanisms into SSMs to preserve or encourage\nseparation of information. Unlike conventional SSMs that maintain a monolithic\nstate vector, SlotSSMs maintains the state as a collection of multiple vectors\ncalled slots. Crucially, the state transitions are performed independently per\nslot with sparse interactions across slots implemented via the bottleneck of\nself-attention. In experiments, we evaluate our model in object-centric video\nunderstanding, 3D visual reasoning, and video prediction tasks, which involve\nmodeling multiple objects and their long-range temporal dependencies. We find\nthat our proposed design offers substantial performance gains over existing\nsequence modeling methods.", "comment": null, "links": []}
+{"entry_id": "2407.00092", "title": "Visual Reasoning and Multi-Agent Approach in Multimodal Large Language Models (MLLMs): Solving TSP and mTSP Combinatorial Challenges", "authors": ["Mohammed Elhenawy", "Ahmad Abutahoun", "Taqwa I. Alhadidi", "Ahmed Jaber", "Huthaifa I. Ashqar", "Shadi Jaradat", "Ahmed Abdelhay", "Sebastien Glaser", "Andry Rakotonirainy"], "published": "2024-06-26 07:12:06", "updated": "2024-06-26 07:12:06", "summary": "Multimodal Large Language Models (MLLMs) harness comprehensive knowledge\nspanning text, images, and audio to adeptly tackle complex problems, including\nzero-shot in-context learning scenarios. This study explores the ability of\nMLLMs in visually solving the Traveling Salesman Problem (TSP) and Multiple\nTraveling Salesman Problem (mTSP) using images that portray point distributions\non a two-dimensional plane. We introduce a novel approach employing multiple\nspecialized agents within the MLLM framework, each dedicated to optimizing\nsolutions for these combinatorial challenges. Our experimental investigation\nincludes rigorous evaluations across zero-shot settings and introduces\ninnovative multi-agent zero-shot in-context scenarios. The results demonstrated\nthat both multi-agent models. Multi-Agent 1, which includes the Initializer,\nCritic, and Scorer agents, and Multi-Agent 2, which comprises only the\nInitializer and Critic agents; significantly improved solution quality for TSP\nand mTSP problems. Multi-Agent 1 excelled in environments requiring detailed\nroute refinement and evaluation, providing a robust framework for sophisticated\noptimizations. In contrast, Multi-Agent 2, focusing on iterative refinements by\nthe Initializer and Critic, proved effective for rapid decision-making\nscenarios. These experiments yield promising outcomes, showcasing the robust\nvisual reasoning capabilities of MLLMs in addressing diverse combinatorial\nproblems. The findings underscore the potential of MLLMs as powerful tools in\ncomputational optimization, offering insights that could inspire further\nadvancements in this promising field. Project link:\nhttps://github.com/ahmed-abdulhuy/Solving-TSP-and-mTSP-Combinatorial-Challenges-using-Visual-Reasoning-and-Multi-Agent-Approach-MLLMs-.git", "comment": null, "links": []}
{"entry_id": "2406.16469", "title": "Evaluating Visual and Cultural Interpretation: The K-Viscuit Benchmark with Human-VLM Collaboration", "authors": ["Yujin Baek", "ChaeHun Park", "Jaeseok Kim", "Yu-Jung Heo", "Du-Seong Chang", "Jaegul Choo"], "published": "2024-06-24 09:18:15", "updated": "2024-06-24 09:18:15", "summary": "To create culturally inclusive vision-language models (VLMs), the foremost\nrequirement is developing a test benchmark that can diagnose the models'\nability to respond to questions reflecting cultural elements. This paper\naddresses the necessity for such benchmarks, noting that existing research has\nrelied on human annotators' manual efforts, which impedes diversity and\nefficiency. We propose a semi-automated pipeline for constructing cultural VLM\nbenchmarks to enhance diversity and efficiency. This pipeline leverages\nhuman-VLM collaboration, where VLMs generate questions based on guidelines,\nhuman-annotated examples, and image-wise relevant knowledge, which are then\nreviewed by native speakers for quality and cultural relevance. The\neffectiveness of our adaptable pipeline is demonstrated through a specific\napplication: creating a dataset tailored to Korean culture, dubbed K-Viscuit.\nThe resulting benchmark features two types of questions: Type 1 questions\nmeasure visual recognition abilities, while Type 2 assess fine-grained visual\nreasoning skills. This ensures a thorough diagnosis of VLM models across\nvarious aspects. Our evaluation using K-Viscuit revealed that open-source\nmodels notably lag behind proprietary models in understanding Korean culture,\nhighlighting areas for improvement. We provided diverse analyses of VLM\nperformance across different cultural aspects. Besides, we explored the\npotential of incorporating external knowledge retrieval to enhance the\ngeneration process, suggesting future directions for improving cultural\ninterpretation ability of VLMs. Our dataset and code will be made publicly\navailable.", "comment": null, "links": []}
{"entry_id": "2406.15955", "title": "Beyond the Doors of Perception: Vision Transformers Represent Relations Between Objects", "authors": ["Michael A. Lepori", "Alexa R. Tartaglini", "Wai Keen Vong", "Thomas Serre", "Brenden M. Lake", "Ellie Pavlick"], "published": "2024-06-22 22:43:10", "updated": "2024-06-22 22:43:10", "summary": "Though vision transformers (ViTs) have achieved state-of-the-art performance\nin a variety of settings, they exhibit surprising failures when performing\ntasks involving visual relations. This begs the question: how do ViTs attempt\nto perform tasks that require computing visual relations between objects? Prior\nefforts to interpret ViTs tend to focus on characterizing relevant low-level\nvisual features. In contrast, we adopt methods from mechanistic\ninterpretability to study the higher-level visual algorithms that ViTs use to\nperform abstract visual reasoning. We present a case study of a fundamental,\nyet surprisingly difficult, relational reasoning task: judging whether two\nvisual entities are the same or different. We find that pretrained ViTs\nfine-tuned on this task often exhibit two qualitatively different stages of\nprocessing despite having no obvious inductive biases to do so: 1) a perceptual\nstage wherein local object features are extracted and stored in a disentangled\nrepresentation, and 2) a relational stage wherein object representations are\ncompared. In the second stage, we find evidence that ViTs can learn to\nrepresent somewhat abstract visual relations, a capability that has long been\nconsidered out of reach for artificial neural networks. Finally, we demonstrate\nthat failure points at either stage can prevent a model from learning a\ngeneralizable solution to our fairly simple tasks. By understanding ViTs in\nterms of discrete processing stages, one can more precisely diagnose and\nrectify shortcomings of existing and future models.", "comment": null, "links": []}
{"entry_id": "2403.03190", "title": "Triple-CFN: Restructuring Concept and Feature Spaces for Enhancing Abstract Reasoning Process", "authors": ["Ruizhuo Song", "Beiming Yuan"], "published": "2024-03-05 18:29:17", "updated": "2024-06-21 10:57:32", "summary": "Visual abstract reasoning poses challenges to AI algorithms, requiring\ncognitive abilities beyond perception. For methodology, this study emphasizes\nthe need to separately extract concepts and features from visual abstract\nreasoning problems, employing the responses of features to concepts as elements\nin the reasoning process. It also advocates for clear concept and feature\nspaces to tackle visual abstract reasoning tasks effectively. For technology,\nwe introduce the Cross-Feature Network (CFN), a framework that separately\nextracts concepts and features from reasoning problems, utilizing their\nresponses as reasoning representations. The CFN integrates a dual\nExpectation-Maximization process to actively seek an ideal concept space for\nproblem-solving, yielding notable results despite limitations in generalization\ntasks. To overcome these limitations, we propose the Triple-CFN, maximizing\nfeature extraction and demonstrating effectiveness in Bongard-Logo and Raven's\nProgressive Matrices (RPM) problems. Additionally, we present Meta Triple-CFN,\nwhich constructs a promising concept space for RPM, ensuring high reasoning\naccuracy and concept interpretability. Furthermore, we design the Re-space\nlayer, defining a clear feature space for (Meta) Triple-CFN, with its unique\nwarm-start process aiding generalization. Overall, this work advances machine\nintelligence through innovative network designs for abstract reasoning.", "comment": "13 pages, 16 figures, 7 tables", "links": []}
@@ -21,11 +47,9 @@
{"entry_id": "2406.11327", "title": "ClawMachine: Fetching Visual Tokens as An Entity for Referring and Grounding", "authors": ["Tianren Ma", "Lingxi Xie", "Yunjie Tian", "Boyu Yang", "Yuan Zhang", "David Doermann", "Qixiang Ye"], "published": "2024-06-17 08:39:16", "updated": "2024-06-17 08:39:16", "summary": "An essential topic for multimodal large language models (MLLMs) is aligning\nvision and language concepts at a finer level. In particular, we devote efforts\nto encoding visual referential information for tasks such as referring and\ngrounding. Existing methods, including proxy encoding and geometry encoding,\nincorporate additional syntax to encode the object's location, bringing extra\nburdens in training MLLMs to communicate between language and vision. This\nstudy presents ClawMachine, offering a new methodology that notates an entity\ndirectly using the visual tokens. It allows us to unify the prompt and answer\nof visual referential tasks without additional syntax. Upon a joint\nvision-language vocabulary, ClawMachine unifies visual referring and grounding\ninto an auto-regressive format and learns with a decoder-only architecture.\nExperiments validate that our model achieves competitive performance across\nvisual referring and grounding tasks with a reduced demand for training data.\nAdditionally, ClawMachine demonstrates a native ability to integrate\nmulti-source information for complex visual reasoning, which prior MLLMs can\nhardly perform without specific adaptions.", "comment": "Project page: https://github.com/martian422/ClawMachine", "links": []}
{"entry_id": "2406.11068", "title": "A Unified View of Abstract Visual Reasoning Problems", "authors": ["Mikołaj Małkiński", "Jacek Mańdziuk"], "published": "2024-06-16 20:52:44", "updated": "2024-06-16 20:52:44", "summary": "The field of Abstract Visual Reasoning (AVR) encompasses a wide range of\nproblems, many of which are inspired by human IQ tests. The variety of AVR\ntasks has resulted in state-of-the-art AVR methods being task-specific\napproaches. Furthermore, contemporary methods consider each AVR problem\ninstance not as a whole, but in the form of a set of individual panels with\nparticular locations and roles (context vs. answer panels) pre-assigned\naccording to the task-specific arrangements. While these highly specialized\napproaches have recently led to significant progress in solving particular AVR\ntasks, considering each task in isolation hinders the development of universal\nlearning systems in this domain. In this paper, we introduce a unified view of\nAVR tasks, where each problem instance is rendered as a single image, with no a\npriori assumptions about the number of panels, their location, or role. The\nmain advantage of the proposed unified view is the ability to develop universal\nlearning models applicable to various AVR tasks. What is more, the proposed\napproach inherently facilitates transfer learning in the AVR domain, as various\ntypes of problems share a common representation. The experiments conducted on\nfour AVR datasets with Raven's Progressive Matrices and Visual Analogy\nProblems, and one real-world visual analogy dataset show that the proposed\nunified representation of AVR tasks poses a challenge to state-of-the-art Deep\nLearning (DL) AVR models and, more broadly, contemporary DL image recognition\nmethods. In order to address this challenge, we introduce the Unified Model for\nAbstract Visual Reasoning (UMAVR) capable of dealing with various types of AVR\nproblems in a unified manner. UMAVR outperforms existing AVR methods in\nselected single-task learning experiments, and demonstrates effective knowledge\nreuse in transfer learning and curriculum learning setups.", "comment": null, "links": []}
{"entry_id": "2406.11061", "title": "Generalization and Knowledge Transfer in Abstract Visual Reasoning Models", "authors": ["Mikołaj Małkiński", "Jacek Mańdziuk"], "published": "2024-06-16 20:26:38", "updated": "2024-06-16 20:26:38", "summary": "We study generalization and knowledge reuse capabilities of deep neural\nnetworks in the domain of abstract visual reasoning (AVR), employing Raven's\nProgressive Matrices (RPMs), a recognized benchmark task for assessing AVR\nabilities. Two knowledge transfer scenarios referring to the I-RAVEN dataset\nare investigated. Firstly, inspired by generalization assessment capabilities\nof the PGM dataset and popularity of I-RAVEN, we introduce\nAttributeless-I-RAVEN, a benchmark with four generalization regimes that allow\nto test generalization of abstract rules applied to held-out attributes.\nSecondly, we construct I-RAVEN-Mesh, a dataset that enriches RPMs with a novel\ncomponent structure comprising line-based patterns, facilitating assessment of\nprogressive knowledge acquisition in transfer learning setting. The developed\nbenchmarks reveal shortcomings of the contemporary deep learning models, which\nwe partly address with Pathways of Normalized Group Convolution (PoNG) model, a\nnovel neural architecture for solving AVR tasks. PoNG excels in both presented\nchallenges, as well as the standard I-RAVEN and PGM setups.", "comment": null, "links": []}
-{"entry_id": "2401.13311", "title": "ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models", "authors": ["Rohan Wadhawan", "Hritik Bansal", "Kai-Wei Chang", "Nanyun Peng"], "published": "2024-01-24 09:07:11", "updated": "2024-06-16 00:38:24", "summary": "Many real-world tasks require an agent to reason jointly over text and visual\nobjects, (e.g., navigating in public spaces), which we refer to as\ncontext-sensitive text-rich visual reasoning. Specifically, these tasks require\nan understanding of the context in which the text interacts with visual\nelements within an image. However, there is a lack of existing datasets to\nbenchmark the state-of-the-art multimodal models' capability on\ncontext-sensitive text-rich visual reasoning. In this paper, we introduce\nConTextual, a novel dataset featuring human-crafted instructions that require\ncontext-sensitive reasoning for text-rich images. We conduct experiments to\nassess the performance of 14 foundation models (GPT-4V, Gemini-Pro-Vision,\nLLaVA-Next) and establish a human performance baseline. Further, we perform\nhuman evaluations of the model responses and observe a significant performance\ngap of 30.8% between GPT-4V (the current best-performing Large Multimodal\nModel) and human performance. Our fine-grained analysis reveals that GPT-4V\nencounters difficulties interpreting time-related data and infographics.\nHowever, it demonstrates proficiency in comprehending abstract visual contexts\nsuch as memes and quotes. Finally, our qualitative analysis uncovers various\nfactors contributing to poor performance including lack of precise visual\nperception and hallucinations. Our dataset, code, and leaderboard can be found\non the project page https://con-textual.github.io/", "comment": null, "links": []}
{"entry_id": "2406.10424", "title": "What is the Visual Cognition Gap between Humans and Multimodal LLMs?", "authors": ["Xu Cao", "Bolin Lai", "Wenqian Ye", "Yunsheng Ma", "Joerg Heintz", "Jintai Chen", "Jianguo Cao", "James M. Rehg"], "published": "2024-06-14 22:02:21", "updated": "2024-06-14 22:02:21", "summary": "Recently, Multimodal Large Language Models (MLLMs) have shown great promise\nin language-guided perceptual tasks such as recognition, segmentation, and\nobject detection. However, their effectiveness in addressing visual cognition\nproblems that require high-level reasoning is not well-established. One such\nchallenge is abstract visual reasoning (AVR) -- the cognitive ability to\ndiscern relationships among patterns in a set of images and extrapolate to\npredict subsequent patterns. This skill is crucial during the early\nneurodevelopmental stages of children. Inspired by the AVR tasks in Raven's\nProgressive Matrices (RPM) and Wechsler Intelligence Scale for Children (WISC),\nwe propose a new dataset MaRs-VQA and a new benchmark VCog-Bench containing\nthree datasets to evaluate the zero-shot AVR capability of MLLMs and compare\ntheir performance with existing human intelligent investigation. Our\ncomparative experiments with different open-source and closed-source MLLMs on\nthe VCog-Bench revealed a gap between MLLMs and human intelligence,\nhighlighting the visual cognitive limitations of current MLLMs. We believe that\nthe public release of VCog-Bench, consisting of MaRs-VQA, and the inference\npipeline will drive progress toward the next generation of MLLMs with\nhuman-like visual cognition abilities.", "comment": "14 pages, 4 figures, the appendix will be updated soon", "links": []}
{"entry_id": "2406.09949", "title": "Neural Concept Binder", "authors": ["Wolfgang Stammer", "Antonia Wüst", "David Steinmann", "Kristian Kersting"], "published": "2024-06-14 11:52:09", "updated": "2024-06-14 11:52:09", "summary": "The challenge in object-based visual reasoning lies in generating descriptive\nyet distinct concept representations. Moreover, doing this in an unsupervised\nfashion requires human users to understand a model's learned concepts and\npotentially revise false concepts. In addressing this challenge, we introduce\nthe Neural Concept Binder, a new framework for deriving discrete concept\nrepresentations resulting in what we term \"concept-slot encodings\". These\nencodings leverage both \"soft binding\" via object-centric block-slot encodings\nand \"hard binding\" via retrieval-based inference. The Neural Concept Binder\nfacilitates straightforward concept inspection and direct integration of\nexternal knowledge, such as human input or insights from other AI models like\nGPT-4. Additionally, we demonstrate that incorporating the hard binding\nmechanism does not compromise performance; instead, it enables seamless\nintegration into both neural and symbolic modules for intricate reasoning\ntasks, as evidenced by evaluations on our newly introduced CLEVR-Sudoku\ndataset.", "comment": null, "links": []}
{"entry_id": "2305.17455", "title": "CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers", "authors": ["Dachuan Shi", "Chaofan Tao", "Anyi Rao", "Zhendong Yang", "Chun Yuan", "Jiaqi Wang"], "published": "2023-05-27 12:07:21", "updated": "2024-06-13 19:15:53", "summary": "Recent vision-language models have achieved tremendous advances. However,\ntheir computational costs are also escalating dramatically, making model\nacceleration exceedingly critical. To pursue more efficient vision-language\nTransformers, this paper introduces Cross-Guided Ensemble of Tokens (CrossGET),\na general acceleration framework for vision-language Transformers. This\nframework adaptively combines tokens in real-time during inference,\nsignificantly reducing computational costs while maintaining high performance.\nCrossGET features two primary innovations: 1) Cross-Guided Matching and\nEnsemble. CrossGET leverages cross-modal guided token matching and ensemble to\neffectively utilize cross-modal information, achieving wider applicability\nacross both modality-independent models, e.g., CLIP, and modality-dependent\nones, e.g., BLIP2. 2) Complete-Graph Soft Matching. CrossGET introduces an\nalgorithm for the token-matching mechanism, ensuring reliable matching results\nwhile facilitating parallelizability and high efficiency. Extensive experiments\nhave been conducted on various vision-language tasks, such as image-text\nretrieval, visual reasoning, image captioning, and visual question answering.\nThe performance on both classic multimodal architectures and emerging\nmultimodal LLMs demonstrates the framework's effectiveness and versatility. The\ncode is available at https://github.com/sdc17/CrossGET.", "comment": "ICML 2024. Code: https://github.com/sdc17/CrossGET", "links": []}
-{"entry_id": "2406.09403", "title": "Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models", "authors": ["Yushi Hu", "Weijia Shi", "Xingyu Fu", "Dan Roth", "Mari Ostendorf", "Luke Zettlemoyer", "Noah A Smith", "Ranjay Krishna"], "published": "2024-06-13 17:59:31", "updated": "2024-06-13 17:59:31", "summary": "Humans draw to facilitate reasoning: we draw auxiliary lines when solving\ngeometry problems; we mark and circle when reasoning on maps; we use sketches\nto amplify our ideas and relieve our limited-capacity working memory. However,\nsuch actions are missing in current multimodal language models (LMs). Current\nchain-of-thought and tool-use paradigms only use text as intermediate reasoning\nsteps. In this work, we introduce Sketchpad, a framework that gives multimodal\nLMs a visual sketchpad and tools to draw on the sketchpad. The LM conducts\nplanning and reasoning according to the visual artifacts it has drawn.\nDifferent from prior work, which uses text-to-image models to enable LMs to\ndraw, Sketchpad enables LMs to draw with lines, boxes, marks, etc., which is\ncloser to human sketching and better facilitates reasoning. Sketchpad can also\nuse specialist vision models during the sketching process (e.g., draw bounding\nboxes with object detection models, draw masks with segmentation models), to\nfurther enhance visual perception and reasoning. We experiment with a wide\nrange of math tasks (including geometry, functions, graphs, and chess) and\ncomplex visual reasoning tasks. Sketchpad substantially improves performance on\nall tasks over strong base models with no sketching, yielding an average gain\nof 12.7% on math tasks, and 8.6% on vision tasks. GPT-4o with Sketchpad sets a\nnew state of the art on all tasks, including V*Bench (80.3%), BLINK spatial\nreasoning (83.9%), and visual correspondence (80.8%). All codes and data are in\nhttps://visualsketchpad.github.io/.", "comment": "26 pages", "links": []}
{"entry_id": "2406.09240", "title": "Comparison Visual Instruction Tuning", "authors": ["Wei Lin", "Muhammad Jehanzeb Mirza", "Sivan Doveh", "Rogerio Feris", "Raja Giryes", "Sepp Hochreiter", "Leonid Karlinsky"], "published": "2024-06-13 15:43:59", "updated": "2024-06-13 15:43:59", "summary": "Comparing two images in terms of Commonalities and Differences (CaD) is a\nfundamental human capability that forms the basis of advanced visual reasoning\nand interpretation. It is essential for the generation of detailed and\ncontextually relevant descriptions, performing comparative analysis, novelty\ndetection, and making informed decisions based on visual data. However,\nsurprisingly, little attention has been given to these fundamental concepts in\nthe best current mimic of human visual intelligence - Large Multimodal Models\n(LMMs). We develop and contribute a new two-phase approach CaD-VI for\ncollecting synthetic visual instructions, together with an\ninstruction-following dataset CaD-Inst containing 349K image pairs with CaD\ninstructions collected using CaD-VI. Our approach significantly improves the\nCaD spotting capabilities in LMMs, advancing the SOTA on a diverse set of\nrelated tasks by up to 17.5%. It is also complementary to existing\ndifference-only instruction datasets, allowing automatic targeted refinement of\nthose resources increasing their effectiveness for CaD tuning by up to 10%.\nAdditionally, we propose an evaluation benchmark with 7.5K open-ended QAs to\nassess the CaD understanding abilities of LMMs.", "comment": "Project page: https://wlin-at.github.io/cad_vi ; Huggingface dataset\n repo: https://huggingface.co/datasets/wlin21at/CaD-Inst", "links": []}
{"entry_id": "2406.09105", "title": "INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance", "authors": ["Chenwei Lin", "Hanjia Lyu", "Xian Xu", "Jiebo Luo"], "published": "2024-06-13 13:31:49", "updated": "2024-06-13 13:31:49", "summary": "Large Vision-Language Models (LVLMs) have demonstrated outstanding\nperformance in various general multimodal applications such as image\nrecognition and visual reasoning, and have also shown promising potential in\nspecialized domains. However, the application potential of LVLMs in the\ninsurance domain-characterized by rich application scenarios and abundant\nmultimodal data-has not been effectively explored. There is no systematic\nreview of multimodal tasks in the insurance domain, nor a benchmark\nspecifically designed to evaluate the capabilities of LVLMs in insurance. This\ngap hinders the development of LVLMs within the insurance domain. In this\npaper, we systematically review and distill multimodal tasks for four\nrepresentative types of insurance: auto insurance, property insurance, health\ninsurance, and agricultural insurance. We propose INS-MMBench, the first\ncomprehensive LVLMs benchmark tailored for the insurance domain. INS-MMBench\ncomprises a total of 2.2K thoroughly designed multiple-choice questions,\ncovering 12 meta-tasks and 22 fundamental tasks. Furthermore, we evaluate\nmultiple representative LVLMs, including closed-source models such as GPT-4o\nand open-source models like BLIP-2. This evaluation not only validates the\neffectiveness of our benchmark but also provides an in-depth performance\nanalysis of current LVLMs on various multimodal tasks in the insurance domain.\nWe hope that INS-MMBench will facilitate the further application of LVLMs in\nthe insurance domain and inspire interdisciplinary development. Our dataset and\nevaluation code are available at https://github.com/FDU-INS/INS-MMBench.", "comment": null, "links": []}
{"entry_id": "2403.03173", "title": "Solving the Clustering Reasoning Problems by Modeling a Deep-Learning-Based Probabilistic Model", "authors": ["Ruizhuo Song", "Beiming Yuan"], "published": "2024-03-05 18:08:29", "updated": "2024-06-13 09:41:55", "summary": "Visual abstract reasoning problems pose significant challenges to the\nperception and cognition abilities of artificial intelligence algorithms,\ndemanding deeper pattern recognition and inductive reasoning beyond mere\nidentification of explicit image features. Research advancements in this field\noften provide insights and technical support for other similar domains. In this\nstudy, we introduce PMoC, a deep-learning-based probabilistic model, achieving\nhigh reasoning accuracy in the Bongard-Logo, which stands as one of the most\nchallenging clustering reasoning tasks. PMoC is a novel approach for\nconstructing probabilistic models based on deep learning, which is distinctly\ndifferent from previous techniques. PMoC revitalizes the probabilistic\napproach, which has been relatively weak in visual abstract reasoning. As a\nbonus, we also designed Pose-Transformer for complex visual abstract reasoning\ntasks. Inspired by capsule networks, it focuses on positional relationships in\nimage data, boosting accuracy when combined with PMoC. Our Pose-Transformer\neffectively addresses reasoning difficulties associated with changes in the\nposition of entities, outperforming previous models on RAVEN dataset, and the\nPGM dataset. RAVEN and PGM represent two significant progressive pattern\nreasoning problems. Finally, considering the deployment difficulties of\nPose-Transformer, we introduced Straw-Pose-Transformer, a lightweight version.\nThis study contributes to enhancing the capabilities of artificial intelligence\nin abstract reasoning, cognitive pattern, and probabilistic modeling of complex\nsystems.", "comment": "14 pages, 17 figures, 4 tables", "links": []}
@@ -66,11 +90,9 @@
{"entry_id": "2404.06405", "title": "Wu's Method can Boost Symbolic AI to Rival Silver Medalists and AlphaGeometry to Outperform Gold Medalists at IMO Geometry", "authors": ["Shiven Sinha", "Ameya Prabhu", "Ponnurangam Kumaraguru", "Siddharth Bhat", "Matthias Bethge"], "published": "2024-04-09 15:54:00", "updated": "2024-04-11 14:37:29", "summary": "Proving geometric theorems constitutes a hallmark of visual reasoning\ncombining both intuitive and logical skills. Therefore, automated theorem\nproving of Olympiad-level geometry problems is considered a notable milestone\nin human-level automated reasoning. The introduction of AlphaGeometry, a\nneuro-symbolic model trained with 100 million synthetic samples, marked a major\nbreakthrough. It solved 25 of 30 International Mathematical Olympiad (IMO)\nproblems whereas the reported baseline based on Wu's method solved only ten. In\nthis note, we revisit the IMO-AG-30 Challenge introduced with AlphaGeometry,\nand find that Wu's method is surprisingly strong. Wu's method alone can solve\n15 problems, and some of them are not solved by any of the other methods. This\nleads to two key findings: (i) Combining Wu's method with the classic synthetic\nmethods of deductive databases and angle, ratio, and distance chasing solves 21\nout of 30 methods by just using a CPU-only laptop with a time limit of 5\nminutes per problem. Essentially, this classic method solves just 4 problems\nless than AlphaGeometry and establishes the first fully symbolic baseline\nstrong enough to rival the performance of an IMO silver medalist. (ii) Wu's\nmethod even solves 2 of the 5 problems that AlphaGeometry failed to solve.\nThus, by combining AlphaGeometry with Wu's method we set a new state-of-the-art\nfor automated theorem proving on IMO-AG-30, solving 27 out of 30 problems, the\nfirst AI method which outperforms an IMO gold medalist.", "comment": "Work in Progress. Released for wider feedback", "links": []}
{"entry_id": "2306.13549", "title": "A Survey on Multimodal Large Language Models", "authors": ["Shukang Yin", "Chaoyou Fu", "Sirui Zhao", "Ke Li", "Xing Sun", "Tong Xu", "Enhong Chen"], "published": "2023-06-23 15:21:52", "updated": "2024-04-01 17:51:54", "summary": "Recently, Multimodal Large Language Model (MLLM) represented by GPT-4V has\nbeen a new rising research hotspot, which uses powerful Large Language Models\n(LLMs) as a brain to perform multimodal tasks. The surprising emergent\ncapabilities of MLLM, such as writing stories based on images and OCR-free math\nreasoning, are rare in traditional multimodal methods, suggesting a potential\npath to artificial general intelligence. To this end, both academia and\nindustry have endeavored to develop MLLMs that can compete with or even better\nthan GPT-4V, pushing the limit of research at a surprising speed. In this\npaper, we aim to trace and summarize the recent progress of MLLMs. First of\nall, we present the basic formulation of MLLM and delineate its related\nconcepts, including architecture, training strategy and data, as well as\nevaluation. Then, we introduce research topics about how MLLMs can be extended\nto support more granularity, modalities, languages, and scenarios. We continue\nwith multimodal hallucination and extended techniques, including Multimodal ICL\n(M-ICL), Multimodal CoT (M-CoT), and LLM-Aided Visual Reasoning (LAVR). To\nconclude the paper, we discuss existing challenges and point out promising\nresearch directions. In light of the fact that the era of MLLM has only just\nbegun, we will keep updating this survey and hope it can inspire more research.\nAn associated GitHub link collecting the latest papers is available at\nhttps://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.", "comment": "Project\n page:https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models", "links": []}
{"entry_id": "2311.17076", "title": "Compositional Chain-of-Thought Prompting for Large Multimodal Models", "authors": ["Chancharik Mitra", "Brandon Huang", "Trevor Darrell", "Roei Herzig"], "published": "2023-11-27 22:23:27", "updated": "2024-04-01 03:17:09", "summary": "The combination of strong visual backbones and Large Language Model (LLM)\nreasoning has led to Large Multimodal Models (LMMs) becoming the current\nstandard for a wide range of vision and language (VL) tasks. However, recent\nresearch has shown that even the most advanced LMMs still struggle to capture\naspects of compositional visual reasoning, such as attributes and relationships\nbetween objects. One solution is to utilize scene graphs (SGs)--a formalization\nof objects and their relations and attributes that has been extensively used as\na bridge between the visual and textual domains. Yet, scene graph data requires\nscene graph annotations, which are expensive to collect and thus not easily\nscalable. Moreover, finetuning an LMM based on SG data can lead to catastrophic\nforgetting of the pretraining objective. To overcome this, inspired by\nchain-of-thought methods, we propose Compositional Chain-of-Thought (CCoT), a\nnovel zero-shot Chain-of-Thought prompting method that utilizes SG\nrepresentations in order to extract compositional knowledge from an LMM.\nSpecifically, we first generate an SG using the LMM, and then use that SG in\nthe prompt to produce a response. Through extensive experiments, we find that\nthe proposed CCoT approach not only improves LMM performance on several vision\nand language VL compositional benchmarks but also improves the performance of\nseveral popular LMMs on general multimodal benchmarks, without the need for\nfine-tuning or annotated ground-truth SGs. Code:\nhttps://github.com/chancharikmitra/CCoT", "comment": null, "links": []}
-{"entry_id": "2403.16921", "title": "PropTest: Automatic Property Testing for Improved Visual Programming", "authors": ["Jaywon Koo", "Ziyan Yang", "Paola Cascante-Bonilla", "Baishakhi Ray", "Vicente Ordonez"], "published": "2024-03-25 16:39:15", "updated": "2024-03-25 16:39:15", "summary": "Visual Programming has emerged as an alternative to end-to-end black-box\nvisual reasoning models. This type of methods leverage Large Language Models\n(LLMs) to decompose a problem and generate the source code for an executable\ncomputer program. This strategy has the advantage of offering an interpretable\nreasoning path and does not require finetuning a model with task-specific data.\nWe propose PropTest, a general strategy that improves visual programming by\nfurther using an LLM to generate code that tests for visual properties in an\ninitial round of proposed solutions. Particularly, our method tests for\ndata-type consistency, as well as syntactic and semantic properties in the\ngenerated solutions. Our proposed solution outperforms baselines and achieves\ncomparable results to state-of-the-art methods while using smaller and publicly\navailable LLMs (CodeLlama-7B and WizardCoder-15B). This is demonstrated across\ndifferent benchmarks on visual question answering and referring expression\ncomprehension, showing the efficacy of our approach in enhancing the\nperformance and generalization of visual reasoning tasks. Specifically,\nPropTest improves ViperGPT by obtaining 48.66% accuracy (+8.3%) on the A-OKVQA\nbenchmark and 52.8% (+3.3%) on the RefCOCO+ benchmark using CodeLlama-7B.", "comment": "Project Page: https://jaywonkoo17.github.io/PropTest/", "links": []}
{"entry_id": "2403.14743", "title": "VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding", "authors": ["Ahmad Mahmood", "Ashmal Vayani", "Muzammal Naseer", "Salman Khan", "Fahad Shahbaz Khan"], "published": "2024-03-21 18:00:00", "updated": "2024-03-25 01:18:37", "summary": "Recent studies have demonstrated the effectiveness of Large Language Models\n(LLMs) as reasoning modules that can deconstruct complex tasks into more\nmanageable sub-tasks, particularly when applied to visual reasoning tasks for\nimages. In contrast, this paper introduces a Video Understanding and Reasoning\nFramework (VURF) based on the reasoning power of LLMs. Ours is a novel approach\nto extend the utility of LLMs in the context of video tasks, leveraging their\ncapacity to generalize from minimal input and output demonstrations within a\ncontextual framework. By presenting LLMs with pairs of instructions and their\ncorresponding high-level programs, we harness their contextual learning\ncapabilities to generate executable visual programs for video understanding. To\nenhance program's accuracy and robustness, we implement two important\nstrategies. Firstly, we employ a feedback-generation approach, powered by\nGPT-3.5, to rectify errors in programs utilizing unsupported functions.\nSecondly, taking motivation from recent works on self refinement of LLM\noutputs, we introduce an iterative procedure for improving the quality of the\nin-context examples by aligning the initial outputs to the outputs that would\nhave been generated had the LLM not been bound by the structure of the\nin-context examples. Our results on several video-specific tasks, including\nvisual QA, video anticipation, pose estimation and multi-video QA illustrate\nthe efficacy of these enhancements in improving the performance of visual\nprogramming approaches for video tasks.", "comment": null, "links": []}
{"entry_id": "2403.13666", "title": "Grounding Spatial Relations in Text-Only Language Models", "authors": ["Gorka Azkune", "Ander Salaberria", "Eneko Agirre"], "published": "2024-03-20 15:20:30", "updated": "2024-03-20 15:20:30", "summary": "This paper shows that text-only Language Models (LM) can learn to ground\nspatial relations like \"left of\" or \"below\" if they are provided with explicit\nlocation information of objects and they are properly trained to leverage those\nlocations. We perform experiments on a verbalized version of the Visual Spatial\nReasoning (VSR) dataset, where images are coupled with textual statements which\ncontain real or fake spatial relations between two objects of the image. We\nverbalize the images using an off-the-shelf object detector, adding location\ntokens to every object label to represent their bounding boxes in textual form.\nGiven the small size of VSR, we do not observe any improvement when using\nlocations, but pretraining the LM over a synthetic dataset automatically\nderived by us improves results significantly when using location tokens. We\nthus show that locations allow LMs to ground spatial relations, with our\ntext-only LMs outperforming Vision-and-Language Models and setting the new\nstate-of-the-art for the VSR dataset. Our analysis show that our text-only LMs\ncan generalize beyond the relations seen in the synthetic dataset to some\nextent, learning also more useful information than that encoded in the spatial\nrules we used to create the synthetic dataset itself.", "comment": "Accepted in Neural Networks", "links": ["http://dx.doi.org/10.1016/j.neunet.2023.11.031"]}
{"entry_id": "2309.04461", "title": "Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models", "authors": ["Yangyi Chen", "Karan Sikka", "Michael Cogswell", "Heng Ji", "Ajay Divakaran"], "published": "2023-09-08 17:49:44", "updated": "2024-03-19 21:48:59", "summary": "Vision-language models (VLMs) have recently demonstrated strong efficacy as\nvisual assistants that can parse natural queries about the visual content and\ngenerate human-like outputs. In this work, we explore the ability of these\nmodels to demonstrate human-like reasoning based on the perceived information.\nTo address a crucial concern regarding the extent to which their reasoning\ncapabilities are fully consistent and grounded, we also measure the reasoning\nconsistency of these models. We achieve this by proposing a chain-of-thought\n(CoT) based consistency measure. However, such an evaluation requires a\nbenchmark that encompasses both high-level inference and detailed reasoning\nchains, which is costly. We tackle this challenge by proposing a\nLLM-Human-in-the-Loop pipeline, which notably reduces cost while simultaneously\nensuring the generation of a high-quality dataset. Based on this pipeline and\nthe existing coarse-grained annotated dataset, we build the CURE benchmark to\nmeasure both the zero-shot reasoning performance and consistency of VLMs. We\nevaluate existing state-of-the-art VLMs, and find that even the best-performing\nmodel is unable to demonstrate strong visual reasoning capabilities and\nconsistency, indicating that substantial efforts are required to enable VLMs to\nperform visual reasoning as systematically and consistently as humans. As an\nearly step, we propose a two-stage training framework aimed at improving both\nthe reasoning performance and consistency of VLMs. The first stage involves\nemploying supervised fine-tuning of VLMs using step-by-step reasoning samples\nautomatically generated by LLMs. In the second stage, we further augment the\ntraining process by incorporating feedback provided by LLMs to produce\nreasoning chains that are highly consistent and grounded. We empirically\nhighlight the effectiveness of our framework in both reasoning performance and\nconsistency.", "comment": "NAACL 2024 Main Conference. The data is released at\n https://github.com/Yangyi-Chen/CoTConsistency", "links": []}
-{"entry_id": "2403.12884", "title": "HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning", "authors": ["Fucai Ke", "Zhixi Cai", "Simindokht Jahangard", "Weiqing Wang", "Pari Delir Haghighi", "Hamid Rezatofighi"], "published": "2024-03-19 16:31:30", "updated": "2024-03-19 16:31:30", "summary": "Recent advances in visual reasoning (VR), particularly with the aid of Large\nVision-Language Models (VLMs), show promise but require access to large-scale\ndatasets and face challenges such as high computational costs and limited\ngeneralization capabilities. Compositional visual reasoning approaches have\nemerged as effective strategies; however, they heavily rely on the commonsense\nknowledge encoded in Large Language Models (LLMs) to perform planning,\nreasoning, or both, without considering the effect of their decisions on the\nvisual reasoning process, which can lead to errors or failed procedures. To\naddress these challenges, we introduce HYDRA, a multi-stage dynamic\ncompositional visual reasoning framework designed for reliable and\nincrementally progressive general reasoning. HYDRA integrates three essential\nmodules: a planner, a Reinforcement Learning (RL) agent serving as a cognitive\ncontroller, and a reasoner. The planner and reasoner modules utilize an LLM to\ngenerate instruction samples and executable code from the selected instruction,\nrespectively, while the RL agent dynamically interacts with these modules,\nmaking high-level decisions on selection of the best instruction sample given\ninformation from the historical state stored through a feedback loop. This\nadaptable design enables HYDRA to adjust its actions based on previous feedback\nreceived during the reasoning process, leading to more reliable reasoning\noutputs and ultimately enhancing its overall effectiveness. Our framework\ndemonstrates state-of-the-art performance in various VR tasks on four different\nwidely-used datasets.", "comment": null, "links": []}
{"entry_id": "2310.10207", "title": "Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World", "authors": ["Rujie Wu", "Xiaojian Ma", "Zhenliang Zhang", "Wei Wang", "Qing Li", "Song-Chun Zhu", "Yizhou Wang"], "published": "2023-10-16 09:19:18", "updated": "2024-03-18 09:05:12", "summary": "We introduce Bongard-OpenWorld, a new benchmark for evaluating real-world\nfew-shot reasoning for machine vision. It originates from the classical Bongard\nProblems (BPs): Given two sets of images (positive and negative), the model\nneeds to identify the set that query images belong to by inducing the visual\nconcepts, which is exclusively depicted by images from the positive set. Our\nbenchmark inherits the few-shot concept induction of the original BPs while\nadding the two novel layers of challenge: 1) open-world free-form concepts, as\nthe visual concepts in Bongard-OpenWorld are unique compositions of terms from\nan open vocabulary, ranging from object categories to abstract visual\nattributes and commonsense factual knowledge; 2) real-world images, as opposed\nto the synthetic diagrams used by many counterparts. In our exploration,\nBongard-OpenWorld already imposes a significant challenge to current few-shot\nreasoning algorithms. We further investigate to which extent the recently\nintroduced Large Language Models (LLMs) and Vision-Language Models (VLMs) can\nsolve our task, by directly probing VLMs, and combining VLMs and LLMs in an\ninteractive reasoning scheme. We even conceived a neuro-symbolic reasoning\napproach that reconciles LLMs & VLMs with logical reasoning to emulate the\nhuman problem-solving process for Bongard Problems. However, none of these\napproaches manage to close the human-machine gap, as the best learner achieves\n64% accuracy while human participants easily reach 91%. We hope\nBongard-OpenWorld can help us better understand the limitations of current\nvisual intelligence and facilitate future research on visual agents with\nstronger few-shot visual reasoning capabilities.", "comment": "Accepted to ICLR 2024", "links": []}
{"entry_id": "2403.11513", "title": "Visual Preference Inference: An Image Sequence-Based Preference Reasoning in Tabletop Object Manipulation", "authors": ["Joonhyung Lee", "Sangbeom Park", "Yongin Kwon", "Jemin Lee", "Minwook Ahn", "Sungjoon Choi"], "published": "2024-03-18 06:54:38", "updated": "2024-03-18 06:54:38", "summary": "In robotic object manipulation, human preferences can often be influenced by\nthe visual attributes of objects, such as color and shape. These properties\nplay a crucial role in operating a robot to interact with objects and align\nwith human intention. In this paper, we focus on the problem of inferring\nunderlying human preferences from a sequence of raw visual observations in\ntabletop manipulation environments with a variety of object types, named Visual\nPreference Inference (VPI). To facilitate visual reasoning in the context of\nmanipulation, we introduce the Chain-of-Visual-Residuals (CoVR) method. CoVR\nemploys a prompting mechanism that describes the difference between the\nconsecutive images (i.e., visual residuals) and incorporates such texts with a\nsequence of images to infer the user's preference. This approach significantly\nenhances the ability to understand and adapt to dynamic changes in its visual\nenvironment during manipulation tasks. Furthermore, we incorporate such texts\nalong with a sequence of images to infer the user's preferences. Our method\noutperforms baseline methods in terms of extracting human preferences from\nvisual sequences in both simulation and real-world environments. Code and\nvideos are available at:\n\\href{https://joonhyung-lee.github.io/vpi/}{https://joonhyung-lee.github.io/vpi/}", "comment": "8 pages", "links": []}
{"entry_id": "2403.06059", "title": "Test-time Distribution Learning Adapter for Cross-modal Visual Reasoning", "authors": ["Yi Zhang", "Ce Zhang"], "published": "2024-03-10 01:34:45", "updated": "2024-03-10 01:34:45", "summary": "Vision-Language Pre-Trained (VLP) models, such as CLIP, have demonstrated\nremarkable effectiveness in learning generic visual representations. Several\napproaches aim to efficiently adapt VLP models to downstream tasks with limited\nsupervision, aiming to leverage the acquired knowledge from VLP models.\nHowever, these methods suffer from either introducing biased representations or\nrequiring high computational complexity, which hinders their effectiveness in\nfine-tuning the CLIP model. Moreover, when a model is trained on data specific\nto a particular domain, its ability to generalize to uncharted domains\ndiminishes. In this work, we propose Test-Time Distribution LearNing Adapter\n(TT-DNA) which directly works during the testing period. Specifically, we\nestimate Gaussian distributions to model visual features of the few-shot\nsupport images to capture the knowledge from the support set. The cosine\nsimilarity between query image and the feature distribution of support images\nis used as the prediction of visual adapter. Subsequently, the visual adapter's\nprediction merges with the original CLIP prediction via a residual connection,\nresulting in the final prediction. Our extensive experimental results on visual\nreasoning for human object interaction demonstrate that our proposed TT-DNA\noutperforms existing state-of-the-art methods by large margins.", "comment": "Accepted by ICASSP 2024", "links": []}
@@ -84,7 +106,6 @@
{"entry_id": "2308.11971", "title": "EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE", "authors": ["Junyi Chen", "Longteng Guo", "Jia Sun", "Shuai Shao", "Zehuan Yuan", "Liang Lin", "Dongyu Zhang"], "published": "2023-08-23 07:36:30", "updated": "2024-03-01 11:22:54", "summary": "Building scalable vision-language models to learn from diverse, multimodal\ndata remains an open challenge. In this paper, we introduce an Efficient\nVision-languagE foundation model, namely EVE, which is one unified multimodal\nTransformer pre-trained solely by one unified pre-training task. Specifically,\nEVE encodes both vision and language within a shared Transformer network\nintegrated with modality-aware sparse Mixture-of-Experts (MoE) modules, which\ncapture modality-specific information by selectively switching to different\nexperts. To unify pre-training tasks of vision and language, EVE performs\nmasked signal modeling on image-text pairs to reconstruct masked signals, i.e.,\nimage pixels and text tokens, given visible signals. This simple yet effective\npre-training objective accelerates training by 3.5x compared to the model\npre-trained with Image-Text Contrastive and Image-Text Matching losses. Owing\nto the combination of the unified architecture and pre-training task, EVE is\neasy to scale up, enabling better downstream performance with fewer resources\nand faster training speed. Despite its simplicity, EVE achieves\nstate-of-the-art performance on various vision-language downstream tasks,\nincluding visual question answering, visual reasoning, and image-text\nretrieval.", "comment": "Accepted by AAAI 2024", "links": []}
{"entry_id": "2403.00352", "title": "Revisiting Disentanglement in Downstream Tasks: A Study on Its Necessity for Abstract Visual Reasoning", "authors": ["Ruiqian Nai", "Zixin Wen", "Ji Li", "Yuanzhi Li", "Yang Gao"], "published": "2024-03-01 08:31:58", "updated": "2024-03-01 08:31:58", "summary": "In representation learning, a disentangled representation is highly desirable\nas it encodes generative factors of data in a separable and compact pattern.\nResearchers have advocated leveraging disentangled representations to complete\ndownstream tasks with encouraging empirical evidence. This paper further\ninvestigates the necessity of disentangled representation in downstream\napplications. Specifically, we show that dimension-wise disentangled\nrepresentations are unnecessary on a fundamental downstream task, abstract\nvisual reasoning. We provide extensive empirical evidence against the necessity\nof disentanglement, covering multiple datasets, representation learning\nmethods, and downstream network architectures. Furthermore, our findings\nsuggest that the informativeness of representations is a better indicator of\ndownstream performance than disentanglement. Finally, the positive correlation\nbetween informativeness and disentanglement explains the claimed usefulness of\ndisentangled representations in previous works. The source code is available at\nhttps://github.com/Richard-coder-Nai/disentanglement-lib-necessity.git.", "comment": "Accepted to AAAI-2024", "links": []}
{"entry_id": "2308.10562", "title": "Seeing the Intangible: Survey of Image Classification into High-Level and Abstract Categories", "authors": ["Delfina Sol Martinez Pandiani", "Valentina Presutti"], "published": "2023-08-21 08:37:04", "updated": "2024-02-29 16:18:45", "summary": "The field of Computer Vision (CV) is increasingly shifting towards\n``high-level'' visual sensemaking tasks, yet the exact nature of these tasks\nremains unclear and tacit. This survey paper addresses this ambiguity by\nsystematically reviewing research on high-level visual understanding, focusing\nparticularly on Abstract Concepts (ACs) in automatic image classification. Our\nsurvey contributes in three main ways: Firstly, it clarifies the tacit\nunderstanding of high-level semantics in CV through a multidisciplinary\nanalysis, and categorization into distinct clusters, including commonsense,\nemotional, aesthetic, and inductive interpretative semantics. Secondly, it\nidentifies and categorizes computer vision tasks associated with high-level\nvisual sensemaking, offering insights into the diverse research areas within\nthis domain. Lastly, it examines how abstract concepts such as values and\nideologies are handled in CV, revealing challenges and opportunities in\nAC-based image classification. Notably, our survey of AC image classification\ntasks highlights persistent challenges, such as the limited efficacy of massive\ndatasets and the importance of integrating supplementary information and\nmid-level features. We emphasize the growing relevance of hybrid AI systems in\naddressing the multifaceted nature of AC image classification tasks. Overall,\nthis survey enhances our understanding of high-level visual reasoning in CV and\nlays the groundwork for future research endeavors.", "comment": "Preprint", "links": []}
-{"entry_id": "2310.04671", "title": "Visual Abductive Reasoning Meets Driving Hazard Prediction", "authors": ["Korawat Charoenpitaks", "Van-Quang Nguyen", "Masanori Suganuma", "Masahiro Takahashi", "Ryoma Niihara", "Takayuki Okatani"], "published": "2023-10-07 03:16:30", "updated": "2024-02-27 14:22:09", "summary": "This paper addresses the problem of predicting hazards that drivers may\nencounter while driving a car. We formulate it as a task of anticipating\nimpending accidents using a single input image captured by car dashcams. Unlike\nexisting approaches to driving hazard prediction that rely on computational\nsimulations or anomaly detection from videos, this study focuses on high-level\ninference from static images. The problem needs predicting and reasoning about\nfuture events based on uncertain observations, which falls under visual\nabductive reasoning. To enable research in this understudied area, a new\ndataset named the DHPR (Driving Hazard Prediction and Reasoning) dataset is\ncreated. The dataset consists of 15K dashcam images of street scenes, and each\nimage is associated with a tuple containing car speed, a hypothesized hazard\ndescription, and visual entities present in the scene. These are annotated by\nhuman annotators, who identify risky scenes and provide descriptions of\npotential accidents that could occur a few seconds later. We present several\nbaseline methods and evaluate their performance on our dataset, identifying\nremaining issues and discussing future directions. This study contributes to\nthe field by introducing a novel problem formulation and dataset, enabling\nresearchers to explore the potential of multi-modal AI for driving hazard\nprediction.", "comment": "Main Paper: 10 pages, Supplementary Materials: 28 pages", "links": []}
{"entry_id": "2403.10534", "title": "VISREAS: Complex Visual Reasoning with Unanswerable Questions", "authors": ["Syeda Nahida Akter", "Sangwu Lee", "Yingshan Chang", "Yonatan Bisk", "Eric Nyberg"], "published": "2024-02-23 00:12:10", "updated": "2024-02-23 00:12:10", "summary": "Verifying a question's validity before answering is crucial in real-world\napplications, where users may provide imperfect instructions. In this scenario,\nan ideal model should address the discrepancies in the query and convey them to\nthe users rather than generating the best possible answer. Addressing this\nrequirement, we introduce a new compositional visual question-answering\ndataset, VISREAS, that consists of answerable and unanswerable visual queries\nformulated by traversing and perturbing commonalities and differences among\nobjects, attributes, and relations. VISREAS contains 2.07M semantically diverse\nqueries generated automatically using Visual Genome scene graphs. The unique\nfeature of this task, validating question answerability with respect to an\nimage before answering, and the poor performance of state-of-the-art models\ninspired the design of a new modular baseline, LOGIC2VISION that reasons by\nproducing and executing pseudocode without any external modules to generate the\nanswer. LOGIC2VISION outperforms generative models in VISREAS (+4.82% over\nLLaVA-1.5; +12.23% over InstructBLIP) and achieves a significant gain in\nperformance against the classification models.", "comment": "18 pages, 14 figures, 5 tables", "links": []}
{"entry_id": "2402.12675", "title": "Visual Reasoning in Object-Centric Deep Neural Networks: A Comparative Cognition Approach", "authors": ["Guillermo Puebla", "Jeffrey S. Bowers"], "published": "2024-02-20 02:48:14", "updated": "2024-02-20 02:48:14", "summary": "Achieving visual reasoning is a long-term goal of artificial intelligence. In\nthe last decade, several studies have applied deep neural networks (DNNs) to\nthe task of learning visual relations from images, with modest results in terms\nof generalization of the relations learned. However, in recent years,\nobject-centric representation learning has been put forward as a way to achieve\nvisual reasoning within the deep learning framework. Object-centric models\nattempt to model input scenes as compositions of objects and relations between\nthem. To this end, these models use several kinds of attention mechanisms to\nsegregate the individual objects in a scene from the background and from other\nobjects. In this work we tested relation learning and generalization in several\nobject-centric models, as well as a ResNet-50 baseline. In contrast to previous\nresearch, which has focused heavily in the same-different task in order to\nasses relational reasoning in DNNs, we use a set of tasks -- with varying\ndegrees of difficulty -- derived from the comparative cognition literature. Our\nresults show that object-centric models are able to segregate the different\nobjects in a scene, even in many out-of-distribution cases. In our simpler\ntasks, this improves their capacity to learn and generalize visual relations in\ncomparison to the ResNet-50 baseline. However, object-centric models still\nstruggle in our more difficult tasks and conditions. We conclude that abstract\nvisual reasoning remains an open challenge for DNNs, including object-centric\nmodels.", "comment": "16 pages, 14 figures", "links": []}
{"entry_id": "2402.11574", "title": "Visual In-Context Learning for Large Vision-Language Models", "authors": ["Yucheng Zhou", "Xiang Li", "Qianning Wang", "Jianbing Shen"], "published": "2024-02-18 12:43:38", "updated": "2024-02-18 12:43:38", "summary": "In Large Visual Language Models (LVLMs), the efficacy of In-Context Learning\n(ICL) remains limited by challenges in cross-modal interactions and\nrepresentation disparities. To overcome these challenges, we introduce a novel\nVisual In-Context Learning (VICL) method comprising Visual Demonstration\nRetrieval, Intent-Oriented Image Summarization, and Intent-Oriented\nDemonstration Composition. Our approach retrieves images via ''Retrieval &\nRerank'' paradigm, summarises images with task intent and task-specific visual\nparsing, and composes language-based demonstrations that reduce token count and\nalleviate cross-modal interaction problem. Experimental evaluations on five\nvisual reasoning datasets demonstrate the effectiveness of our method.\nMoreover, our extensive experiments leverage information flow analysis to\nelucidate the effectiveness of our method, and investigate the impact of length\nand position of demonstrations for LVLM. The use of in-context unlearning\nfurther shows promise in resetting specific model knowledge without retraining.", "comment": "13 pages, 7 figures", "links": []}
@@ -97,7 +118,6 @@
{"entry_id": "2401.11035", "title": "Image Safeguarding: Reasoning with Conditional Vision Language Model and Obfuscating Unsafe Content Counterfactually", "authors": ["Mazal Bethany", "Brandon Wherry", "Nishant Vishwamitra", "Peyman Najafirad"], "published": "2024-01-19 21:38:18", "updated": "2024-01-19 21:38:18", "summary": "Social media platforms are being increasingly used by malicious actors to\nshare unsafe content, such as images depicting sexual activity, cyberbullying,\nand self-harm. Consequently, major platforms use artificial intelligence (AI)\nand human moderation to obfuscate such images to make them safer. Two critical\nneeds for obfuscating unsafe images is that an accurate rationale for\nobfuscating image regions must be provided, and the sensitive regions should be\nobfuscated (\\textit{e.g.} blurring) for users' safety. This process involves\naddressing two key problems: (1) the reason for obfuscating unsafe images\ndemands the platform to provide an accurate rationale that must be grounded in\nunsafe image-specific attributes, and (2) the unsafe regions in the image must\nbe minimally obfuscated while still depicting the safe regions. In this work,\nwe address these key issues by first performing visual reasoning by designing a\nvisual reasoning model (VLM) conditioned on pre-trained unsafe image\nclassifiers to provide an accurate rationale grounded in unsafe image\nattributes, and then proposing a counterfactual explanation algorithm that\nminimally identifies and obfuscates unsafe regions for safe viewing, by first\nutilizing an unsafe image classifier attribution matrix to guide segmentation\nfor a more optimal subregion segmentation followed by an informed greedy search\nto determine the minimum number of subregions required to modify the\nclassifier's output based on attribution score. Extensive experiments on\nuncurated data from social networks emphasize the efficacy of our proposed\nmethod. We make our code available at:\nhttps://github.com/SecureAIAutonomyLab/ConditionalVLM", "comment": null, "links": []}
{"entry_id": "2212.08044", "title": "Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift", "authors": ["Jielin Qiu", "Yi Zhu", "Xingjian Shi", "Florian Wenzel", "Zhiqiang Tang", "Ding Zhao", "Bo Li", "Mu Li"], "published": "2022-12-15 18:52:03", "updated": "2024-01-19 15:29:34", "summary": "Multimodal image-text models have shown remarkable performance in the past\nfew years. However, evaluating robustness against distribution shifts is\ncrucial before adopting them in real-world applications. In this work, we\ninvestigate the robustness of 12 popular open-sourced image-text models under\ncommon perturbations on five tasks (image-text retrieval, visual reasoning,\nvisual entailment, image captioning, and text-to-image generation). In\nparticular, we propose several new multimodal robustness benchmarks by applying\n17 image perturbation and 16 text perturbation techniques on top of existing\ndatasets. We observe that multimodal models are not robust to image and text\nperturbations, especially to image perturbations. Among the tested perturbation\nmethods, character-level perturbations constitute the most severe distribution\nshift for text, and zoom blur is the most severe shift for image data. We also\nintroduce two new robustness metrics (\\textbf{MMI} for MultiModal Impact score\nand \\textbf{MOR} for Missing Object Rate) for proper evaluations of multimodal\nmodels. We hope our extensive study sheds light on new directions for the\ndevelopment of robust multimodal models. More details can be found on the\nproject webpage: \\url{https://MMRobustness.github.io}.", "comment": "Accepted by Journal of Data-centric Machine Learning Research (DMLR)\n 2024", "links": []}
{"entry_id": "2401.08695", "title": "Enabling Collaborative Clinical Diagnosis of Infectious Keratitis by Integrating Expert Knowledge and Interpretable Data-driven Intelligence", "authors": ["Zhengqing Fang", "Shuowen Zhou", "Zhouhang Yuan", "Yuxuan Si", "Mengze Li", "Jinxu Li", "Yesheng Xu", "Wenjia Xie", "Kun Kuang", "Yingming Li", "Fei Wu", "Yu-Feng Yao"], "published": "2024-01-14 02:10:54", "updated": "2024-01-14 02:10:54", "summary": "Although data-driven artificial intelligence (AI) in medical image diagnosis\nhas shown impressive performance in silico, the lack of interpretability makes\nit difficult to incorporate the \"black box\" into clinicians' workflows. To make\nthe diagnostic patterns learned from data understandable by clinicians, we\ndevelop an interpretable model, knowledge-guided diagnosis model (KGDM), that\nprovides a visualized reasoning process containing AI-based biomarkers and\nretrieved cases that with the same diagnostic patterns. It embraces clinicians'\nprompts into the interpreted reasoning through human-AI interaction, leading to\npotentially enhanced safety and more accurate predictions. This study\ninvestigates the performance, interpretability, and clinical utility of KGDM in\nthe diagnosis of infectious keratitis (IK), which is the leading cause of\ncorneal blindness. The classification performance of KGDM is evaluated on a\nprospective validation dataset, an external testing dataset, and an publicly\navailable testing dataset. The diagnostic odds ratios (DOR) of the interpreted\nAI-based biomarkers are effective, ranging from 3.011 to 35.233 and exhibit\nconsistent diagnostic patterns with clinic experience. Moreover, a human-AI\ncollaborative diagnosis test is conducted and the participants with\ncollaboration achieved a performance exceeding that of both humans and AI. By\nsynergistically integrating interpretability and interaction, this study\nfacilitates the convergence of clinicians' expertise and data-driven\nintelligence. The promotion of inexperienced ophthalmologists with the aid of\nAI-based biomarkers, as well as increased AI prediction by intervention from\nexperienced ones, demonstrate a promising diagnostic paradigm for infectious\nkeratitis using KGDM, which holds the potential for extension to other diseases\nwhere experienced medical practitioners are limited and the safety of AI is\nconcerned.", "comment": "33 pages", "links": []}
-{"entry_id": "2303.10428", "title": "A Region-Prompted Adapter Tuning for Visual Abductive Reasoning", "authors": ["Hao Zhang", "Yeo Keat Ee", "Basura Fernando"], "published": "2023-03-18 14:46:44", "updated": "2024-01-07 05:06:26", "summary": "Visual Abductive Reasoning is an emerging vision-language (VL) topic where\nthe model needs to retrieve/generate a likely textual hypothesis from a visual\ninput (image or its part) using backward reasoning based on commonsense. Unlike\nin conventional VL retrieval or captioning tasks, where entities of texts\nappear in the image, in abductive inferences, the relevant facts about\ninferences are not readily apparent in the input images. Besides, these\ninferences are causally linked to specific regional visual cues and would\nchange as cues change. Existing works highlight cues utilizing a specific\nprompt (e.g., colorful prompt). Then, a full fine-tuning of a VL foundation\nmodel is launched to tweak its function from perception to deduction. However,\nthe colorful prompt uniformly patchify ``regional hints'' and ``global\ncontext'' at the same granularity level and may lose fine-grained visual\ndetails crucial for VAR. Meanwhile, full fine-tuning of VLF on limited data\nwould easily be overfitted.\n To tackle this, we propose a simple yet effective Region-Prompted Adapter\n(RPA), a hybrid parameter-efficient fine-tuning method that leverages the\nstrengths of detailed cues and efficient training for the VAR task.\nRPA~consists of two novel modules: Regional Prompt Generator (RPG) and\nAdapter$^\\textbf{+}$. The prior encodes ``regional visual hints'' and ``global\ncontexts'' into visual prompts separately at fine and coarse-grained levels.\nThe latter extends the vanilla adapters with a new Map Adapter, which modifies\nthe attention map using a trainable low-dim query/key projection. Additionally,\nwe propose a new Dual-Contrastive Loss to regress the visual feature toward\nfeatures of factual description and plausible hypothesis. Experiments on the\nSherlock demonstrate that RPA outperforms previous SOTAs, achieving the 1st\nrank on leaderboards (Comparison to Human Accuracy: RPA~31.74 vs CPT-CLIP\n29.58).", "comment": "13 pages, 11 figures, Under Review of IEEE Transaction", "links": []}
{"entry_id": "2301.13335", "title": "Multi-modal Large Language Model Enhanced Pseudo 3D Perception Framework for Visual Commonsense Reasoning", "authors": ["Jian Zhu", "Hanli Wang", "Miaojing Shi"], "published": "2023-01-30 23:43:28", "updated": "2023-12-25 12:59:02", "summary": "The visual commonsense reasoning (VCR) task is to choose an answer and\nprovide a justifying rationale based on the given image and textural question.\nRepresentative works first recognize objects in images and then associate them\nwith key words in texts. However, existing approaches do not consider exact\npositions of objects in a human-like three-dimensional (3D) manner, making them\nincompetent to accurately distinguish objects and understand visual relation.\nRecently, multi-modal large language models (MLLMs) have been used as powerful\ntools for several multi-modal tasks but not for VCR yet, which requires\nelaborate reasoning on specific visual objects referred by texts. In light of\nthe above, an MLLM enhanced pseudo 3D perception framework is designed for VCR.\nSpecifically, we first demonstrate that the relation between objects is\nrelevant to object depths in images, and hence introduce object depth into VCR\nframeworks to infer 3D positions of objects in images. Then, a depth-aware\nTransformer is proposed to encode depth differences between objects into the\nattention mechanism of Transformer to discriminatively associate objects with\nvisual scenes guided by depth. To further associate the answer with the depth\nof visual scene, each word in the answer is tagged with a pseudo depth to\nrealize depth-aware association between answer words and objects. On the other\nhand, BLIP-2 as an MLLM is employed to process images and texts, and the\nreferring expressions in texts involving specific visual objects are modified\nwith linguistic object labels to serve as comprehensible MLLM inputs. Finally,\na parameter optimization technique is devised to fully consider the quality of\ndata batches based on multi-level reasoning confidence. Experiments on the VCR\ndataset demonstrate the superiority of the proposed framework over\nstate-of-the-art approaches.", "comment": null, "links": []}
{"entry_id": "2312.14233", "title": "VCoder: Versatile Vision Encoders for Multimodal Large Language Models", "authors": ["Jitesh Jain", "Jianwei Yang", "Humphrey Shi"], "published": "2023-12-21 18:49:47", "updated": "2023-12-21 18:49:47", "summary": "Humans possess the remarkable skill of Visual Perception, the ability to see\nand understand the seen, helping them make sense of the visual world and, in\nturn, reason. Multimodal Large Language Models (MLLM) have recently achieved\nimpressive performance on vision-language tasks ranging from visual\nquestion-answering and image captioning to visual reasoning and image\ngeneration. However, when prompted to identify or count (perceive) the entities\nin a given image, existing MLLM systems fail. Working towards developing an\naccurate MLLM system for perception and reasoning, we propose using Versatile\nvision enCoders (VCoder) as perception eyes for Multimodal LLMs. We feed the\nVCoder with perception modalities such as segmentation or depth maps, improving\nthe MLLM's perception abilities. Secondly, we leverage the images from COCO and\noutputs from off-the-shelf vision perception models to create our COCO\nSegmentation Text (COST) dataset for training and evaluating MLLMs on the\nobject perception task. Thirdly, we introduce metrics to assess the object\nperception abilities in MLLMs on our COST dataset. Lastly, we provide extensive\nexperimental evidence proving the VCoder's improved object-level perception\nskills over existing Multimodal LLMs, including GPT-4V. We open-source our\ndataset, code, and models to promote research. We open-source our code at\nhttps://github.com/SHI-Labs/VCoder", "comment": "Project Page: https://praeclarumjj3.github.io/vcoder/", "links": []}
{"entry_id": "2312.12436", "title": "A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise", "authors": ["Chaoyou Fu", "Renrui Zhang", "Zihan Wang", "Yubo Huang", "Zhengye Zhang", "Longtian Qiu", "Gaoxiang Ye", "Yunhang Shen", "Mengdan Zhang", "Peixian Chen", "Sirui Zhao", "Shaohui Lin", "Deqiang Jiang", "Di Yin", "Peng Gao", "Ke Li", "Hongsheng Li", "Xing Sun"], "published": "2023-12-19 18:59:22", "updated": "2023-12-20 12:40:47", "summary": "The surge of interest towards Multi-modal Large Language Models (MLLMs),\ne.g., GPT-4V(ision) from OpenAI, has marked a significant trend in both\nacademia and industry. They endow Large Language Models (LLMs) with powerful\ncapabilities in visual understanding, enabling them to tackle diverse\nmulti-modal tasks. Very recently, Google released Gemini, its newest and most\ncapable MLLM built from the ground up for multi-modality. In light of the\nsuperior reasoning capabilities, can Gemini challenge GPT-4V's leading position\nin multi-modal learning? In this paper, we present a preliminary exploration of\nGemini Pro's visual understanding proficiency, which comprehensively covers\nfour domains: fundamental perception, advanced cognition, challenging vision\ntasks, and various expert capacities. We compare Gemini Pro with the\nstate-of-the-art GPT-4V to evaluate its upper limits, along with the latest\nopen-sourced MLLM, Sphinx, which reveals the gap between manual efforts and\nblack-box systems. The qualitative samples indicate that, while GPT-4V and\nGemini showcase different answering styles and preferences, they can exhibit\ncomparable visual reasoning capabilities, and Sphinx still trails behind them\nconcerning domain generalizability. Specifically, GPT-4V tends to elaborate\ndetailed explanations and intermediate steps, and Gemini prefers to output a\ndirect and concise answer. The quantitative evaluation on the popular MME\nbenchmark also demonstrates the potential of Gemini to be a strong challenger\nto GPT-4V. Our early investigation of Gemini also observes some common issues\nof MLLMs, indicating that there still remains a considerable distance towards\nartificial general intelligence. Our project for tracking the progress of MLLM\nis released at\nhttps://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.", "comment": "Total 120 pages. See our project at\n https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models", "links": []}
diff --git a/arxiv_visual_reasoning.md b/arxiv_visual_reasoning.md
index aeae885..27f26f8 100644
--- a/arxiv_visual_reasoning.md
+++ b/arxiv_visual_reasoning.md
@@ -8,11 +8,775 @@ and is automatically generated by [update_arxiv.py](./tool/update_arxiv.py).
-Last update: 2024-07-01 08:03:44
+Last update: 2024-08-01 08:02:40
___
-## [From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis](https://arxiv.org/pdf/2406.19934) [New]
+## [A Plug-and-Play Method for Rare Human-Object Interactions Detection by Bridging Domain Gap](https://arxiv.org/pdf/2407.21438) [New]
+
+*Lijun Zhang, Wei Suo, Peng Wang, Yanning Zhang*
+
+**Abstract:** Human-object interactions (HOI) detection aims at capturing human-object
+pairs in images and corresponding actions. It is an important step toward
+high-level visual reasoning and scene understanding. However, due to the
+natural bias from the real world, existing methods mostly struggle with rare
+human-object pairs and lead to sub-optimal results. Recently, with the
+development of the generative model, a straightforward approach is to construct
+a more balanced dataset based on a group of supplementary samples.
+Unfortunately, there is a significant domain gap between the generated data and
+the original data, and simply merging the generated images into the original
+dataset cannot significantly boost the performance. To alleviate the above
+problem, we present a novel model-agnostic framework called
+**C**ontext-**E**nhanced **F**eature **A**lignment (CEFA)
+module, which can effectively align the generated data with the original data
+at the feature level and bridge the domain gap. Specifically, CEFA consists of
+a feature alignment module and a context enhancement module. On one hand,
+considering the crucial role of human-object pairs information in HOI tasks,
+the feature alignment module aligns the human-object pairs by aggregating
+instance information. On the other hand, to mitigate the issue of losing
+important context information caused by the traditional discriminator-style
+alignment method, we employ a context-enhanced image reconstruction module to
+improve the model's learning ability of contextual cues. Extensive experiments
+have shown that our method can serve as a plug-and-play module to improve the
+detection performance of HOI models on rare
+categories (code: https://github.com/LijunZhang01/CEFA).
+
+**published:** *2024-07-31 08:42:48*, **updated:** *2024-07-31 08:42:48*
+
+
+
+## [Chat2Layout: Interactive 3D Furniture Layout with a Multimodal LLM](https://arxiv.org/pdf/2407.21333) [New]
+
+*Can Wang, Hongliang Zhong, Menglei Chai, Mingming He, Dongdong Chen, Jing Liao*
+
+**Abstract:** Automatic furniture layout is long desired for convenient interior design.
+Leveraging the remarkable visual reasoning capabilities of multimodal large
+language models (MLLMs), recent methods address layout generation in a static
+manner, lacking the feedback-driven refinement essential for interactive user
+engagement. We introduce Chat2Layout, a novel interactive furniture layout
+generation system that extends the functionality of MLLMs into the realm of
+interactive layout design. To achieve this, we establish a unified
+vision-question paradigm for in-context learning, enabling seamless
+communication with MLLMs to steer their behavior without altering model
+weights. Within this framework, we present a novel training-free visual
+prompting mechanism. This involves a visual-text prompting technique that
+assists MLLMs in reasoning about plausible layout plans, followed by an
+Offline-to-Online search (O2O-Search) method, which automatically identifies
+the minimal set of informative references to provide exemplars for visual-text
+prompting. By employing an agent system with MLLMs as the core controller, we
+enable bidirectional interaction. The agent not only comprehends the 3D
+environment and user requirements through linguistic and visual perception but
+also plans tasks and reasons about actions to generate and arrange furniture
+within the virtual space. Furthermore, the agent iteratively updates based on
+visual feedback from execution results. Experimental results demonstrate that
+our approach facilitates language-interactive generation and arrangement for
+diverse and complex 3D furniture.
+
+**comment:** *Main paper with supplemental materials*
+
+**published:** *2024-07-31 04:49:46*, **updated:** *2024-07-31 04:49:46*
+
+
+
+## [Pyramid Coder: Hierarchical Code Generator for Compositional Visual Question Answering](https://arxiv.org/pdf/2407.20563) [New]
+
+*Ruoyue Shen, Nakamasa Inoue, Koichi Shinoda*
+
+**Abstract:** Visual question answering (VQA) is the task of providing accurate answers to
+natural language questions based on visual input. Programmatic VQA (PVQA)
+models have been gaining attention recently. These use large language models
+(LLMs) to formulate executable programs that address questions requiring
+complex visual reasoning. However, there are challenges in enabling LLMs to
+comprehend the usage of image processing modules and generate relevant code. To
+overcome these challenges, this paper introduces PyramidCoder, a novel
+prompting framework for PVQA models. PyramidCoder consists of three
+hierarchical levels, each serving a distinct purpose: query rephrasing, code
+generation, and answer aggregation. Notably, PyramidCoder utilizes a single
+frozen LLM and pre-defined prompts at each level, eliminating the need for
+additional training and ensuring flexibility across various LLM architectures.
+Compared to the state-of-the-art PVQA model, our approach improves accuracy by
+at least 0.5% on the GQA dataset, 1.4% on the VQAv2 dataset, and 2.9% on the
+NLVR2 dataset.
+
+**comment:** *Accepted to the IEEE International Conference on Image Processing
+ (IEEE ICIP) 2024*
+
+**published:** *2024-07-30 05:36:43*, **updated:** *2024-07-30 05:36:43*
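+
+The three-level prompting flow above lends itself to a compact sketch. The
+snippet below only illustrates the idea (query rephrasing, per-query code
+generation, answer aggregation); `call_llm`, the prompt strings, the majority
+vote, and the caller-supplied `run_program` executor are assumptions, not the
+paper's implementation.
+
+```python
+# Rough sketch of a three-level prompting flow:
+# query rephrasing -> code generation -> answer aggregation.
+from collections import Counter
+
+
+def call_llm(prompt: str) -> str:
+    raise NotImplementedError("plug in any single frozen LLM here")
+
+
+def rephrase(question: str, n: int = 3) -> list[str]:
+    # Level 1: produce several paraphrases of the original question.
+    return [call_llm(f"Rephrase the question, variant {i}: {question}")
+            for i in range(n)]
+
+
+def generate_program(question: str) -> str:
+    # Level 2: turn one (rephrased) question into an executable program.
+    return call_llm(f"Write Python code that answers: {question}")
+
+
+def aggregate(answers: list[str]) -> str:
+    # Level 3: majority vote over the answers produced by the candidates.
+    return Counter(answers).most_common(1)[0][0]
+
+
+def pyramid_answer(question: str, run_program) -> str:
+    candidates = [generate_program(q) for q in rephrase(question)]
+    return aggregate([run_program(code) for code in candidates])
+```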
+
+
+
+## [Take A Step Back: Rethinking the Two Stages in Visual Reasoning](https://arxiv.org/pdf/2407.19666) [New]
+
+*Mingyu Zhang, Jiting Cai, Mingyu Liu, Yue Xu, Cewu Lu, Yong-Lu Li*
+
+**Abstract:** Visual reasoning, as a prominent research area, plays a crucial role in AI by
+facilitating concept formation and interaction with the world. However, current
+works are usually carried out separately on small datasets thus lacking
+generalization ability. Through rigorous evaluation of diverse benchmarks, we
+demonstrate the shortcomings of existing ad-hoc methods in achieving
+cross-domain reasoning and their tendency to fit data biases. In this paper,
+we revisit visual reasoning with a two-stage perspective: (1) symbolization and
+(2) logical reasoning given symbols or their representations. We find that the
+reasoning stage is better at generalization than symbolization. Thus, it is
+more efficient to implement symbolization via separated encoders for different
+data domains while using a shared reasoner. Given our findings, we establish
+design principles for visual reasoning frameworks following the separated
+symbolization and shared reasoning. The proposed two-stage framework achieves
+impressive generalization ability on various visual reasoning tasks, including
+puzzles, physical prediction, and visual question answering (VQA), encompassing
+both 2D and 3D modalities. We believe our insights will pave the way for
+generalizable visual reasoning.
+
+**comment:** *ECCV 2024, Project page:
+ https://mybearyzhang.github.io/projects/TwoStageReason/*
+
+**published:** *2024-07-29 02:56:19*, **updated:** *2024-07-29 02:56:19*
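+
+A minimal PyTorch sketch of the stated design principle, separate
+symbolization encoders per data domain feeding one shared reasoner, is given
+below; the dimensions, module choices, and class name are illustrative
+assumptions rather than the paper's architecture.
+
+```python
+# Minimal sketch of "separated symbolization + shared reasoning":
+# one encoder per data domain maps raw inputs into a common symbol space,
+# and a single reasoner is shared across all domains.
+import torch
+import torch.nn as nn
+
+
+class TwoStageReasoner(nn.Module):
+    def __init__(self, domain_dims: dict[str, int], symbol_dim=256, n_classes=10):
+        super().__init__()
+        # Stage 1: domain-specific symbolization encoders.
+        self.encoders = nn.ModuleDict({
+            name: nn.Sequential(nn.Linear(d, symbol_dim), nn.ReLU())
+            for name, d in domain_dims.items()
+        })
+        # Stage 2: reasoner shared by every domain.
+        self.reasoner = nn.Sequential(
+            nn.Linear(symbol_dim, symbol_dim), nn.ReLU(),
+            nn.Linear(symbol_dim, n_classes),
+        )
+
+    def forward(self, x: torch.Tensor, domain: str) -> torch.Tensor:
+        symbols = self.encoders[domain](x)   # symbolization
+        return self.reasoner(symbols)        # shared logical reasoning
+
+
+model = TwoStageReasoner({"puzzle": 128, "vqa": 512})
+logits = model(torch.randn(4, 512), domain="vqa")
+```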
+
+
+
+## [Solving Robotics Problems in Zero-Shot with Vision-Language Models](https://arxiv.org/pdf/2407.19094) [New]
+
+*Zidan Wang, Rui Shen, Bradly Stadie*
+
+**Abstract:** We introduce Wonderful Team, a multi-agent visual LLM (VLLM) framework for
+solving robotics problems in the zero-shot regime. By zero-shot we mean that,
+for a novel environment, we feed a VLLM an image of the robot's environment and
+a description of the task, and have the VLLM output the sequence of actions
+necessary for the robot to complete the task. Prior work on VLLMs in robotics
+has largely focused on settings where some part of the pipeline is fine-tuned,
+such as tuning an LLM on robot data or training a separate vision encoder for
+perception and action generation. Surprisingly, due to recent advances in the
+capabilities of VLLMs, this type of fine-tuning may no longer be necessary for
+many tasks. In this work, we show that with careful engineering, we can prompt
+a single off-the-shelf VLLM to handle all aspects of a robotics task, from
+high-level planning to low-level location-extraction and action-execution.
+Wonderful Team builds on recent advances in multi-agent LLMs to partition tasks
+across an agent hierarchy, making it self-corrective and able to effectively
+partition and solve even long-horizon tasks. Extensive experiments on VIMABench
+and real-world robotic environments demonstrate the system's capability to
+handle a variety of robotic tasks, including manipulation, visual
+goal-reaching, and visual reasoning, all in a zero-shot manner. These results
+underscore a key point: vision-language models have progressed rapidly in the
+past year, and should strongly be considered as a backbone for robotics
+problems going forward.
+
+**comment:** *aka Wonderful Team*
+
+**published:** *2024-07-26 21:18:57*, **updated:** *2024-07-26 21:18:57*
+
+
+
+## [Investigating learning-independent abstract reasoning in artificial neural networks](https://arxiv.org/pdf/2407.17791) [New]
+
+*Tomer Barak, Yonatan Loewenstein*
+
+**Abstract:** Humans are capable of solving complex abstract reasoning tests. Whether this
+ability reflects a learning-independent inference mechanism applicable to any
+novel unlearned problem or whether it is a manifestation of extensive training
+throughout life is an open question. Addressing this question in humans is
+challenging because it is impossible to control their prior training. However,
+assuming a similarity between the cognitive processing of Artificial Neural
+Networks (ANNs) and humans, the extent to which training is required for ANNs'
+abstract reasoning is informative about this question in humans. Previous
+studies demonstrated that ANNs can solve abstract reasoning tests. However,
+this success required extensive training. In this study, we examined the
+learning-independent abstract reasoning of ANNs. Specifically, we evaluated
+their performance without any pretraining: the ANNs' weights were randomly
+initialized and changed only in the process of problem solving. We
+found that naive ANN models can solve non-trivial visual reasoning tests,
+similar to those used to evaluate human learning-independent reasoning. We
+further studied the mechanisms that support this ability. Our results suggest
+the possibility of learning-independent abstract reasoning that does not
+require extensive training.
+
+**published:** *2024-07-25 05:58:58*, **updated:** *2024-07-25 05:58:58*
+
+
+
+## [KiVA: Kid-inspired Visual Analogies for Testing Large Multimodal Models](https://arxiv.org/pdf/2407.17773) [New]
+
+*Eunice Yiu, Maan Qraitem, Charlie Wong, Anisa Noor Majhi, Yutong Bai, Shiry Ginosar, Alison Gopnik, Kate Saenko*
+
+**Abstract:** This paper investigates visual analogical reasoning in large multimodal
+models (LMMs) compared to human adults and children. A "visual analogy" is an
+abstract rule inferred from one image and applied to another. While benchmarks
+exist for testing visual reasoning in LMMs, they require advanced skills and
+omit basic visual analogies that even young children can make. Inspired by
+developmental psychology, we propose a new benchmark of 1,400 visual
+transformations of everyday objects to test LMMs on visual analogical reasoning
+and compare them to children and adults. We structure the evaluation into three
+stages: identifying what changed (e.g., color, number, etc.), how it changed
+(e.g., added one object), and applying the rule to new scenarios. Our findings
+show that while models like GPT-4V, LLaVA-1.5, and MANTIS identify the "what"
+effectively, they struggle with quantifying the "how" and extrapolating this
+rule to new objects. In contrast, children and adults exhibit much stronger
+analogical reasoning at all three stages. Additionally, the strongest tested
+model, GPT-4V, performs better in tasks involving simple visual attributes like
+color and size, correlating with quicker human adult response times.
+Conversely, more complex tasks such as number, rotation, and reflection, which
+necessitate extensive cognitive processing and understanding of the 3D physical
+world, present more significant challenges. Altogether, these findings
+highlight the limitations of training models on data that primarily consists of
+2D images and text.
+
+**comment:** *9 pages. For the KiVA benchmark, see https://github.com/ey242/KiVA*
+
+**published:** *2024-07-25 05:02:39*, **updated:** *2024-07-25 05:02:39*
+
+
+
+## [Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model](https://arxiv.org/pdf/2407.07053) [New]
+
+*Wenqi Zhang, Zhenglin Cheng, Yuanyu He, Mengna Wang, Yongliang Shen, Zeqi Tan, Guiyang Hou, Mingqian He, Yanna Ma, Weiming Lu, Yueting Zhuang*
+
+**Abstract:** Although most current large multimodal models (LMMs) can already understand
+photos of natural scenes and portraits, their understanding of abstract images,
+e.g., charts, maps, or layouts, and visual reasoning capabilities remains quite
+rudimentary. They often struggle with simple daily tasks, such as reading time
+from a clock, understanding a flowchart, or planning a route using a road map.
+In light of this, we design a multi-modal self-instruct strategy, utilizing large
+language models and their code capabilities to synthesize massive abstract
+images and visual reasoning instructions across daily scenarios. Our strategy
+effortlessly creates a multimodal benchmark with 11,193 instructions for eight
+visual scenarios: charts, tables, simulated maps, dashboards, flowcharts,
+relation graphs, floor plans, and visual puzzles. **This benchmark,
+constructed with simple lines and geometric elements, exposes the shortcomings
+of most advanced LMMs** like Claude-3.5-Sonnet and GPT-4o in abstract image
+understanding, spatial relations reasoning, and visual element induction.
+Besides, to verify the quality of our synthetic data, we fine-tune an LMM using
+62,476 synthetic chart, table and road map instructions. The results
+demonstrate improved chart understanding and map navigation performance, and
+also demonstrate potential benefits for other visual reasoning tasks. Our code
+is available at: https://github.com/zwq2018/Multi-modal-Self-instruct.
+
+**comment:** *code: https://github.com/zwq2018/Multi-modal-Self-instruct dataset:
+ https://huggingface.co/datasets/zwq2018/Multi-modal-Self-instruct
+ Leaderboard: https://multi-modal-self-instruct.github.io/*
+
+**published:** *2024-07-09 17:18:27*, **updated:** *2024-07-23 17:12:12*
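+
+As a rough illustration of the synthesis idea, code can render an abstract
+image and emit a paired instruction; the clock-drawing routine and the
+question/answer template below are assumptions for demonstration, not the
+released pipeline.
+
+```python
+# Synthesize one abstract image (a clock face) plus a paired instruction.
+import math
+import matplotlib.pyplot as plt
+
+
+def draw_clock(hour: int, minute: int, path: str = "clock.png") -> str:
+    fig, ax = plt.subplots(figsize=(3, 3))
+    ax.add_patch(plt.Circle((0, 0), 1.0, fill=False, linewidth=2))
+    # Hour hand (short) and minute hand (long); angles measured from 12 o'clock.
+    hour_angle = math.radians(90 - 30 * (hour % 12 + minute / 60))
+    minute_angle = math.radians(90 - 6 * minute)
+    ax.plot([0, 0.5 * math.cos(hour_angle)], [0, 0.5 * math.sin(hour_angle)], lw=3)
+    ax.plot([0, 0.8 * math.cos(minute_angle)], [0, 0.8 * math.sin(minute_angle)], lw=2)
+    ax.set_xlim(-1.2, 1.2)
+    ax.set_ylim(-1.2, 1.2)
+    ax.set_aspect("equal")
+    ax.axis("off")
+    fig.savefig(path)
+    plt.close(fig)
+    return path
+
+
+image = draw_clock(3, 30)
+instruction = {"image": image,
+               "question": "What time does the clock show?",
+               "answer": "3:30"}
+```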
+
+
+
+## [PropTest: Automatic Property Testing for Improved Visual Programming](https://arxiv.org/pdf/2403.16921) [New]
+
+*Jaywon Koo, Ziyan Yang, Paola Cascante-Bonilla, Baishakhi Ray, Vicente Ordonez*
+
+**Abstract:** Visual Programming has recently emerged as an alternative to end-to-end
+black-box visual reasoning models. This type of method leverages Large Language
+Models (LLMs) to generate the source code for an executable computer program
+that solves a given problem. This strategy has the advantage of offering an
+interpretable reasoning path and does not require finetuning a model with
+task-specific data. We propose PropTest, a general strategy that improves
+visual programming by further using an LLM to generate code that tests for
+visual properties in an initial round of proposed solutions. Our method
+generates tests for data-type consistency, output syntax, and semantic
+properties. PropTest achieves comparable results to state-of-the-art methods
+while using publicly available LLMs. This is demonstrated across different
+benchmarks on visual question answering and referring expression comprehension.
+Particularly, PropTest improves ViperGPT by obtaining 46.1% accuracy (+6.0%)
+on GQA using Llama3-8B and 59.5% (+8.1%) on RefCOCO+ using CodeLlama-34B.
+
+**comment:** *Project Page: https://jaywonkoo17.github.io/PropTest/*
+
+**published:** *2024-03-25 16:39:15*, **updated:** *2024-07-22 23:21:33*
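+
+A minimal sketch of the property-testing idea: candidate programs proposed by
+an LLM are screened with simple, automatically generated checks before one
+answer is accepted. `ask_llm`, `run_program`, and the concrete property checks
+are hypothetical stand-ins, not the paper's implementation.
+
+```python
+# Screen LLM-proposed programs with simple property tests before answering.
+def ask_llm(prompt: str) -> str:
+    raise NotImplementedError("plug in any code-generation LLM")
+
+
+def property_tests_for(question: str):
+    # e.g. a yes/no question should yield a short string in {"yes", "no"}.
+    if question.lower().startswith(("is ", "are ", "does ")):
+        return [lambda out: isinstance(out, str),
+                lambda out: str(out).strip().lower() in {"yes", "no"}]
+    return [lambda out: out is not None]
+
+
+def solve(question: str, run_program, n_candidates: int = 3):
+    tests = property_tests_for(question)
+    best, best_score = None, -1
+    for i in range(n_candidates):
+        code = ask_llm(f"# attempt {i}\nWrite a program answering: {question}")
+        output = run_program(code)
+        score = sum(t(output) for t in tests)   # how many properties hold
+        if score > best_score:
+            best, best_score = output, score
+    return best
+```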
+
+
+
+## [TokenPacker: Efficient Visual Projector for Multimodal LLM](https://arxiv.org/pdf/2407.02392) [New]
+
+*Wentong Li, Yuqian Yuan, Jian Liu, Dongqi Tang, Song Wang, Jianke Zhu, Lei Zhang*
+
+**Abstract:** The visual projector serves as an essential bridge between the visual encoder
+and the Large Language Model (LLM) in a Multimodal LLM (MLLM). Typically, MLLMs
+adopt a simple MLP to preserve all visual contexts via one-to-one
+transformation. However, the visual tokens are redundant and can be
+considerably increased when dealing with high-resolution images, impairing the
+efficiency of MLLMs significantly. Some recent works have introduced resampler
+or abstractor to reduce the number of resulting visual tokens. Unfortunately,
+they fail to capture finer details and undermine the visual reasoning
+capabilities of MLLMs. In this work, we propose a novel visual projector, which
+adopts a coarse-to-fine scheme to inject the enriched characteristics to
+generate the condensed visual tokens. Specifically, we first interpolate the
+visual features as a low-resolution point query, providing the overall visual
+representation as the foundation. Then, we introduce a region-to-point
+injection module that utilizes high-resolution, multi-level region-based cues
+as fine-grained reference keys and values, allowing them to be fully absorbed
+within the corresponding local context region. This step effectively updates
+the coarse point query, transforming it into an enriched one for the subsequent
+LLM reasoning. Extensive experiments demonstrate that our approach compresses
+the visual tokens by 75%-89%, while achieving comparable or even better
+performance across diverse benchmarks with significantly higher efficiency. The
+source codes can be found at https://github.com/CircleRadon/TokenPacker.
+
+**comment:** *16 pages, Codes:https://github.com/CircleRadon/TokenPacker*
+
+**published:** *2024-07-02 16:10:55*, **updated:** *2024-07-22 12:55:46*
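+
+A rough PyTorch sketch of the coarse-to-fine projection described above:
+features are downsampled into a low-resolution point query that then absorbs
+high-resolution region features through cross-attention. The shapes, the 2x
+downsampling factor, and the single global attention layer are simplifying
+assumptions, not the paper's exact design.
+
+```python
+# Coarse point query attends to fine-grained region features.
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+
+class CoarseToFineProjector(nn.Module):
+    def __init__(self, dim: int = 1024, down: int = 2):
+        super().__init__()
+        self.down = down
+        self.inject = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
+
+    def forward(self, feat: torch.Tensor) -> torch.Tensor:
+        # feat: (B, H, W, C) high-resolution visual features.
+        b, h, w, c = feat.shape
+        grid = feat.permute(0, 3, 1, 2)                       # (B, C, H, W)
+        coarse = F.interpolate(grid, scale_factor=1 / self.down,
+                               mode="bilinear", align_corners=False)
+        query = coarse.flatten(2).transpose(1, 2)             # low-res point query
+        keys = feat.reshape(b, h * w, c)                      # fine-grained regions
+        out, _ = self.inject(query, keys, keys)               # region-to-point injection
+        return out                                            # condensed visual tokens
+
+
+tokens = CoarseToFineProjector()(torch.randn(2, 24, 24, 1024))
+print(tokens.shape)  # torch.Size([2, 144, 1024])
+```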
+
+
+
+## [HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning](https://arxiv.org/pdf/2403.12884) [New]
+
+*Fucai Ke, Zhixi Cai, Simindokht Jahangard, Weiqing Wang, Pari Delir Haghighi, Hamid Rezatofighi*
+
+**Abstract:** Recent advances in visual reasoning (VR), particularly with the aid of Large
+Vision-Language Models (VLMs), show promise but require access to large-scale
+datasets and face challenges such as high computational costs and limited
+generalization capabilities. Compositional visual reasoning approaches have
+emerged as effective strategies; however, they heavily rely on the commonsense
+knowledge encoded in Large Language Models (LLMs) to perform planning,
+reasoning, or both, without considering the effect of their decisions on the
+visual reasoning process, which can lead to errors or failed procedures. To
+address these challenges, we introduce HYDRA, a multi-stage dynamic
+compositional visual reasoning framework designed for reliable and
+incrementally progressive general reasoning. HYDRA integrates three essential
+modules: a planner, a Reinforcement Learning (RL) agent serving as a cognitive
+controller, and a reasoner. The planner and reasoner modules utilize an LLM to
+generate instruction samples and executable code from the selected instruction,
+respectively, while the RL agent dynamically interacts with these modules,
+making high-level decisions on selection of the best instruction sample given
+information from the historical state stored through a feedback loop. This
+adaptable design enables HYDRA to adjust its actions based on previous feedback
+received during the reasoning process, leading to more reliable reasoning
+outputs and ultimately enhancing its overall effectiveness. Our framework
+demonstrates state-of-the-art performance in various VR tasks on four different
+widely-used datasets.
+
+**comment:** *Accepted by ECCV2024. Project page: https://hydra-vl4ai.github.io/*
+
+**published:** *2024-03-19 16:31:30*, **updated:** *2024-07-21 08:48:55*
+
+
+
+## [Can VLMs be used on videos for action recognition? LLMs are Visual Reasoning Coordinators](https://arxiv.org/pdf/2407.14834) [New]
+
+*Harsh Lunia*
+
+**Abstract:** Recent advancements have introduced multiple vision-language models (VLMs)
+demonstrating impressive commonsense reasoning across various domains. Despite
+their individual capabilities, the potential of synergizing these complementary
+VLMs remains underexplored. The Cola Framework addresses this by showcasing how
+a large language model (LLM) can efficiently coordinate multiple VLMs through
+natural language communication, leveraging their distinct strengths. We have
+verified this claim on the challenging A-OKVQA dataset, confirming the
+effectiveness of such coordination. Building on this, our study investigates
+whether the same methodology can be applied to surveillance videos for action
+recognition. Specifically, we explore if leveraging the combined knowledge base
+of VLMs and LLM can effectively deduce actions from a video when presented with
+only a few selectively important frames and minimal temporal information. Our
+experiments demonstrate that LLM, when coordinating different VLMs, can
+successfully recognize patterns and deduce actions in various scenarios despite
+the weak temporal signals. However, our findings suggest that to enhance this
+approach as a viable alternative solution, integrating a stronger temporal
+signal and exposing the models to slightly more frames would be beneficial.
+
+**comment:** *LLMs, VLMs, Action Recognition*
+
+**published:** *2024-07-20 10:26:28*, **updated:** *2024-07-20 10:26:28*
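+
+The coordination recipe can be sketched as follows: each VLM captions a few
+selected frames and an LLM merges the observations into an action label. All
+model calls are hypothetical stubs to be replaced with real VLM/LLM backends;
+the prompt wording is an assumption.
+
+```python
+# LLM coordinates several VLMs over a handful of key frames.
+from typing import Callable, List
+
+VLM = Callable[[bytes, str], str]   # (frame_bytes, question) -> caption
+LLM = Callable[[str], str]          # prompt -> answer
+
+
+def recognize_action(frames: List[bytes], vlms: List[VLM], llm: LLM) -> str:
+    observations = []
+    for t, frame in enumerate(frames):                 # a few key frames only
+        for i, vlm in enumerate(vlms):
+            caption = vlm(frame, "What is the person doing?")
+            observations.append(f"frame {t}, VLM-{i}: {caption}")
+    prompt = ("You coordinate several vision-language models.\n"
+              + "\n".join(observations)
+              + "\nBased on these observations, name the action in the video.")
+    return llm(prompt)
+```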
+
+
+
+## [I Know About "Up"! Enhancing Spatial Reasoning in Visual Language Models Through 3D Reconstruction](https://arxiv.org/pdf/2407.14133) [New]
+
+*Zaiqiao Meng, Hao Zhou, Yifang Chen*
+
+**Abstract:** Visual Language Models (VLMs) are essential for various tasks, particularly
+visual reasoning tasks, due to their robust multi-modal information
+integration, visual reasoning capabilities, and contextual awareness. However,
+existing VLMs' visual spatial reasoning capabilities are often inadequate,
+struggling even with basic tasks such as distinguishing left from right. To
+address this, we propose the ZeroVLM model, designed to enhance the visual
+spatial reasoning abilities of VLMs. ZeroVLM employs Zero-1-to-3, a 3D
+reconstruction model for obtaining different views of the input images and
+incorporates a prompting mechanism to further improve visual spatial reasoning.
+Experimental results on four visual spatial reasoning datasets show that our
+ZeroVLM achieves up to 19.48% accuracy improvement, which indicates the
+effectiveness of the 3D reconstruction and prompting mechanisms of our ZeroVLM.
+
+**published:** *2024-07-19 09:03:30*, **updated:** *2024-07-19 09:03:30*
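+
+A sketch of the pipeline shape described above: synthesize extra views of the
+input image with a 3D reconstruction model, then prompt a VLM with all views
+plus an explicit spatial cue. `novel_views` and `vlm` are hypothetical
+placeholders, not the actual Zero-1-to-3 or VLM interfaces.
+
+```python
+# Multi-view prompting for spatial questions.
+from typing import Callable, List
+
+Image = bytes  # stand-in type for an encoded image
+
+
+def novel_views(image: Image, azimuths=(30, -30, 90)) -> List[Image]:
+    raise NotImplementedError("call a 3D reconstruction / novel-view model here")
+
+
+def spatially_grounded_answer(image: Image, question: str,
+                              vlm: Callable[[List[Image], str], str]) -> str:
+    views = [image] + novel_views(image)
+    prompt = ("The first image is the original scene; the others are the same "
+              "scene rendered from different viewpoints. Think about the 3D "
+              f"layout before answering.\nQuestion: {question}")
+    return vlm(views, prompt)
+```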
+
+
+
+## [RCA: Region Conditioned Adaptation for Visual Abductive Reasoning](https://arxiv.org/pdf/2303.10428) [New]
+
+*Hao Zhang, Yeo Keat Ee, Basura Fernando*
+
+**Abstract:** Visual abductive reasoning aims to make likely explanations for visual
+observations. We propose a simple yet effective Region Conditioned Adaptation,
+a hybrid parameter-efficient fine-tuning method that equips the frozen CLIP
+with the ability to infer explanations from local visual cues. We encode
+"local hints" and "global contexts" into visual prompts of the CLIP model
+separately at fine and coarse-grained levels. Adapters are used for fine-tuning
+CLIP models for downstream tasks and we design a new attention adapter, that
+directly steers the focus of the attention map with trainable query and key
+projections of a frozen CLIP model. Finally, we train our new model with a
+modified contrastive loss to regress the visual feature simultaneously toward
+features of literal description and plausible explanations. The loss enables
+CLIP to maintain both perception and reasoning abilities. Experiments on the
+Sherlock visual abductive reasoning benchmark show that RCA significantly
+outperforms previous SOTAs, ranking 1st on the leaderboards (e.g., Human
+Acc: RCA 31.74 vs CPT-CLIP 29.58, higher is better). We also validate that
+RCA is generalizable to local perception benchmarks like RefCOCO. We
+open-source our project at
+https://github.com/LUNAProject22/RPA.
+
+**comment:** *13 pages, 11 figures, ACM Multimedia 2024*
+
+**published:** *2023-03-18 14:46:44*, **updated:** *2024-07-19 04:52:07*
+
+
+
+## [X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs](https://arxiv.org/pdf/2407.13851) [New]
+
+*Sirnam Swetha, Jinyu Yang, Tal Neiman, Mamshad Nayeem Rizve, Son Tran, Benjamin Yao, Trishul Chilimbi, Mubarak Shah*
+
+**Abstract:** Recent advancements in Multimodal Large Language Models (MLLMs) have
+revolutionized the field of vision-language understanding by integrating visual
+perception capabilities into Large Language Models (LLMs). The prevailing trend
+in this field involves the utilization of a vision encoder derived from
+vision-language contrastive learning (CL), showing expertise in capturing
+overall representations while facing difficulties in capturing detailed local
+patterns. In this work, we focus on enhancing the visual representations for
+MLLMs by combining high-frequency and detailed visual representations, obtained
+through masked image modeling (MIM), with semantically-enriched low-frequency
+representations captured by CL. To achieve this goal, we introduce X-Former
+which is a lightweight transformer module designed to exploit the complementary
+strengths of CL and MIM through an innovative interaction mechanism.
+Specifically, X-Former first bootstraps vision-language representation learning
+and multimodal-to-multimodal generative learning from two frozen vision
+encoders, i.e., CLIP-ViT (CL-based) and MAE-ViT (MIM-based). It further
+bootstraps vision-to-language generative learning from a frozen LLM to ensure
+visual features from X-Former can be interpreted by the LLM. To demonstrate the
+effectiveness of our approach, we assess its performance on tasks demanding
+detailed visual understanding. Extensive evaluations indicate that X-Former
+excels in visual reasoning tasks involving both structural and semantic
+categories in the GQA dataset. Assessment on fine-grained visual perception
+benchmark further confirms its superior capabilities in visual understanding.
+
+**comment:** *Accepted at ECCV2024*
+
+**published:** *2024-07-18 18:39:54*, **updated:** *2024-07-18 18:39:54*
+
+
+
+## [Open-World Visual Reasoning by a Neuro-Symbolic Program of Zero-Shot Symbols](https://arxiv.org/pdf/2407.13382) [New]
+
+*Gertjan Burghouts, Fieke Hillerström, Erwin Walraven, Michael van Bekkum, Frank Ruis, Joris Sijs, Jelle van Mil, Judith Dijk*
+
+**Abstract:** We consider the problem of finding spatial configurations of multiple objects
+in images, e.g., a mobile inspection robot is tasked to localize abandoned
+tools on the floor. We define the spatial configuration of objects by
+first-order logic in terms of relations and attributes. A neuro-symbolic
+program matches the logic formulas to probabilistic object proposals for the
+given image, provided by language-vision models by querying them for the
+symbols. This work is the first to combine neuro-symbolic programming
+(reasoning) and language-vision models (learning) to find spatial
+configurations of objects in images in an open world setting. We show the
+effectiveness by finding abandoned tools on floors and leaking pipes. We find
+that most prediction errors are due to biases in the language-vision model.
+
+**comment:** *12 pages*
+
+**published:** *2024-07-18 10:40:22*, **updated:** *2024-07-18 10:40:22*
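+
+A small sketch of matching a first-order formula against probabilistic object
+proposals: each proposal carries per-symbol probabilities (assumed here to
+come from querying a language-vision model), and pairs of proposals are scored
+under a soft "tool near floor" conjunction. The predicate, the soft scoring,
+and the example numbers are illustrative, not the paper's program.
+
+```python
+# Score "exists x, y: tool(x) AND floor(y) AND near(x, y)" over proposals.
+from dataclasses import dataclass, field
+from itertools import product
+
+
+@dataclass
+class Proposal:
+    box: tuple                                   # (x1, y1, x2, y2) pixel coords
+    scores: dict = field(default_factory=dict)   # symbol -> probability from a VLM query
+
+
+def centre(p: Proposal):
+    return ((p.box[0] + p.box[2]) / 2, (p.box[1] + p.box[3]) / 2)
+
+
+def near(a: Proposal, b: Proposal, max_dist: float = 200.0) -> float:
+    (ax, ay), (bx, by) = centre(a), centre(b)
+    dist = ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5
+    return max(0.0, 1.0 - dist / max_dist)       # soft truth value in [0, 1]
+
+
+def find_abandoned_tool(proposals):
+    candidates = [
+        (p.scores.get("tool", 0.0) * q.scores.get("floor", 0.0) * near(p, q), p, q)
+        for p, q in product(proposals, repeat=2) if p is not q
+    ]
+    return max(candidates, key=lambda c: c[0], default=(0.0, None, None))
+
+
+props = [Proposal((10, 200, 60, 240), {"tool": 0.9}),
+         Proposal((0, 180, 120, 300), {"floor": 0.8})]
+score, tool, floor = find_abandoned_tool(props)
+```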
+
+
+
+## [ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models](https://arxiv.org/pdf/2401.13311) [New]
+
+*Rohan Wadhawan, Hritik Bansal, Kai-Wei Chang, Nanyun Peng*
+
+**Abstract:** Many real-world tasks require an agent to reason jointly over text and visual
+objects (e.g., navigating in public spaces), which we refer to as
+context-sensitive text-rich visual reasoning. Specifically, these tasks require
+an understanding of the context in which the text interacts with visual
+elements within an image. However, there is a lack of existing datasets to
+benchmark the state-of-the-art multimodal models' capability on
+context-sensitive text-rich visual reasoning. In this paper, we introduce
+ConTextual, a novel dataset featuring human-crafted instructions that require
+context-sensitive reasoning for text-rich images. We conduct experiments to
+assess the performance of 14 foundation models (GPT-4V, Gemini-Pro-Vision,
+LLaVA-Next) and establish a human performance baseline. Further, we perform
+human evaluations of the model responses and observe a significant performance
+gap of 30.8% between GPT-4V (the current best-performing Large Multimodal
+Model) and human performance. Our fine-grained analysis reveals that GPT-4V
+encounters difficulties interpreting time-related data and infographics.
+However, it demonstrates proficiency in comprehending abstract visual contexts
+such as memes and quotes. Finally, our qualitative analysis uncovers various
+factors contributing to poor performance including lack of precise visual
+perception and hallucinations. Our dataset, code, and leaderboard can be found
+on the project page https://con-textual.github.io/
+
+**published:** *2024-01-24 09:07:11*, **updated:** *2024-07-16 03:36:29*
+
+
+
+## [NTSEBENCH: Cognitive Reasoning Benchmark for Vision Language Models](https://arxiv.org/pdf/2407.10380) [New]
+
+*Pranshu Pandya, Agney S Talwarr, Vatsal Gupta, Tushar Kataria, Vivek Gupta, Dan Roth*
+
+**Abstract:** Cognitive textual and visual reasoning tasks, such as puzzles, series, and
+analogies, demand the ability to quickly reason, decipher, and evaluate
+patterns both textually and spatially. While LLMs and VLMs, through extensive
+training on large amounts of human-curated data, have attained a high level of
+pseudo-human intelligence in some common sense reasoning tasks, they still
+struggle with more complex reasoning tasks that require cognitive
+understanding. In this work, we introduce a new dataset, NTSEBench, designed to
+evaluate the cognitive multi-modal reasoning and problem-solving skills of
+large models. The dataset comprises 2,728 multiple-choice questions comprising
+of a total of 4,642 images across 26 categories sampled from the NTSE
+examination conducted nationwide in India, featuring both visual and textual
+general aptitude questions that do not rely on rote learning. We establish
+baselines on the dataset using state-of-the-art LLMs and VLMs. To facilitate a
+comparison between open source and propriety models, we propose four distinct
+modeling strategies to handle different modalities (text and images) in the
+dataset instances.
+
+**comment:** *15 pages, 2 figures, 5 tables*
+
+**published:** *2024-07-15 01:21:56*, **updated:** *2024-07-15 01:21:56*
+
+
+
+## [Affordance-Guided Reinforcement Learning via Visual Prompting](https://arxiv.org/pdf/2407.10341) [New]
+
+*Olivia Y. Lee, Annie Xie, Kuan Fang, Karl Pertsch, Chelsea Finn*
+
+**Abstract:** Robots equipped with reinforcement learning (RL) have the potential to learn
+a wide range of skills solely from a reward signal. However, obtaining a robust
+and dense reward signal for general manipulation tasks remains a challenge.
+Existing learning-based approaches require significant data, such as
+demonstrations or examples of success and failure, to learn task-specific
+reward functions. Recently, there has also been growing adoption of large
+multi-modal foundation models for robotics. These models can perform visual
+reasoning in physical contexts and generate coarse robot motions for various
+manipulation tasks. Motivated by this range of capability, in this work, we
+propose and study rewards shaped by vision-language models (VLMs).
+State-of-the-art VLMs have demonstrated an impressive ability to reason about
+affordances through keypoints in zero-shot, and we leverage this to define
+dense rewards for robotic learning. On a real-world manipulation task specified
+by natural language description, we find that these rewards improve the sample
+efficiency of autonomous RL and enable successful completion of the task in 20K
+online finetuning steps. Additionally, we demonstrate the robustness of the
+approach to reductions in the number of in-domain demonstrations used for
+pretraining, reaching comparable performance in 35K online finetuning steps.
+
+**comment:** *15 pages, 9 figures. Robotics: Science and Systems (RSS) 2024, Task
+ Specification for General-Purpose Intelligent Robots & Lifelong Robot
+ Learning Workshops*
+
+**published:** *2024-07-14 21:41:29*, **updated:** *2024-07-14 21:41:29*
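+
+The reward-shaping idea can be sketched as a dense reward that decays with the
+distance between the end-effector and a VLM-proposed affordance keypoint;
+`query_vlm_for_keypoint`, the exponential shaping, and the scale factor are
+assumptions, not the paper's exact formulation.
+
+```python
+# Dense reward from a VLM-predicted affordance keypoint.
+import numpy as np
+
+
+def query_vlm_for_keypoint(image: np.ndarray, task: str) -> np.ndarray:
+    """Ask a VLM to mark where the gripper should go; returns an (x, y, z) point."""
+    raise NotImplementedError("plug in a keypoint-capable VLM here")
+
+
+def affordance_reward(ee_pos: np.ndarray, keypoint: np.ndarray,
+                      scale: float = 10.0) -> float:
+    # Bounded reward: 1 at the keypoint, decaying smoothly with distance.
+    return float(np.exp(-scale * np.linalg.norm(ee_pos - keypoint)))
+
+
+# Usage inside an RL loop (environment observation keys assumed):
+#   keypoint = query_vlm_for_keypoint(obs["image"], "put the mug on the shelf")
+#   r = affordance_reward(obs["ee_pos"], keypoint)
+```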
+
+
+
+## [Leveraging Large Language Models for Scalable Vector Graphics-Driven Image Understanding](https://arxiv.org/pdf/2306.06094) [New]
+
+*Mu Cai, Zeyi Huang, Yuheng Li, Utkarsh Ojha, Haohan Wang, Yong Jae Lee*
+
+**Abstract:** Large language models (LLMs) have made significant advancements in natural
+language understanding. However, given the enormous semantic representation
+that the LLM has learned, is it somehow possible for it to understand images as
+well? This work investigates this question. To enable the LLM to process
+images, we convert them into a representation given by Scalable Vector Graphics
+(SVG). To study what the LLM can do with this XML-based textual description of
+images, we test the LLM on three broad computer vision tasks: (i) visual
+reasoning and question answering, (ii) image classification under distribution
+shift, few-shot learning, and (iii) generating new images using visual
+prompting. Even though we do not naturally associate LLMs with any visual
+understanding capabilities, our results indicate that the LLM can often do a
+decent job in many of these tasks, potentially opening new avenues for research
+into LLMs' ability to understand image data. Our code, data, and models can be
+found here https://github.com/mu-cai/svg-llm.
+
+**published:** *2023-06-09 17:57:01*, **updated:** *2024-07-11 17:59:53*
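+
+A sketch of the SVG-as-text idea: the image is represented as SVG markup and
+handed to a text-only LLM together with the question. The tiny hand-written
+SVG and the `llm` stub are assumptions; the actual work converts real raster
+images into SVG first.
+
+```python
+# Feed an SVG (XML text) description of an image to a text-only LLM.
+def llm(prompt: str) -> str:
+    raise NotImplementedError("plug in any text-only LLM")
+
+
+svg = """<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">
+  <rect x="10" y="10" width="40" height="40" fill="red"/>
+  <circle cx="75" cy="30" r="15" fill="blue"/>
+</svg>"""
+
+question = "How many shapes are in the image, and what are their colors?"
+prompt = f"The following SVG describes an image:\n{svg}\n\nQuestion: {question}"
+# answer = llm(prompt)   # call the LLM backend of your choice
+```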
+
+
+
+## [NODE-Adapter: Neural Ordinary Differential Equations for Better Vision-Language Reasoning](https://arxiv.org/pdf/2407.08672) [New]
+
+*Yi Zhang, Chun-Wun Cheng, Ke Yu, Zhihai He, Carola-Bibiane Schönlieb, Angelica I. Aviles-Rivero*
+
+**Abstract:** In this paper, we consider the problem of prototype-based vision-language
+reasoning. We observe that existing methods encounter three major
+challenges: 1) escalating resource demands and prolonging training times, 2)
+contending with excessive learnable parameters, and 3) fine-tuning based only
+on a single modality. These challenges will hinder their capability to adapt
+Vision-Language Models (VLMs) to downstream tasks. Motivated by this critical
+observation, we propose a novel method called NODE-Adapter, which utilizes
+Neural Ordinary Differential Equations for better vision-language reasoning. To
+fully leverage both visual and textual modalities and estimate class prototypes
+more effectively and accurately, we divide our method into two stages:
+cross-modal prototype construction and cross-modal prototype optimization using
+neural ordinary differential equations. Specifically, we exploit VLM to encode
+hand-crafted prompts into textual features and few-shot support images into
+visual features. Then, we estimate the textual prototype and visual prototype
+by averaging the textual features and visual features, respectively, and
+adaptively combine the textual prototype and visual prototype to construct the
+cross-modal prototype. To alleviate the prototype bias, we then model the
+prototype optimization process as an initial value problem with Neural ODEs to
+estimate the continuous gradient flow. Our extensive experimental results,
+which cover few-shot classification, domain generalization, and visual
+reasoning on human-object interaction, demonstrate that the proposed method
+significantly outperforms existing state-of-the-art approaches.
+
+**published:** *2024-07-11 17:04:19*, **updated:** *2024-07-11 17:04:19*
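+
+A sketch of the cross-modal prototype construction step: average text features
+of hand-crafted prompts and visual features of few-shot images, then combine
+them with an adaptive weight. The Neural-ODE refinement is reduced here to a
+single Euler step as a stand-in, so this illustrates the idea rather than the
+paper's method.
+
+```python
+# Cross-modal prototype construction with a toy Euler refinement step.
+import torch
+import torch.nn.functional as F
+
+
+def cross_modal_prototype(text_feats: torch.Tensor,   # (n_prompts, d)
+                          image_feats: torch.Tensor,  # (n_shots, d)
+                          alpha: torch.Tensor) -> torch.Tensor:
+    t_proto = F.normalize(text_feats.mean(dim=0), dim=-1)
+    v_proto = F.normalize(image_feats.mean(dim=0), dim=-1)
+    w = torch.sigmoid(alpha)                           # adaptive combination weight
+    return F.normalize(w * t_proto + (1 - w) * v_proto, dim=-1)
+
+
+def euler_refine(proto: torch.Tensor, vector_field, dt: float = 0.1) -> torch.Tensor:
+    # One explicit Euler step of d(proto)/dt = vector_field(proto),
+    # standing in for the continuous gradient flow solved with Neural ODEs.
+    return proto + dt * vector_field(proto)
+
+
+d = 512
+proto = cross_modal_prototype(torch.randn(4, d), torch.randn(16, d),
+                              alpha=torch.zeros(1))
+proto = euler_refine(proto, vector_field=lambda p: -p)  # toy vector field
+```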
+
+
+
+## [Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models](https://arxiv.org/pdf/2406.09403) [New]
+
+*Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, Ranjay Krishna*
+
+**Abstract:** Humans draw to facilitate reasoning: we draw auxiliary lines when solving
+geometry problems; we mark and circle when reasoning on maps; we use sketches
+to amplify our ideas and relieve our limited-capacity working memory. However,
+such actions are missing in current multimodal language models (LMs). Current
+chain-of-thought and tool-use paradigms only use text as intermediate reasoning
+steps. In this work, we introduce Sketchpad, a framework that gives multimodal
+LMs a visual sketchpad and tools to draw on the sketchpad. The LM conducts
+planning and reasoning according to the visual artifacts it has drawn.
+Different from prior work, which uses text-to-image models to enable LMs to
+draw, Sketchpad enables LMs to draw with lines, boxes, marks, etc., which is
+closer to human sketching and better facilitates reasoning. Sketchpad can also
+use specialist vision models during the sketching process (e.g., draw bounding
+boxes with object detection models, draw masks with segmentation models), to
+further enhance visual perception and reasoning. We experiment with a wide
+range of math tasks (including geometry, functions, graphs, and chess) and
+complex visual reasoning tasks. Sketchpad substantially improves performance on
+all tasks over strong base models with no sketching, yielding an average gain
+of 12.7% on math tasks, and 8.6% on vision tasks. GPT-4o with Sketchpad sets a
+new state of the art on all tasks, including V*Bench (80.3%), BLINK spatial
+reasoning (83.9%), and visual correspondence (80.8%). All codes and data are in
+https://visualsketchpad.github.io/.
+
+**comment:** *Project and codes url: https://visualsketchpad.github.io/*
+
+**published:** *2024-06-13 17:59:31*, **updated:** *2024-07-10 18:09:56*
+
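+A rough sketch of the plan-draw-observe loop the abstract describes; the `lm`
+callable, tool names, and message format below are assumptions standing in for
+the actual Sketchpad tooling.
+
+```python
+from dataclasses import dataclass, field
+from typing import Callable, Dict, List
+
+@dataclass
+class Sketchpad:
+    """Accumulates the visual artifacts (lines, boxes, masks) the LM draws."""
+    artifacts: List[str] = field(default_factory=list)
+
+    def draw(self, primitive: str, **kwargs) -> str:
+        self.artifacts.append(f"{primitive}({kwargs})")
+        return self.artifacts[-1]
+
+def sketchpad_loop(lm: Callable[[str], Dict], tools: Dict[str, Callable],
+                   task: str, max_steps: int = 5) -> str:
+    """Alternate between LM planning and tool execution until the LM answers.
+
+    `lm` is a hypothetical callable returning either
+    {"action": tool_name, "args": {...}} or {"answer": "..."}."""
+    pad, context = Sketchpad(), task
+    for _ in range(max_steps):
+        step = lm(context)
+        if "answer" in step:
+            return step["answer"]
+        artifact = tools[step["action"]](pad, **step.get("args", {}))
+        context += f"\n[observed: {artifact}]"  # feed the drawing back to the LM
+    return "no answer within budget"
+
+# Dummy wiring: scripted "LM" turns plus placeholder drawing tools.
+tools = {"draw_line": lambda pad, **kw: pad.draw("line", **kw),
+         "detect_objects": lambda pad, **kw: pad.draw("boxes", **kw)}
+scripted = iter([{"action": "draw_line", "args": {"x0": 0, "y0": 0, "x1": 1, "y1": 1}},
+                 {"answer": "the segments intersect"}])
+print(sketchpad_loop(lambda ctx: next(scripted), tools, task="Do the segments cross?"))
+```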
+
+
+## [Funny-Valen-Tine: Planning Solution Distribution Enhances Machine Abstract Reasoning Ability](https://arxiv.org/pdf/2407.02688) [New]
+
+*Ruizhuo Song, Beiming Yuan*
+
+**Abstract:** Visual abstract reasoning problems hold immense importance in the field of
+image processing. Both Bongard-Logo and Raven's Progressive Matrices (RPM)
+belong to this domain, with Bongard-Logo categorized as image clustering
+reasoning and RPM involving image progression pattern reasoning. This paper
+introduces Valen, a novel baseline model within the family of probabilistic
+highlighting models. Valen exhibits remarkable performance in solving both RPM
+and Bongard-Logo problems, offering a versatile solution. Our investigation
+into the underlying mechanisms of probability-highlighting solvers reveals
+that they approximate solutions to reasoning problem instances as
+distributions delineated by primary and auxiliary samples. We propose that the
+learning objective is not the distribution of correct solutions but one
+defined by both primary and auxiliary samples. To bridge this discrepancy, we
+introduce the Tine method, an adversarial learning-based approach that assists
+Valen in estimating a solution distribution closer to the correct one, albeit
+with issues such as unstable training. Reflecting on Tine, we propose modeling
+the sample distribution of reasoning problems as a mixture of Gaussian
+distributions, leading to the Funny method. This effectively enables Valen to
+capture the true form of the correct solution distribution. Furthermore, we
+design the SBR method to model the distribution of progressive-pattern
+representations in a similar way. Overall, the Funny, Tine, and SBR methods
+significantly improve Valen's performance, providing new ideas and methods for
+studying visual abstract reasoning problems.
+
+**comment:** *14 pages, 20 figures, 3 tables*
+
+**published:** *2024-07-02 22:04:20*, **updated:** *2024-07-07 12:25:33*
+
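+To make the mixture-of-Gaussians idea above concrete, here is a small
+scikit-learn illustration; the embeddings and component count are placeholders
+rather than the paper's actual setup.
+
+```python
+import numpy as np
+from sklearn.mixture import GaussianMixture
+
+# Placeholder embeddings: "primary" samples near the correct solution and
+# "auxiliary" samples elsewhere, mirroring the distinction in the abstract.
+rng = np.random.default_rng(0)
+primary = rng.normal(loc=0.0, scale=0.5, size=(200, 16))
+auxiliary = rng.normal(loc=2.0, scale=0.7, size=(200, 16))
+samples = np.vstack([primary, auxiliary])
+
+# A 2-component mixture lets each mode specialize instead of forcing one
+# Gaussian over the combined primary + auxiliary distribution.
+gmm = GaussianMixture(n_components=2, covariance_type="diag", random_state=0)
+gmm.fit(samples)
+
+# Score a new candidate: high responsibility for the "primary" component
+# suggests it lies close to the correct-solution distribution.
+candidate = rng.normal(loc=0.0, scale=0.5, size=(1, 16))
+print(gmm.predict_proba(candidate))
+```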
+
+
+## [We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?](https://arxiv.org/pdf/2407.01284) [New]
+
+*Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, Runfeng Qiao, Yifan Zhang, Xiao Zong, Yida Xu, Muxi Diao, Zhimin Bao, Chen Li, Honggang Zhang*
+
+**Abstract:** Visual mathematical reasoning, as a fundamental visual reasoning ability, has
+received widespread attention from the Large Multimodal Models (LMMs)
+community. Existing benchmarks, such as MathVista and MathVerse, focus more on
+the result-oriented performance but neglect the underlying principles in
+knowledge acquisition and generalization. Inspired by human-like mathematical
+reasoning, we introduce WE-MATH, the first benchmark specifically designed to
+explore the problem-solving principles beyond end-to-end performance. We
+meticulously collect and categorize 6.5K visual math problems, spanning 67
+hierarchical knowledge concepts and five layers of knowledge granularity. We
+decompose composite problems into sub-problems according to the required
+knowledge concepts and introduce a novel four-dimensional metric, namely
+Insufficient Knowledge (IK), Inadequate Generalization (IG), Complete Mastery
+(CM), and Rote Memorization (RM), to hierarchically assess inherent issues in
+LMMs' reasoning process. With WE-MATH, we conduct a thorough evaluation of
+existing LMMs in visual mathematical reasoning and reveal a negative
+correlation between solving steps and problem-specific performance. We confirm
+that the IK issue of LMMs can be effectively mitigated via knowledge augmentation
+strategies. More notably, the primary challenge of GPT-4o has significantly
+transitioned from IK to IG, establishing it as the first LMM advancing towards
+the knowledge generalization stage. In contrast, other LMMs exhibit a marked
+inclination towards Rote Memorization - they correctly solve composite problems
+involving multiple knowledge concepts yet fail to answer sub-problems. We
+anticipate that WE-MATH will open new pathways for advancements in visual
+mathematical reasoning for LMMs. The WE-MATH data and evaluation code are
+available at https://github.com/We-Math/We-Math.
+
+**comment:** *Work in progress*
+
+**published:** *2024-07-01 13:39:08*, **updated:** *2024-07-01 13:39:08*
+
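+One plausible reading of the IK/IG/CM/RM diagnosis, assuming we only have
+correctness flags for a composite problem and its sub-problems; the rule below
+is an illustration, not WE-MATH's official scoring code.
+
+```python
+from typing import List
+
+def wemath_diagnosis(composite_correct: bool, subs_correct: List[bool]) -> str:
+    """Map one composite problem and its sub-problems to a coarse category.
+
+    IK = Insufficient Knowledge, IG = Inadequate Generalization,
+    CM = Complete Mastery,       RM = Rote Memorization.
+    (Assumed decision rule for illustration only.)"""
+    all_subs_ok = all(subs_correct)
+    if composite_correct and all_subs_ok:
+        return "CM"  # solves the composite and every underlying concept
+    if composite_correct and not all_subs_ok:
+        return "RM"  # composite answered, yet required sub-problems fail
+    if not composite_correct and all_subs_ok:
+        return "IG"  # knows the pieces but cannot compose them
+    return "IK"      # missing at least one required knowledge concept
+
+print(wemath_diagnosis(True, [True, False]))  # -> "RM"
+```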
+
+
+## [Exploring the Potential of Multi-Modal AI for Driving Hazard Prediction](https://arxiv.org/pdf/2310.04671) [New]
+
+*Korawat Charoenpitaks, Van-Quang Nguyen, Masanori Suganuma, Masahiro Takahashi, Ryoma Niihara, Takayuki Okatani*
+
+**Abstract:** This paper addresses the problem of predicting hazards that drivers may
+encounter while driving a car. We formulate it as a task of anticipating
+impending accidents using a single input image captured by car dashcams. Unlike
+existing approaches to driving hazard prediction that rely on computational
+simulations or anomaly detection from videos, this study focuses on high-level
+inference from static images. The problem requires predicting and reasoning about
+future events based on uncertain observations, which falls under visual
+abductive reasoning. To enable research in this understudied area, a new
+dataset named the DHPR (Driving Hazard Prediction and Reasoning) dataset is
+created. The dataset consists of 15K dashcam images of street scenes, and each
+image is associated with a tuple containing car speed, a hypothesized hazard
+description, and visual entities present in the scene. These are annotated by
+human annotators, who identify risky scenes and provide descriptions of
+potential accidents that could occur a few seconds later. We present several
+baseline methods and evaluate their performance on our dataset, identifying
+remaining issues and discussing future directions. This study contributes to
+the field by introducing a novel problem formulation and dataset, enabling
+researchers to explore the potential of multi-modal AI for driving hazard
+prediction.
+
+**comment:** *Main Paper: 11 pages, Supplementary Materials: 25 pages*
+
+**published:** *2023-10-07 03:16:30*, **updated:** *2024-07-01 09:29:39*
+
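+A small sketch of how one DHPR-style record might be represented, based only on
+the fields listed in the abstract (image, car speed, hypothesized hazard,
+visual entities); the field names are assumptions.
+
+```python
+from dataclasses import dataclass
+from typing import List
+
+@dataclass
+class DHPRRecord:
+    """One dashcam scene plus its hazard annotation, per the abstract's tuple."""
+    image_path: str             # dashcam still of the street scene
+    car_speed_kmh: float        # annotated car speed
+    hazard_description: str     # hypothesized accident a few seconds later
+    visual_entities: List[str]  # entities referenced by the hypothesis
+
+example = DHPRRecord(
+    image_path="scenes/000123.jpg",
+    car_speed_kmh=40.0,
+    hazard_description="A cyclist may swerve into the lane from the right.",
+    visual_entities=["cyclist", "parked van"],
+)
+```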
+
+
+## [Slot State Space Models](https://arxiv.org/pdf/2406.12272) [New]
+
+*Jindong Jiang, Fei Deng, Gautam Singh, Minseung Lee, Sungjin Ahn*
+
+**Abstract:** Recent State Space Models (SSMs) such as S4, S5, and Mamba have shown
+remarkable computational benefits in long-range temporal dependency modeling.
+However, in many sequence modeling problems, the underlying process is
+inherently modular and it is of interest to have inductive biases that mimic
+this modular structure. In this paper, we introduce SlotSSMs, a novel framework
+for incorporating independent mechanisms into SSMs to preserve or encourage
+separation of information. Unlike conventional SSMs that maintain a monolithic
+state vector, SlotSSMs maintains the state as a collection of multiple vectors
+called slots. Crucially, the state transitions are performed independently per
+slot with sparse interactions across slots implemented via the bottleneck of
+self-attention. In experiments, we evaluate our model in object-centric video
+understanding, 3D visual reasoning, and video prediction tasks, which involve
+modeling multiple objects and their long-range temporal dependencies. We find
+that our proposed design offers substantial performance gains over existing
+sequence modeling methods.
+
+**published:** *2024-06-18 04:59:14*, **updated:** *2024-06-30 22:25:01*
+
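+A toy version of the slot-structured recurrence sketched above: each slot keeps
+its own state and is updated independently, and slots interact only through a
+self-attention bottleneck. Dimensions and the exact attention form are
+illustrative assumptions, not the paper's parameterization.
+
+```python
+import numpy as np
+
+def softmax(x, axis=-1):
+    x = x - x.max(axis=axis, keepdims=True)
+    e = np.exp(x)
+    return e / e.sum(axis=axis, keepdims=True)
+
+def slot_ssm_step(slots, inputs, A, B, Wq, Wk, Wv):
+    """One timestep for K slots of dimension D.
+
+    slots, inputs: (K, D); A, B: shared per-slot transition/input maps;
+    Wq, Wk, Wv: projections for the sparse cross-slot interaction."""
+    slots = slots @ A.T + inputs @ B.T            # independent per-slot dynamics
+    q, k, v = slots @ Wq, slots @ Wk, slots @ Wv  # cross-slot mixing
+    attn = softmax(q @ k.T / np.sqrt(slots.shape[-1]))
+    return slots + attn @ v                       # residual slot interaction
+
+rng = np.random.default_rng(0)
+K, D = 4, 8
+slots = np.zeros((K, D))
+params = [0.1 * rng.normal(size=(D, D)) for _ in range(5)]
+for t in range(10):                               # roll the recurrence over time
+    slots = slot_ssm_step(slots, rng.normal(size=(K, D)), *params)
+print(slots.shape)  # (4, 8)
+```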
+
+
+## [From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis](https://arxiv.org/pdf/2406.19934)
*Chuanqi Cheng, Jian Guan, Wei Wu, Rui Yan*
@@ -39,7 +803,7 @@ are available at https://github.com/steven-ccq/VisualReasoner.
-## [MMRo: Are Multimodal LLMs Eligible as the Brain for In-Home Robotics?](https://arxiv.org/pdf/2406.19693) [New]
+## [MMRo: Are Multimodal LLMs Eligible as the Brain for In-Home Robotics?](https://arxiv.org/pdf/2406.19693)
*Jinming Li, Yichen Zhu, Zhiyuan Xu, Jindong Gu, Minjie Zhu, Xin Liu, Ning Liu, Yaxin Peng, Feifei Feng, Jian Tang*
@@ -70,7 +834,7 @@ https://mm-robobench.github.io/.
-## [VDebugger: Harnessing Execution Feedback for Debugging Visual Programs](https://arxiv.org/pdf/2406.13444) [New]
+## [VDebugger: Harnessing Execution Feedback for Debugging Visual Programs](https://arxiv.org/pdf/2406.13444)
*Xueqing Wu, Zongyu Lin, Songyan Zhao, Te-Lin Wu, Pan Lu, Nanyun Peng, Kai-Wei Chang*
@@ -98,7 +862,7 @@ models are made publicly available at https://github.com/shirley-wu/vdebugger/
-## [Muffin or Chihuahua? Challenging Multimodal Large Language Models with Multipanel VQA](https://arxiv.org/pdf/2401.15847) [New]
+## [Muffin or Chihuahua? Challenging Multimodal Large Language Models with Multipanel VQA](https://arxiv.org/pdf/2401.15847)
*Yue Fan, Jing Gu, Kaiwen Zhou, Qianqi Yan, Shan Jiang, Ching-Chen Kuo, Xinze Guan, Xin Eric Wang*
@@ -129,7 +893,7 @@ data are released at https://sites.google.com/view/multipanelvqa/home.
-## [Think Step by Step: Chain-of-Gesture Prompting for Error Detection in Robotic Surgical Videos](https://arxiv.org/pdf/2406.19217) [New]
+## [Think Step by Step: Chain-of-Gesture Prompting for Error Detection in Robotic Surgical Videos](https://arxiv.org/pdf/2406.19217)
*Zhimin Shao, Jialang Xu, Danail Stoyanov, Evangelos B. Mazomenos, Yueming Jin*
@@ -163,7 +927,7 @@ available.
-## [Selective Vision is the Challenge for Visual Reasoning: A Benchmark for Visual Argument Understanding](https://arxiv.org/pdf/2406.18925) [New]
+## [Selective Vision is the Challenge for Visual Reasoning: A Benchmark for Visual Argument Understanding](https://arxiv.org/pdf/2406.18925)
*Jiwan Chung, Sungjae Lee, Minseo Kim, Seungju Han, Ashkan Yousefpour, Jack Hessel, Youngjae Yu*
@@ -197,7 +961,7 @@ inputs, for deducing the conclusion of the visual argument.
-## [Disentangling Knowledge-based and Visual Reasoning by Question Decomposition in KB-VQA](https://arxiv.org/pdf/2406.18839) [New]
+## [Disentangling Knowledge-based and Visual Reasoning by Question Decomposition in KB-VQA](https://arxiv.org/pdf/2406.18839)
*Elham J. Barezi, Parisa Kordjamshidi*
@@ -221,31 +985,39 @@ and achieved up to 2% improvement in accuracy.
-## [Slot State Space Models](https://arxiv.org/pdf/2406.12272) [New]
+## [Visual Reasoning and Multi-Agent Approach in Multimodal Large Language Models (MLLMs): Solving TSP and mTSP Combinatorial Challenges](https://arxiv.org/pdf/2407.00092) [New]
-*Jindong Jiang, Fei Deng, Gautam Singh, Minseung Lee, Sungjin Ahn*
+*Mohammed Elhenawy, Ahmad Abutahoun, Taqwa I. Alhadidi, Ahmed Jaber, Huthaifa I. Ashqar, Shadi Jaradat, Ahmed Abdelhay, Sebastien Glaser, Andry Rakotonirainy*
-**Abstract:** Recent State Space Models (SSMs) such as S4, S5, and Mamba have shown
-remarkable computational benefits in long-range temporal dependency modeling.
-However, in many sequence modeling problems, the underlying process is
-inherently modular and it is of interest to have inductive biases that mimic
-this modular structure. In this paper, we introduce SlotSSMs, a novel framework
-for incorporating independent mechanisms into SSMs to preserve or encourage
-separation of information. Unlike conventional SSMs that maintain a monolithic
-state vector, SlotSSMs maintains the state as a collection of multiple vectors
-called slots. Crucially, the state transitions are performed independently per
-slot with sparse interactions across slots implemented via the bottleneck of
-self-attention. In experiments, we evaluate our model in object-centric video
-understanding, 3D visual reasoning, and video prediction tasks, which involve
-modeling multiple objects and their long-range temporal dependencies. We find
-that our proposed design offers substantial performance gains over existing
-sequence modeling methods.
+**Abstract:** Multimodal Large Language Models (MLLMs) harness comprehensive knowledge
+spanning text, images, and audio to adeptly tackle complex problems, including
+zero-shot in-context learning scenarios. This study explores the ability of
+MLLMs to visually solve the Traveling Salesman Problem (TSP) and Multiple
+Traveling Salesman Problem (mTSP) using images that portray point distributions
+on a two-dimensional plane. We introduce a novel approach employing multiple
+specialized agents within the MLLM framework, each dedicated to optimizing
+solutions for these combinatorial challenges. Our experimental investigation
+includes rigorous evaluations across zero-shot settings and introduces
+innovative multi-agent zero-shot in-context scenarios. The results demonstrated
+that both multi-agent models (Multi-Agent 1, which includes the Initializer,
+Critic, and Scorer agents, and Multi-Agent 2, which comprises only the
+Initializer and Critic agents) significantly improved solution quality for TSP
+and mTSP problems. Multi-Agent 1 excelled in environments requiring detailed
+route refinement and evaluation, providing a robust framework for sophisticated
+optimizations. In contrast, Multi-Agent 2, focusing on iterative refinements by
+the Initializer and Critic, proved effective for rapid decision-making
+scenarios. These experiments yield promising outcomes, showcasing the robust
+visual reasoning capabilities of MLLMs in addressing diverse combinatorial
+problems. The findings underscore the potential of MLLMs as powerful tools in
+computational optimization, offering insights that could inspire further
+advancements in this promising field. Project link:
+https://github.com/ahmed-abdulhuy/Solving-TSP-and-mTSP-Combinatorial-Challenges-using-Visual-Reasoning-and-Multi-Agent-Approach-MLLMs-.git
-**published:** *2024-06-18 04:59:14*, **updated:** *2024-06-26 03:04:04*
+**published:** *2024-06-26 07:12:06*, **updated:** *2024-06-26 07:12:06*
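+
+A minimal sketch of the Initializer/Critic/Scorer interplay described above,
+with random-heuristic stand-ins for the MLLM agents and a plain tour-length
+objective; names and the acceptance rule are assumptions.
+
+```python
+import math
+import random
+from typing import Callable, List, Optional, Tuple
+
+Point = Tuple[float, float]
+
+def tour_length(points: List[Point], tour: List[int]) -> float:
+    return sum(math.dist(points[tour[i]], points[tour[(i + 1) % len(tour)]])
+               for i in range(len(tour)))
+
+def multi_agent_tsp(points: List[Point],
+                    initializer: Callable[[List[Point]], List[int]],
+                    critic: Callable[[List[Point], List[int]], List[int]],
+                    scorer: Optional[Callable[[List[Point], List[int]], float]] = None,
+                    rounds: int = 5) -> List[int]:
+    """Initializer proposes a tour, the Critic proposes revisions, and an
+    optional Scorer decides whether to keep each revision (Multi-Agent 1);
+    dropping the Scorer gives the lighter Multi-Agent 2 loop."""
+    tour = initializer(points)
+    for _ in range(rounds):
+        revised = critic(points, tour)
+        if scorer is None or scorer(points, revised) <= scorer(points, tour):
+            tour = revised
+    return tour
+
+# Dummy stand-ins for the MLLM agents: random init plus 2-opt style critiques.
+def init_agent(points): return random.sample(range(len(points)), len(points))
+def critic_agent(points, tour):
+    i, j = sorted(random.sample(range(len(tour)), 2))
+    return tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
+
+pts = [(random.random(), random.random()) for _ in range(12)]
+best = multi_agent_tsp(pts, init_agent, critic_agent, scorer=tour_length, rounds=50)
+print(round(tour_length(pts, best), 3))
+```
+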
-## [Evaluating Visual and Cultural Interpretation: The K-Viscuit Benchmark with Human-VLM Collaboration](https://arxiv.org/pdf/2406.16469) [New]
+## [Evaluating Visual and Cultural Interpretation: The K-Viscuit Benchmark with Human-VLM Collaboration](https://arxiv.org/pdf/2406.16469)
*Yujin Baek, ChaeHun Park, Jaeseok Kim, Yu-Jung Heo, Du-Seong Chang, Jaegul Choo*
@@ -277,7 +1049,7 @@ available.
-## [Beyond the Doors of Perception: Vision Transformers Represent Relations Between Objects](https://arxiv.org/pdf/2406.15955) [New]
+## [Beyond the Doors of Perception: Vision Transformers Represent Relations Between Objects](https://arxiv.org/pdf/2406.15955)
*Michael A. Lepori, Alexa R. Tartaglini, Wai Keen Vong, Thomas Serre, Brenden M. Lake, Ellie Pavlick*
@@ -307,7 +1079,7 @@ rectify shortcomings of existing and future models.
-## [Triple-CFN: Restructuring Concept and Feature Spaces for Enhancing Abstract Reasoning Process](https://arxiv.org/pdf/2403.03190) [New]
+## [Triple-CFN: Restructuring Concept and Feature Spaces for Enhancing Abstract Reasoning Process](https://arxiv.org/pdf/2403.03190)
*Ruizhuo Song, Beiming Yuan*
@@ -337,7 +1109,7 @@ intelligence through innovative network designs for abstract reasoning.
-## [Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities](https://arxiv.org/pdf/2406.14562) [New]
+## [Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities](https://arxiv.org/pdf/2406.14562)
*Sachit Menon, Richard Zemel, Carl Vondrick*
@@ -368,7 +1140,7 @@ well as its sources of error.
-## [Improving Visual Commonsense in Language Models via Multiple Image Generation](https://arxiv.org/pdf/2406.13621) [New]
+## [Improving Visual Commonsense in Language Models via Multiple Image Generation](https://arxiv.org/pdf/2406.13621)
*Guy Yariv, Idan Schwartz, Yossi Adi, Sagie Benaim*
@@ -399,7 +1171,7 @@ https://github.com/guyyariv/vLMIG.
-## [GSR-BENCH: A Benchmark for Grounded Spatial Reasoning Evaluation via Multimodal LLMs](https://arxiv.org/pdf/2406.13246) [New]
+## [GSR-BENCH: A Benchmark for Grounded Spatial Reasoning Evaluation via Multimodal LLMs](https://arxiv.org/pdf/2406.13246)
*Navid Rajabi, Jana Kosecka*
@@ -420,7 +1192,7 @@ the scaling laws in this task.
-## [ChartBench: A Benchmark for Complex Visual Reasoning in Charts](https://arxiv.org/pdf/2312.15915) [New]
+## [ChartBench: A Benchmark for Complex Visual Reasoning in Charts](https://arxiv.org/pdf/2312.15915)
*Zhengzhuo Xu, Sinan Du, Yiyan Qi, Chengjin Xu, Chun Yuan, Jian Guo*
@@ -446,7 +1218,7 @@ https://chartbench.github.io.
-## [Beyond Visual Appearances: Privacy-sensitive Objects Identification via Hybrid Graph Reasoning](https://arxiv.org/pdf/2406.12736) [New]
+## [Beyond Visual Appearances: Privacy-sensitive Objects Identification via Hybrid Graph Reasoning](https://arxiv.org/pdf/2406.12736)
*Zhuohang Jiang, Bingkui Tong, Xia Du, Ahmed Alhammadi, Jizhe Zhou*
@@ -479,7 +1251,7 @@ the full abstract, see the original paper.**
-## [RS-GPT4V: A Unified Multimodal Instruction-Following Dataset for Remote Sensing Image Understanding](https://arxiv.org/pdf/2406.12479) [New]
+## [RS-GPT4V: A Unified Multimodal Instruction-Following Dataset for Remote Sensing Image Understanding](https://arxiv.org/pdf/2406.12479)
*Linrui Xu, Ling Zhao, Wang Guo, Qiujun Li, Kewang Long, Kaiqi Zou, Yuhan Wang, Haifeng Li*
@@ -516,7 +1288,7 @@ https://github.com/GeoX-Lab/RS-GPT4V.
-## [Plug-and-Play Grounding of Reasoning in Multimodal Large Language Models](https://arxiv.org/pdf/2403.19322) [New]
+## [Plug-and-Play Grounding of Reasoning in Multimodal Large Language Models](https://arxiv.org/pdf/2403.19322)
*Jiaxing Chen, Yuxuan Liu, Dehu Li, Xiang An, Weimo Deng, Ziyong Feng, Yongle Zhao, Yin Xie*
@@ -543,7 +1315,7 @@ presenting a promising alternative to mere model scaling.
-## [Leveraging VLM-Based Pipelines to Annotate 3D Objects](https://arxiv.org/pdf/2311.17851) [New]
+## [Leveraging VLM-Based Pipelines to Annotate 3D Objects](https://arxiv.org/pdf/2311.17851)
*Rishabh Kabra, Loic Matthey, Alexander Lerchner, Niloy J. Mitra*
@@ -569,7 +1341,7 @@ for 764K objects from the Objaverse dataset.
-## [Beyond Embeddings: The Promise of Visual Table in Visual Reasoning](https://arxiv.org/pdf/2403.18252) [New]
+## [Beyond Embeddings: The Promise of Visual Table in Visual Reasoning](https://arxiv.org/pdf/2403.18252)
*Yiwu Zhong, Zi-Yuan Hu, Michael R. Lyu, Liwei Wang*
@@ -599,7 +1371,7 @@ available at https://github.com/LaVi-Lab/Visual-Table.
-## [ClawMachine: Fetching Visual Tokens as An Entity for Referring and Grounding](https://arxiv.org/pdf/2406.11327) [New]
+## [ClawMachine: Fetching Visual Tokens as An Entity for Referring and Grounding](https://arxiv.org/pdf/2406.11327)
*Tianren Ma, Lingxi Xie, Yunjie Tian, Boyu Yang, Yuan Zhang, David Doermann, Qixiang Ye*
@@ -626,7 +1398,7 @@ hardly perform without specific adaptions.
-## [A Unified View of Abstract Visual Reasoning Problems](https://arxiv.org/pdf/2406.11068) [New]
+## [A Unified View of Abstract Visual Reasoning Problems](https://arxiv.org/pdf/2406.11068)
*Mikołaj Małkiński, Jacek Mańdziuk*
@@ -660,7 +1432,7 @@ reuse in transfer learning and curriculum learning setups.
-## [Generalization and Knowledge Transfer in Abstract Visual Reasoning Models](https://arxiv.org/pdf/2406.11061) [New]
+## [Generalization and Knowledge Transfer in Abstract Visual Reasoning Models](https://arxiv.org/pdf/2406.11061)
*Mikołaj Małkiński, Jacek Mańdziuk*
@@ -684,36 +1456,7 @@ challenges, as well as the standard I-RAVEN and PGM setups.
-## [ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models](https://arxiv.org/pdf/2401.13311) [New]
-
-*Rohan Wadhawan, Hritik Bansal, Kai-Wei Chang, Nanyun Peng*
-
-**Abstract:** Many real-world tasks require an agent to reason jointly over text and visual
-objects, (e.g., navigating in public spaces), which we refer to as
-context-sensitive text-rich visual reasoning. Specifically, these tasks require
-an understanding of the context in which the text interacts with visual
-elements within an image. However, there is a lack of existing datasets to
-benchmark the state-of-the-art multimodal models' capability on
-context-sensitive text-rich visual reasoning. In this paper, we introduce
-ConTextual, a novel dataset featuring human-crafted instructions that require
-context-sensitive reasoning for text-rich images. We conduct experiments to
-assess the performance of 14 foundation models (GPT-4V, Gemini-Pro-Vision,
-LLaVA-Next) and establish a human performance baseline. Further, we perform
-human evaluations of the model responses and observe a significant performance
-gap of 30.8% between GPT-4V (the current best-performing Large Multimodal
-Model) and human performance. Our fine-grained analysis reveals that GPT-4V
-encounters difficulties interpreting time-related data and infographics.
-However, it demonstrates proficiency in comprehending abstract visual contexts
-such as memes and quotes. Finally, our qualitative analysis uncovers various
-factors contributing to poor performance including lack of precise visual
-perception and hallucinations. Our dataset, code, and leaderboard can be found
-on the project page https://con-textual.github.io/
-
-**published:** *2024-01-24 09:07:11*, **updated:** *2024-06-16 00:38:24*
-
-
-
-## [What is the Visual Cognition Gap between Humans and Multimodal LLMs?](https://arxiv.org/pdf/2406.10424) [New]
+## [What is the Visual Cognition Gap between Humans and Multimodal LLMs?](https://arxiv.org/pdf/2406.10424)
*Xu Cao, Bolin Lai, Wenqian Ye, Yunsheng Ma, Joerg Heintz, Jintai Chen, Jianguo Cao, James M. Rehg*
@@ -742,7 +1485,7 @@ human-like visual cognition abilities.
-## [Neural Concept Binder](https://arxiv.org/pdf/2406.09949) [New]
+## [Neural Concept Binder](https://arxiv.org/pdf/2406.09949)
*Wolfgang Stammer, Antonia Wüst, David Steinmann, Kristian Kersting*
@@ -766,7 +1509,7 @@ dataset.
-## [CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers](https://arxiv.org/pdf/2305.17455) [New]
+## [CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers](https://arxiv.org/pdf/2305.17455)
*Dachuan Shi, Chaofan Tao, Anyi Rao, Zhendong Yang, Chun Yuan, Jiaqi Wang*
@@ -796,39 +1539,7 @@ code is available at https://github.com/sdc17/CrossGET.
-## [Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models](https://arxiv.org/pdf/2406.09403) [New]
-
-*Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, Ranjay Krishna*
-
-**Abstract:** Humans draw to facilitate reasoning: we draw auxiliary lines when solving
-geometry problems; we mark and circle when reasoning on maps; we use sketches
-to amplify our ideas and relieve our limited-capacity working memory. However,
-such actions are missing in current multimodal language models (LMs). Current
-chain-of-thought and tool-use paradigms only use text as intermediate reasoning
-steps. In this work, we introduce Sketchpad, a framework that gives multimodal
-LMs a visual sketchpad and tools to draw on the sketchpad. The LM conducts
-planning and reasoning according to the visual artifacts it has drawn.
-Different from prior work, which uses text-to-image models to enable LMs to
-draw, Sketchpad enables LMs to draw with lines, boxes, marks, etc., which is
-closer to human sketching and better facilitates reasoning. Sketchpad can also
-use specialist vision models during the sketching process (e.g., draw bounding
-boxes with object detection models, draw masks with segmentation models), to
-further enhance visual perception and reasoning. We experiment with a wide
-range of math tasks (including geometry, functions, graphs, and chess) and
-complex visual reasoning tasks. Sketchpad substantially improves performance on
-all tasks over strong base models with no sketching, yielding an average gain
-of 12.7% on math tasks, and 8.6% on vision tasks. GPT-4o with Sketchpad sets a
-new state of the art on all tasks, including V*Bench (80.3%), BLINK spatial
-reasoning (83.9%), and visual correspondence (80.8%). All codes and data are in
-https://visualsketchpad.github.io/.
-
-**comment:** *26 pages*
-
-**published:** *2024-06-13 17:59:31*, **updated:** *2024-06-13 17:59:31*
-
-
-
-## [Comparison Visual Instruction Tuning](https://arxiv.org/pdf/2406.09240) [New]
+## [Comparison Visual Instruction Tuning](https://arxiv.org/pdf/2406.09240)
*Wei Lin, Muhammad Jehanzeb Mirza, Sivan Doveh, Rogerio Feris, Raja Giryes, Sepp Hochreiter, Leonid Karlinsky*
@@ -857,7 +1568,7 @@ assess the CaD understanding abilities of LMMs.
-## [INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance](https://arxiv.org/pdf/2406.09105) [New]
+## [INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance](https://arxiv.org/pdf/2406.09105)
*Chenwei Lin, Hanjia Lyu, Xian Xu, Jiebo Luo*
@@ -888,7 +1599,7 @@ evaluation code are available at https://github.com/FDU-INS/INS-MMBench.
-## [Solving the Clustering Reasoning Problems by Modeling a Deep-Learning-Based Probabilistic Model](https://arxiv.org/pdf/2403.03173) [New]
+## [Solving the Clustering Reasoning Problems by Modeling a Deep-Learning-Based Probabilistic Model](https://arxiv.org/pdf/2403.03173)
*Ruizhuo Song, Beiming Yuan*
@@ -921,7 +1632,7 @@ systems.
-## [A3VLM: Actionable Articulation-Aware Vision Language Model](https://arxiv.org/pdf/2406.07549) [New]
+## [A3VLM: Actionable Articulation-Aware Vision Language Model](https://arxiv.org/pdf/2406.07549)
*Siyuan Huang, Haonan Chang, Yuhan Liu, Yimeng Zhu, Hao Dong, Peng Gao, Abdeslam Boularias, Hongsheng Li*
@@ -945,7 +1656,7 @@ https://github.com/changhaonan/A3VLM.
-## [Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense?](https://arxiv.org/pdf/2406.07546) [New]
+## [Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense?](https://arxiv.org/pdf/2406.07546)
*Xingyu Fu, Muyu He, Yujie Lu, William Yang Wang, Dan Roth*
@@ -977,7 +1688,7 @@ image generation.
-## [Visual Transformation Telling](https://arxiv.org/pdf/2305.01928) [New]
+## [Visual Transformation Telling](https://arxiv.org/pdf/2305.01928)
*Wanqing Cui, Xin Hong, Yanyan Lan, Liang Pang, Jiafeng Guo, Xueqi Cheng*
@@ -1005,7 +1716,7 @@ face challenges in VTT, highlighting substantial areas for improvement.
-## [Eyeballing Combinatorial Problems: A Case Study of Using Multimodal Large Language Models to Solve Traveling Salesman Problems](https://arxiv.org/pdf/2406.06865) [New]
+## [Eyeballing Combinatorial Problems: A Case Study of Using Multimodal Large Language Models to Solve Traveling Salesman Problems](https://arxiv.org/pdf/2406.06865)
*Mohammed Elhenawy, Ahmed Abdelhay, Taqwa I. Alhadidi, Huthaifa I Ashqar, Shadi Jaradat, Ahmed Jaber, Sebastien Glaser, Andry Rakotonirainy*
@@ -1026,7 +1737,7 @@ into MLLMs' visual reasoning abilities to tackle other combinatorial problems.
-## [ALGO: Object-Grounded Visual Commonsense Reasoning for Open-World Egocentric Action Recognition](https://arxiv.org/pdf/2406.05722) [New]
+## [ALGO: Object-Grounded Visual Commonsense Reasoning for Open-World Egocentric Action Recognition](https://arxiv.org/pdf/2406.05722)
*Sanjoy Kundu, Shubham Trehan, Sathyanarayanan N. Aakur*
@@ -1056,7 +1767,7 @@ open-world activity inference.
-## [HENASY: Learning to Assemble Scene-Entities for Egocentric Video-Language Model](https://arxiv.org/pdf/2406.00307) [New]
+## [HENASY: Learning to Assemble Scene-Entities for Egocentric Video-Language Model](https://arxiv.org/pdf/2406.00307)
*Khoa Vo, Thinh Phan, Kashu Yamazaki, Minh Tran, Ngan Le*
@@ -1088,7 +1799,7 @@ language query, and moments query.
-## [Dragonfly: Multi-Resolution Zoom Supercharges Large Visual-Language Model](https://arxiv.org/pdf/2406.00977) [New]
+## [Dragonfly: Multi-Resolution Zoom Supercharges Large Visual-Language Model](https://arxiv.org/pdf/2406.00977)
*Kezhen Chen, Rahul Thapa, Rahul Chalamala, Ben Athiwaratkun, Shuaiwen Leon Song, James Zou*
@@ -1122,7 +1833,7 @@ model are available at https://github.com/togethercomputer/Dragonfly.
-## [Slot Abstractors: Toward Scalable Abstract Visual Reasoning](https://arxiv.org/pdf/2403.03458) [New]
+## [Slot Abstractors: Toward Scalable Abstract Visual Reasoning](https://arxiv.org/pdf/2403.03458)
*Shanka Subhra Mondal, Jonathan D. Cohen, Taylor W. Webb*
@@ -1151,7 +1862,7 @@ well as an abstract reasoning task involving real-world images.
-## [GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest](https://arxiv.org/pdf/2307.03601) [New]
+## [GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest](https://arxiv.org/pdf/2307.03601)
*Shilong Zhang, Peize Sun, Shoufa Chen, Min Xiao, Wenqi Shao, Wenwei Zhang, Yu Liu, Kai Chen, Ping Luo*
@@ -1983,34 +2694,6 @@ https://github.com/chancharikmitra/CCoT
-## [PropTest: Automatic Property Testing for Improved Visual Programming](https://arxiv.org/pdf/2403.16921)
-
-*Jaywon Koo, Ziyan Yang, Paola Cascante-Bonilla, Baishakhi Ray, Vicente Ordonez*
-
-**Abstract:** Visual Programming has emerged as an alternative to end-to-end black-box
-visual reasoning models. This type of methods leverage Large Language Models
-(LLMs) to decompose a problem and generate the source code for an executable
-computer program. This strategy has the advantage of offering an interpretable
-reasoning path and does not require finetuning a model with task-specific data.
-We propose PropTest, a general strategy that improves visual programming by
-further using an LLM to generate code that tests for visual properties in an
-initial round of proposed solutions. Particularly, our method tests for
-data-type consistency, as well as syntactic and semantic properties in the
-generated solutions. Our proposed solution outperforms baselines and achieves
-comparable results to state-of-the-art methods while using smaller and publicly
-available LLMs (CodeLlama-7B and WizardCoder-15B). This is demonstrated across
-different benchmarks on visual question answering and referring expression
-comprehension, showing the efficacy of our approach in enhancing the
-performance and generalization of visual reasoning tasks. Specifically,
-PropTest improves ViperGPT by obtaining 48.66% accuracy (+8.3%) on the A-OKVQA
-benchmark and 52.8% (+3.3%) on the RefCOCO+ benchmark using CodeLlama-7B.
-
-**comment:** *Project Page: https://jaywonkoo17.github.io/PropTest/*
-
-**published:** *2024-03-25 16:39:15*, **updated:** *2024-03-25 16:39:15*
-
-
-
## [VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding](https://arxiv.org/pdf/2403.14743)
*Ahmad Mahmood, Ashmal Vayani, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan*
@@ -2109,37 +2792,6 @@ consistency.
-## [HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning](https://arxiv.org/pdf/2403.12884)
-
-*Fucai Ke, Zhixi Cai, Simindokht Jahangard, Weiqing Wang, Pari Delir Haghighi, Hamid Rezatofighi*
-
-**Abstract:** Recent advances in visual reasoning (VR), particularly with the aid of Large
-Vision-Language Models (VLMs), show promise but require access to large-scale
-datasets and face challenges such as high computational costs and limited
-generalization capabilities. Compositional visual reasoning approaches have
-emerged as effective strategies; however, they heavily rely on the commonsense
-knowledge encoded in Large Language Models (LLMs) to perform planning,
-reasoning, or both, without considering the effect of their decisions on the
-visual reasoning process, which can lead to errors or failed procedures. To
-address these challenges, we introduce HYDRA, a multi-stage dynamic
-compositional visual reasoning framework designed for reliable and
-incrementally progressive general reasoning. HYDRA integrates three essential
-modules: a planner, a Reinforcement Learning (RL) agent serving as a cognitive
-controller, and a reasoner. The planner and reasoner modules utilize an LLM to
-generate instruction samples and executable code from the selected instruction,
-respectively, while the RL agent dynamically interacts with these modules,
-making high-level decisions on selection of the best instruction sample given
-information from the historical state stored through a feedback loop. This
-adaptable design enables HYDRA to adjust its actions based on previous feedback
-received during the reasoning process, leading to more reliable reasoning
-outputs and ultimately enhancing its overall effectiveness. Our framework
-demonstrates state-of-the-art performance in various VR tasks on four different
-widely-used datasets.
-
-**published:** *2024-03-19 16:31:30*, **updated:** *2024-03-19 16:31:30*
-
-
-
## [Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World](https://arxiv.org/pdf/2310.10207)
*Rujie Wu, Xiaojian Ma, Zhenliang Zhang, Wei Wang, Qing Li, Song-Chun Zhu, Yizhou Wang*
@@ -2517,36 +3169,6 @@ lays the groundwork for future research endeavors.
-## [Visual Abductive Reasoning Meets Driving Hazard Prediction](https://arxiv.org/pdf/2310.04671)
-
-*Korawat Charoenpitaks, Van-Quang Nguyen, Masanori Suganuma, Masahiro Takahashi, Ryoma Niihara, Takayuki Okatani*
-
-**Abstract:** This paper addresses the problem of predicting hazards that drivers may
-encounter while driving a car. We formulate it as a task of anticipating
-impending accidents using a single input image captured by car dashcams. Unlike
-existing approaches to driving hazard prediction that rely on computational
-simulations or anomaly detection from videos, this study focuses on high-level
-inference from static images. The problem needs predicting and reasoning about
-future events based on uncertain observations, which falls under visual
-abductive reasoning. To enable research in this understudied area, a new
-dataset named the DHPR (Driving Hazard Prediction and Reasoning) dataset is
-created. The dataset consists of 15K dashcam images of street scenes, and each
-image is associated with a tuple containing car speed, a hypothesized hazard
-description, and visual entities present in the scene. These are annotated by
-human annotators, who identify risky scenes and provide descriptions of
-potential accidents that could occur a few seconds later. We present several
-baseline methods and evaluate their performance on our dataset, identifying
-remaining issues and discussing future directions. This study contributes to
-the field by introducing a novel problem formulation and dataset, enabling
-researchers to explore the potential of multi-modal AI for driving hazard
-prediction.
-
-**comment:** *Main Paper: 10 pages, Supplementary Materials: 28 pages*
-
-**published:** *2023-10-07 03:16:30*, **updated:** *2024-02-27 14:22:09*
-
-
-
## [VISREAS: Complex Visual Reasoning with Unanswerable Questions](https://arxiv.org/pdf/2403.10534)
*Syeda Nahida Akter, Sangwu Lee, Yingshan Chang, Yonatan Bisk, Eric Nyberg*
@@ -2889,44 +3511,6 @@ concerned.
-## [A Region-Prompted Adapter Tuning for Visual Abductive Reasoning](https://arxiv.org/pdf/2303.10428)
-
-*Hao Zhang, Yeo Keat Ee, Basura Fernando*
-
-**Abstract:** Visual Abductive Reasoning is an emerging vision-language (VL) topic where
-the model needs to retrieve/generate a likely textual hypothesis from a visual
-input (image or its part) using backward reasoning based on commonsense. Unlike
-in conventional VL retrieval or captioning tasks, where entities of texts
-appear in the image, in abductive inferences, the relevant facts about
-inferences are not readily apparent in the input images. Besides, these
-inferences are causally linked to specific regional visual cues and would
-change as cues change. Existing works highlight cues utilizing a specific
-prompt (e.g., colorful prompt). Then, a full fine-tuning of a VL foundation
-model is launched to tweak its function from perception to deduction. However,
-the colorful prompt uniformly patchify ``regional hints'' and ``global
-context'' at the same granularity level and may lose fine-grained visual
-details crucial for VAR. Meanwhile, full fine-tuning of VLF on limited data
-would easily be overfitted.
- To tackle this, we propose a simple yet effective Region-Prompted Adapter
-(RPA), a hybrid parameter-efficient fine-tuning method that leverages the
-strengths of detailed cues and efficient training for the VAR task.
-RPA~consists of two novel modules: Regional Prompt Generator (RPG) and
-Adapter$^\textbf{+}$. The prior encodes ``regional visual hints'' and ``global
-contexts'' into visual prompts separately at fine and coarse-grained levels.
-The latter extends the vanilla adapters with a new Map Adapter, which modifies
-the attention map using a trainable low-dim query/key projection. Additionally,
-we propose a new Dual-Contrastive Loss to regress the visual feature toward
-features of factual description and plausible hypothesis. Experiments on the
-Sherlock demonstrate that RPA outperforms previous SOTAs, achieving the 1st
-rank on leaderboards (Comparison to Human Accuracy: RPA~31.74 vs CPT-CLIP
-29.58).
-
-**comment:** *13 pages, 11 figures, Under Review of IEEE Transaction*
-
-**published:** *2023-03-18 14:46:44*, **updated:** *2024-01-07 05:06:26*
-
-
-
## [Multi-modal Large Language Model Enhanced Pseudo 3D Perception Framework for Visual Commonsense Reasoning](https://arxiv.org/pdf/2301.13335)
*Jian Zhu, Hanli Wang, Miaojing Shi*