+
+ Computer Vision and Pattern Recognition (104 papers)
+
+ + ☆ Tactile-Augmented Radiance Fields CVPR 2024 +
+ +
+ We present a scene representation, which we call a tactile-augmented radiance
+field (TaRF), that brings vision and touch into a shared 3D space. This
+representation can be used to estimate the visual and tactile signals for a
+given 3D position within a scene. We capture a scene's TaRF from a collection
+of photos and sparsely sampled touch probes. Our approach makes use of two
+insights: (i) common vision-based touch sensors are built on ordinary cameras
+and thus can be registered to images using methods from multi-view geometry,
+and (ii) visually and structurally similar regions of a scene share the same
+tactile features. We use these insights to register touch signals to a captured
+visual scene, and to train a conditional diffusion model that, provided with an
+RGB-D image rendered from a neural radiance field, generates its corresponding
+tactile signal. To evaluate our approach, we collect a dataset of TaRFs. This
+dataset contains more touch samples than previous real-world datasets, and it
+provides spatially aligned visual signals for each captured touch signal. We
+demonstrate the accuracy of our cross-modal generative model and the utility of
+the captured visual-tactile data on several downstream tasks. Project page:
+https://dou-yiming.github.io/TaRF
+
+
+
+ comment: CVPR 2024, Project page: https://dou-yiming.github.io/TaRF, Code:
+ https://github.com/Dou-Yiming/TaRF/
+
+
+
+ + ☆ ChatHuman: Language-driven 3D Human Understanding with Retrieval-Augmented Tool Reasoning +
+ +
+ Numerous methods have been proposed to detect, estimate, and analyze
+properties of people in images, including the estimation of 3D pose, shape,
+contact, human-object interaction, emotion, and more. Each of these methods
+works in isolation instead of synergistically. Here we address this problem and
+build a language-driven human understanding system -- ChatHuman, which combines
+and integrates the skills of many different methods. To do so, we finetune a
+Large Language Model (LLM) to select and use a wide variety of existing tools
+in response to user inputs. In doing so, ChatHuman is able to combine
+information from multiple tools to solve problems more accurately than the
+individual tools themselves and to leverage tool output to improve its ability
+to reason about humans. The novel features of ChatHuman include leveraging
+academic publications to guide the application of 3D human-related tools,
+employing a retrieval-augmented generation model to generate
+in-context-learning examples for handling new tools, and discriminating and
+integrating tool results to enhance 3D human understanding. Our experiments
+show that ChatHuman outperforms existing models in both tool selection accuracy
+and performance across multiple 3D human-related tasks. ChatHuman is a step
+towards consolidating diverse methods for human analysis into a single,
+powerful system for 3D human reasoning.
+
+
+
+ comment: Project page: https://chathuman.github.io
+
+
+
+ + ☆ Edit-Your-Motion: Space-Time Diffusion Decoupling Learning for Video Motion Editing +
+
+
+
+
+
+
+
+ Yi Zuo, Lingling Li, Licheng Jiao, Fang Liu, Xu Liu, Wenping Ma, Shuyuan Yang, Yuwei Guo
+
+
+ Existing diffusion-based video editing methods have achieved impressive
+results in motion editing. Most of the existing methods focus on the motion
+alignment between the edited video and the reference video. However, these
+methods do not constrain the background and object content of the video to
+remain unchanged, so the edited results may contain unintended changes. In
+this paper, we propose a one-shot video motion editing method called
+Edit-Your-Motion that requires only a single text-video pair for training.
+Specifically, we design the Detailed Prompt-Guided Learning Strategy (DPL) to
+decouple spatio-temporal features in space-time diffusion models. DPL separates
+learning object content and motion into two training stages. In the first
+training stage, we focus on learning the spatial features (the features of
+object content) and breaking down the temporal relationships in the video
+frames by shuffling them. We further propose Recurrent-Causal Attention
+(RC-Attn) to learn the consistent content features of the object from unordered
+video frames. In the second training stage, we restore the temporal
+relationship in video frames to learn the temporal feature (the features of the
+background and object's motion). We also adopt the Noise Constraint Loss to
+smooth out inter-frame differences. Finally, in the inference stage, we inject
+the content features of the source object into the editing branch through a
+two-branch structure (editing branch and reconstruction branch). With
+Edit-Your-Motion, users can edit the motion of objects in the source video to
+generate more exciting and diverse videos. Comprehensive qualitative
+experiments, quantitative experiments and user preference studies demonstrate
+that Edit-Your-Motion performs better than other methods.
+
+
+
+
+ + ☆ S3Former: Self-supervised High-resolution Transformer for Solar PV Profiling +
+
+
+
+
+
+
+
+ Minh Tran, Adrian De Luis, Haitao Liao, Ying Huang, Roy McCann, Alan Mantooth, Jack Cothren, Ngan Le
+
+
+ As the impact of climate change escalates, the global necessity to transition
+to sustainable energy sources becomes increasingly evident. Renewable energies
+have emerged as a viable solution for users, with Photovoltaic energy being a
+favored choice for small installations due to its reliability and efficiency.
+Accurate mapping of PV installations is crucial for understanding the extent
+of its adoption and informing energy policy. To meet this need, we introduce
+S3Former, designed to segment solar panels from aerial imagery and provide size
+and location information critical for analyzing the impact of such
+installations on the grid. Solar panel identification is challenging due to
+factors such as varying weather conditions, roof characteristics, Ground
+Sampling Distance variations and lack of appropriate initialization weights for
+optimized training. To tackle these complexities, S3Former features a Masked
+Attention Mask Transformer incorporating a self-supervised learning pretrained
+backbone. Specifically, our model leverages low-level and high-level features
+extracted from the backbone and incorporates an instance query mechanism
+into the Transformer architecture to enhance the localization of
+solar PV installations. We introduce a self-supervised learning phase (pretext
+task) to improve the initialization weights on the backbone of S3Former. We
+evaluated S3Former on diverse datasets, demonstrating improvements over
+state-of-the-art models.
+
+
+
+ comment: Preprint
+
+
+
+ + ☆ A Significantly Better Class of Activation Functions Than ReLU Like Activation Functions +
+ +
+ This paper introduces a significantly better class of activation functions
+than the almost universally used ReLU like and Sigmoidal class of activation
+functions. Two new activation functions referred to as the Cone and
+Parabolic-Cone that differ drastically from popular activation functions and
+significantly outperform these on the CIFAR-10 and Imagenette benchmarks are
+proposed. The cone activation functions are positive only on a finite interval,
+are zero at the end-points of that interval, and are strictly negative
+elsewhere. Thus the set of inputs that produce a positive output for a neuron
+with cone activation functions is a hyperstrip rather than the usual
+half-space. Since a hyperstrip is the region between two parallel
+hyper-planes, it allows neurons to more finely divide the input feature space
+into positive and negative classes than with infinitely wide half-spaces. In
+particular, the XOR function can be learned by a single neuron with cone-like
+activation functions. Both the cone and parabolic-cone activation functions are
+shown to achieve higher accuracies with significantly fewer neurons on
+benchmarks. The results presented in this paper indicate that many nonlinear
+real-world datasets may be separated with fewer hyperstrips than half-spaces.
+The Cone and Parabolic-Cone activation functions have larger derivatives than
+ReLU and are shown to significantly speed up training.
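+
+ As a rough, illustrative sketch (not code from the paper), a cone-shaped
+activation with the properties described above could look as follows, assuming
+the simple forms 1 - |z - 1| for the Cone and z(2 - z) for the Parabolic-Cone;
+the exact parameterizations used in the paper may differ.
+
+    import torch
+    import torch.nn as nn
+
+    class ConeActivation(nn.Module):
+        # Assumed form: positive on (0, 2), zero at the end-points, negative
+        # elsewhere, so a neuron fires only inside a hyperstrip of input space.
+        def forward(self, z: torch.Tensor) -> torch.Tensor:
+            return 1.0 - torch.abs(z - 1.0)
+
+    class ParabolicConeActivation(nn.Module):
+        # Smooth variant with the same sign pattern: z * (2 - z).
+        def forward(self, z: torch.Tensor) -> torch.Tensor:
+            return z * (2.0 - z)
+
+ For example, a single neuron with weights (1, 1), bias 0 and this assumed cone
+activation outputs a positive value exactly for the XOR-positive inputs (0, 1)
+and (1, 0), matching the hyperstrip argument above.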
+
+
+
+ comment: 14 pages
+
+
+
+ + ☆ Towards Geographic Inclusion in the Evaluation of Text-to-Image Models +
+
+
+
+
+
+
+
+ Melissa Hall, Samuel J. Bell, Candace Ross, Adina Williams, Michal Drozdzal, Adriana Romero Soriano
+
+
+ Rapid progress in text-to-image generative models coupled with their
+deployment for visual content creation has magnified the importance of
+thoroughly evaluating their performance and identifying potential biases. In
+pursuit of models that generate images that are realistic, diverse, visually
+appealing, and consistent with the given prompt, researchers and practitioners
+often turn to automated metrics to facilitate scalable and cost-effective
+performance profiling. However, commonly-used metrics often fail to account for
+the full diversity of human preference; often even in-depth human evaluations
+face challenges with subjectivity, especially as interpretations of evaluation
+criteria vary across regions and cultures. In this work, we conduct a large,
+cross-cultural study of how much annotators in Africa, Europe, and
+Southeast Asia vary in their perception of geographic representation, visual
+appeal, and consistency in real and generated images from state-of-the-art
+public APIs. We collect over 65,000 image annotations and 20 survey responses.
+We contrast human annotations with common automated metrics, finding that human
+preferences vary notably across geographic location and that current metrics do
+not fully account for this diversity. For example, annotators in different
+locations often disagree on whether exaggerated, stereotypical depictions of a
+region are considered geographically representative. In addition, the utility
+of automatic evaluations is dependent on assumptions about their set-up, such
+as the alignment of feature extractors with human perception of object
+similarity or the definition of "appeal" captured in reference datasets used to
+ground evaluations. We recommend steps for improved automatic and human
+evaluations.
+
+
+
+
+ + ☆ AugmenTory: A Fast and Flexible Polygon Augmentation Library +
+
+
+
+
+
+
+
+ Tanaz Ghahremani, Mohammad Hoseyni, Mohammad Javad Ahmadi, Pouria Mehrabi, Amirhossein Nikoofard
+
+
+ Data augmentation is a key technique for addressing the challenge of limited
+datasets and has become a major component of image-processing training
+procedures. Techniques such as geometric transformations and color space
+adjustments have been thoroughly tested for their ability to artificially
+expand training datasets and generate semi-realistic data for training
+purposes. Polygons play a crucial role in instance
+segmentation and have seen a surge in use across advanced models, such as
+YOLOv8. Despite their growing popularity, the lack of specialized libraries
+hampers the polygon-augmentation process. This paper introduces a novel
+solution to this challenge, embodied in the newly developed AugmenTory library.
+Notably, AugmenTory offers reduced computational demands in both time and space
+compared to existing methods. Additionally, the library includes a
+postprocessing thresholding feature. The AugmenTory package is publicly
+available on GitHub, where interested users can access the source code:
+https://github.com/Smartory/AugmenTory
+
+
+
+
+ + ☆ DistGrid: Scalable Scene Reconstruction with Distributed Multi-resolution Hash Grid +
+ +
+ Neural Radiance Field~(NeRF) achieves extremely high quality in object-scaled
+and indoor scene reconstruction. However, there exist some challenges when
+reconstructing large-scale scenes. MLP-based NeRFs suffer from limited network
+capacity, while volume-based NeRFs are heavily memory-consuming when the scene
+resolution increases. Recent approaches propose to geographically partition the
+scene and learn each sub-region using an individual NeRF. Such partitioning
+strategies help volume-based NeRF exceed the single GPU memory limit and scale
+to larger scenes. However, this approach requires multiple background NeRFs to
+handle out-of-partition rays, which leads to redundant learning. Inspired
+by the fact that the background of the current partition is the foreground of
+an adjacent partition, we propose a scalable scene reconstruction method based on
+joint Multi-resolution Hash Grids, named DistGrid. In this method, the scene is
+divided into multiple closely paved yet non-overlapping Axis-Aligned Bounding
+Boxes, and a novel segmented volume rendering method is proposed to handle
+cross-boundary rays, thereby eliminating the need for background NeRFs. The
+experiments demonstrate that our method outperforms existing methods on all
+evaluated large-scale scenes, and provides visually plausible scene
+reconstruction. The scalability of our method with respect to reconstruction quality is
+further evaluated qualitatively and quantitatively.
+
+
+
+ comment: Originally submitted to Siggraph Asia 2023
+
+
+
+ + ☆ DocRes: A Generalist Model Toward Unifying Document Image Restoration Tasks CVPR 2024 +
+ +
+ Document image restoration is a crucial aspect of Document AI systems, as the
+quality of document images significantly influences the overall performance.
+Prevailing methods address distinct restoration tasks independently, leading to
+intricate systems and the incapability to harness the potential synergies of
+multi-task learning. To overcome this challenge, we propose DocRes, a
+generalist model that unifies five document image restoration tasks including
+dewarping, deshadowing, appearance enhancement, deblurring, and binarization.
+To instruct DocRes to perform various restoration tasks, we propose a novel
+visual prompt approach called Dynamic Task-Specific Prompt (DTSPrompt). The
+DTSPrompt for different tasks comprises distinct prior features, which are
+additional characteristics extracted from the input image. Beyond its role as a
+cue for task-specific execution, DTSPrompt can also serve as supplementary
+information to enhance the model's performance. Moreover, DTSPrompt is more
+flexible than prior visual prompt approaches as it can be seamlessly applied
+and adapted to inputs with high and variable resolutions. Experimental results
+demonstrate that DocRes achieves competitive or superior performance compared
+to existing state-of-the-art task-specific models. This underscores the
+potential of DocRes across a broader spectrum of document image restoration
+tasks. The source code is publicly available at
+https://github.com/ZZZHANG-jx/DocRes
+
+
+
+ comment: Accepted by CVPR 2024
+
+
+
+ + ☆ Vision Mamba: A Comprehensive Survey and Taxonomy +
+ +
+ State Space Model (SSM) is a mathematical model used to describe and analyze
+the behavior of dynamic systems. This model has witnessed numerous applications
+in several fields, including control theory, signal processing, economics and
+machine learning. In the field of deep learning, state space models are used to
+process sequence data, such as time series analysis, natural language
+processing (NLP) and video understanding. By mapping sequence data to state
+space, long-term dependencies in the data can be better captured. In
+particular, modern SSMs have shown strong representational capabilities in NLP,
+especially in long sequence modeling, while maintaining linear time complexity.
+Notably, based on the latest state-space models, Mamba merges time-varying
+parameters into SSMs and formulates a hardware-aware algorithm for efficient
+training and inference. Given its impressive efficiency and strong long-range
+dependency modeling capability, Mamba is expected to become a new AI
+architecture that may outperform Transformer. Recently, a number of works have
+attempted to study the potential of Mamba in various fields, such as general
+vision, multi-modal, medical image analysis and remote sensing image analysis,
+by extending Mamba from natural language domain to visual domain. To fully
+understand Mamba in the visual domain, we conduct a comprehensive survey and
+present a taxonomy study. This survey focuses on Mamba's application to a
+variety of visual tasks and data types, and discusses its predecessors, recent
+advances and far-reaching impact on a wide range of domains. Since Mamba is now
+on an upward trend, please let us know if you have new findings, and new
+progress on Mamba will be included in this survey in a timely manner and
+updated on the Mamba project at
+https://github.com/lx6c78/Vision-Mamba-A-Comprehensive-Survey-and-Taxonomy.
+
+
+
+ comment: https://github.com/lx6c78/Vision-Mamba-A-Comprehensive-Survey-and-Taxonomy
+
+
+
+ + ☆ Learning To See But Forgetting To Follow: Visual Instruction Tuning Makes LLMs More Prone To Jailbreak Attacks +
+ +
+ Augmenting Large Language Models (LLMs) with image-understanding capabilities
+has resulted in a boom of high-performing Vision-Language models (VLMs). While
+studying the alignment of LLMs to human values has received widespread
+attention, the safety of VLMs has not received the same attention. In this
+paper, we explore the impact of jailbreaking on three state-of-the-art VLMs,
+each using a distinct modeling approach. By comparing each VLM to their
+respective LLM backbone, we find that each VLM is more susceptible to
+jailbreaking. We consider this an undesirable outcome of visual
+instruction-tuning, which imposes a forgetting effect on an LLM's safety
+guardrails. Therefore, we provide recommendations for future work based on
+evaluation strategies that aim to highlight the weaknesses of a VLM, as well as
+take safety measures into account during visual instruction tuning.
+
+
+
+
+ + ☆ BILTS: A novel bi-invariant local trajectory-shape descriptor for rigid-body motion +
+ +
+ Measuring the similarity between motions and established motion models is
+crucial for motion analysis, recognition, generation, and adaptation. To
+enhance similarity measurement across diverse contexts, invariant motion
+descriptors have been proposed. However, for rigid-body motion, few invariant
+descriptors exist that are bi-invariant, meaning invariant to both the body and
+world reference frames used to describe the motion. Moreover, their robustness
+to singularities is limited. This paper introduces a novel Bi-Invariant Local
+Trajectory-Shape descriptor (BILTS) and a corresponding dissimilarity measure.
+Mathematical relationships between BILTS and existing descriptors are derived,
+providing new insights into their properties. The paper also includes an
+algorithm to reproduce the motion from the BILTS descriptor, demonstrating its
+bidirectionality and usefulness for trajectory generation. Experimental
+validation using datasets of daily-life activities shows the higher robustness
+of the BILTS descriptor compared to the bi-invariant ISA descriptor. This
+higher robustness supports the further application of bi-invariant descriptors
+for motion recognition and generalization.
+
+
+
+ comment: This work has been submitted as a regular research paper for
+ consideration in the IEEE Transactions on Robotics. Copyright may be
+ transferred without notice, after which this version may no longer be
+ accessible
+
+
+
+ + ☆ DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving CVPR2024 +
+
+
+
+
+
+
+
+ Chen Min, Dawei Zhao, Liang Xiao, Jian Zhao, Xinli Xu, Zheng Zhu, Lei Jin, Jianshu Li, Yulan Guo, Junliang Xing, Liping Jing, Yiming Nie, Bin Dai
+
+
+ Vision-centric autonomous driving has recently attracted wide attention due to
+its lower cost. Pre-training is essential for extracting a universal
+representation. However, current vision-centric pre-training typically relies
+on either 2D or 3D pretext tasks, overlooking the temporal characteristics of
+autonomous driving as a 4D scene understanding task. In this paper, we address
+this challenge by introducing a world model-based autonomous driving 4D
+representation learning framework, dubbed DriveWorld, which is capable
+of pre-training from multi-camera driving videos in a spatio-temporal fashion.
+Specifically, we propose a Memory State-Space Model for spatio-temporal
+modelling, which consists of a Dynamic Memory Bank module for learning
+temporal-aware latent dynamics to predict future changes and a Static Scene
+Propagation module for learning spatial-aware latent statics to offer
+comprehensive scene contexts. We additionally introduce a Task Prompt to
+decouple task-aware features for various downstream tasks. The experiments
+demonstrate that DriveWorld delivers promising results on various autonomous
+driving tasks. When pre-trained with the OpenScene dataset, DriveWorld achieves
+a 7.5% increase in mAP for 3D object detection, a 3.0% increase in IoU for
+online mapping, a 5.0% increase in AMOTA for multi-object tracking, a 0.1m
+decrease in minADE for motion forecasting, a 3.0% increase in IoU for occupancy
+prediction, and a 0.34m reduction in average L2 error for planning.
+
+
+
+ comment: Accepted by CVPR2024
+
+
+
+ + ☆ Splat-MOVER: Multi-Stage, Open-Vocabulary Robotic Manipulation via Editable Gaussian Splatting +
+
+
+
+
+
+
+
+ Ola Shorinwa, Johnathan Tucker, Aliyah Smith, Aiden Swann, Timothy Chen, Roya Firoozi, Monroe Kennedy III, Mac Schwager
+
+
+ We present Splat-MOVER, a modular robotics stack for open-vocabulary robotic
+manipulation, which leverages the editability of Gaussian Splatting (GSplat)
+scene representations to enable multi-stage manipulation tasks. Splat-MOVER
+consists of: (i) ASK-Splat, a GSplat representation that distills
+latent codes for language semantics and grasp affordance into the 3D scene.
+ASK-Splat enables geometric, semantic, and affordance understanding of 3D
+scenes, which is critical for many robotics tasks; (ii) SEE-Splat, a
+real-time scene-editing module using 3D semantic masking and infilling to
+visualize the motions of objects that result from robot interactions in the
+real world. SEE-Splat creates a "digital twin" of the evolving environment
+throughout the manipulation task; and (iii) Grasp-Splat, a grasp
+generation module that uses ASK-Splat and SEE-Splat to propose candidate grasps
+for open-world objects. ASK-Splat is trained in real-time from RGB images in a
+brief scanning phase prior to operation, while SEE-Splat and Grasp-Splat run in
+real-time during operation. We demonstrate the superior performance of
+Splat-MOVER in hardware experiments on a Kinova robot compared to two recent
+baselines in four single-stage, open-vocabulary manipulation tasks, as well as
+in four multi-stage manipulation tasks using the edited scene to reflect scene
+changes due to prior manipulation stages, which is not possible with the
+existing baselines. Code for this project and a link to the project page will
+be made available soon.
+
+
+
+
+ + ☆ Choose What You Need: Disentangled Representation Learning for Scene Text Recognition, Removal and Editing CVPR 2024 +
+ +
+ Scene text images contain not only style information (font, background) but
+also content information (character, texture). Different scene text tasks need
+different information, but previous representation learning methods use tightly
+coupled features for all tasks, resulting in sub-optimal performance. We
+propose a Disentangled Representation Learning framework (DARLING) aimed at
+disentangling these two types of features for improved adaptability in better
+addressing various downstream tasks (choose what you really need).
+Specifically, we synthesize a dataset of image pairs with identical style but
+different content. Based on the dataset, we decouple the two types of features
+by the supervision design. Specifically, we directly split the visual
+representation into style and content features; the content features are
+supervised by a text recognition loss, while an alignment loss aligns the
+style features in the
+image pairs. Then, style features are employed in reconstructing the
+counterpart image via an image decoder with a prompt that indicates the
+counterpart's content. Such an operation effectively decouples the features
+based on their distinctive properties. To the best of our knowledge, this is
+the first work in the field of scene text to disentangle the inherent
+properties of text images. Our method achieves state-of-the-art performance
+in Scene Text Recognition, Removal, and Editing.
+
+
+
+ comment: Accepted to CVPR 2024
+
+
+
+ + ☆ Diff-IP2D: Diffusion-Based Hand-Object Interaction Prediction on Egocentric Videos +
+ +
+ Understanding how humans would behave during hand-object interaction is vital
+for applications in service robot manipulation and extended reality. To achieve
+this, some recent works have been proposed to simultaneously predict hand
+trajectories and object affordances on human egocentric videos. They are
+regarded as the representation of future hand-object interactions, indicating
+potential human motion and motivation. However, the existing approaches mostly
+adopt the autoregressive paradigm for unidirectional prediction, which lacks
+mutual constraints within the holistic future sequence, and accumulates errors
+along the time axis. Meanwhile, these works basically overlook the effect of
+camera egomotion on first-person view predictions. To address these
+limitations, we propose a novel diffusion-based interaction prediction method,
+namely Diff-IP2D, to forecast future hand trajectories and object affordances
+concurrently in an iterative non-autoregressive manner. We transform the
+sequential 2D images into latent feature space and design a denoising diffusion
+model to predict future latent interaction features conditioned on past ones.
+Motion features are further integrated into the conditional denoising process
+to make Diff-IP2D aware of the camera wearer's dynamics for more accurate
+interaction prediction. The experimental results show that our method
+significantly outperforms the state-of-the-art baselines on both the
+off-the-shelf metrics and our proposed new evaluation protocol. This highlights
+the efficacy of leveraging a generative paradigm for 2D hand-object interaction
+prediction. The code of Diff-IP2D will be released at
+https://github.com/IRMVLab/Diff-IP2D.
+
+
+
+
+ + ☆ Diffusion-driven GAN Inversion for Multi-Modal Face Image Generation CVPR 2024 +
+ +
+ We present a new multi-modal face image generation method that converts a
+text prompt and a visual input, such as a semantic mask or scribble map, into a
+photo-realistic face image. To do this, we combine the strengths of Generative
+Adversarial Networks (GANs) and diffusion models (DMs) by mapping the
+multi-modal features of the DM into the latent space of the pre-trained GANs.
+We present a simple mapping and a style modulation network to link two models
+and convert meaningful representations in feature maps and attention maps into
+latent codes. With GAN inversion, the estimated latent codes can be used to
+generate 2D or 3D-aware facial images. We further present a multi-step training
+strategy that incorporates textual and structural representations into the
+generated image. Our proposed network produces realistic 2D, multi-view, and
+stylized face images, which align well with inputs. We validate our method by
+using pre-trained 2D and 3D GANs, and our results outperform existing methods.
+Our project page is available at
+https://github.com/1211sh/Diffusion-driven_GAN-Inversion/.
+
+
+
+ comment: Accepted by CVPR 2024
+
+
+
+ + ☆ Novel View Synthesis with Neural Radiance Fields for Industrial Robot Applications SP
+
+
+
+
+
+
+
+ Markus Hillemann, Robert Langendörfer, Max Heiken, Max Mehltretter, Andreas Schenk, Martin Weinmann, Stefan Hinz, Christian Heipke, Markus Ulrich
+
+
+ Neural Radiance Fields (NeRFs) have become a rapidly growing research field
+with the potential to revolutionize typical photogrammetric workflows, such as
+those used for 3D scene reconstruction. As input, NeRFs require multi-view
+images with corresponding camera poses as well as the interior orientation. In
+the typical NeRF workflow, the camera poses and the interior orientation are
+estimated in advance with Structure from Motion (SfM). But the quality of the
+resulting novel views, which depends on different parameters such as the number
+and distribution of available images, as well as the accuracy of the related
+camera poses and interior orientation, is difficult to predict. In addition,
+SfM is a time-consuming pre-processing step, and its quality strongly depends
+on the image content. Furthermore, the undefined scaling factor of SfM hinders
+subsequent steps in which metric information is required. In this paper, we
+evaluate the potential of NeRFs for industrial robot applications. We propose
+an alternative to SfM pre-processing: we capture the input images with a
+calibrated camera that is attached to the end effector of an industrial robot
+and determine accurate camera poses with metric scale based on the robot
+kinematics. We then investigate the quality of the novel views by comparing
+them to ground truth, and by computing an internal quality measure based on
+ensemble methods. For evaluation purposes, we acquire multiple datasets that
+pose challenges for reconstruction typical of industrial applications, like
+reflective objects, poor texture, and fine structures. We show that the
+robot-based pose determination reaches similar accuracy as SfM in non-demanding
+cases, while having clear advantages in more challenging scenarios. Finally, we
+present first results of applying the ensemble method to estimate the quality
+of the synthetic novel view in the absence of a ground truth.
+
+
+
+ comment: 8 pages, 8 figures, accepted for publication in The International
+ Archives of the Photogrammetry, Remote Sensing and Spatial Information
+ Sciences (ISPRS Archives) 2024
+
+
+
+ + ☆ Audio-Visual Speech Representation Expert for Enhanced Talking Face Video Generation and Evaluation CVPR2024 +
+
+
+
+
+
+
+
+ Dogucan Yaman, Fevziye Irem Eyiokur, Leonard Bärmann, Seymanur Aktı, Hazım Kemal Ekenel, Alexander Waibel
+
+
+ In the task of talking face generation, the objective is to generate a face
+video with lips synchronized to the corresponding audio while preserving visual
+details and identity information. Current methods face the challenge of
+learning accurate lip synchronization while avoiding detrimental effects on
+visual quality, as well as robustly evaluating such synchronization. To tackle
+these problems, we propose utilizing an audio-visual speech representation
+expert (AV-HuBERT) for calculating lip synchronization loss during training.
+Moreover, leveraging AV-HuBERT's features, we introduce three novel lip
+synchronization evaluation metrics, aiming to provide a comprehensive
+assessment of lip synchronization performance. Experimental results, along with
+a detailed ablation study, demonstrate the effectiveness of our approach and
+the utility of the proposed evaluation metrics.
+
+
+
+ comment: CVPR2024 NTIRE Workshop
+
+
+
+ + ☆ Inf-DiT: Upsampling Any-Resolution Image with Memory-Efficient Diffusion Transformer +
+
+
+
+
+
+
+
+ Zhuoyi Yang, Heyang Jiang, Wenyi Hong, Jiayan Teng, Wendi Zheng, Yuxiao Dong, Ming Ding, Jie Tang
+
+
+ Diffusion models have shown remarkable performance in image generation in
+recent years. However, due to a quadratic increase in memory when generating
+ultra-high-resolution images (e.g. 4096*4096), the resolution of generated
+images is often limited to 1024*1024. In this work, we propose a unidirectional
+block attention mechanism that can adaptively adjust the memory overhead during
+the inference process and handle global dependencies. Building on this module,
+we adopt the DiT structure for upsampling and develop an infinite
+super-resolution model capable of upsampling images of various shapes and
+resolutions. Comprehensive experiments show that our model achieves SOTA
+performance in generating ultra-high-resolution images in both machine and
+human evaluation. Compared to commonly used UNet structures, our model can save
+more than 5x memory when generating 4096*4096 images. The project URL is
+https://github.com/THUDM/Inf-DiT.
+
+
+
+
+ + ☆ Cross-IQA: Unsupervised Learning for Image Quality Assessment +
+ +
+ Automatic perception of image quality is a challenging problem that impacts
+billions of Internet and social media users daily. To advance research in this
+field, we propose a no-reference image quality assessment (NR-IQA) method
+termed Cross-IQA, based on the vision transformer (ViT) model. The proposed
+Cross-IQA method can learn image quality features from unlabeled image data.
+We construct a pretext task of synthesized-image reconstruction to extract
+image quality information in an unsupervised manner based on ViT blocks. The
+pretrained encoder of
+Cross-IQA is used to fine-tune a linear regression model for score prediction.
+Experimental results show that Cross-IQA can achieve state-of-the-art
+performance in assessing the low-frequency degradation information (e.g., color
+change, blurring, etc.) of images compared with the classical full-reference
+IQA and NR-IQA methods on the same datasets.
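+
+ As a generic illustration of the protocol described above (a pretrained
+encoder whose features feed a linear regression head for score prediction), a
+minimal sketch might look like the following; the encoder, feature dimension
+and the choice to keep the encoder frozen are assumptions, not details taken
+from the paper.
+
+    import torch
+    import torch.nn as nn
+
+    class LinearProbeIQA(nn.Module):
+        # Wraps any pretrained encoder that maps images to (B, feat_dim)
+        # features and trains only a linear head to regress quality scores.
+        def __init__(self, encoder: nn.Module, feat_dim: int):
+            super().__init__()
+            self.encoder = encoder
+            for p in self.encoder.parameters():
+                p.requires_grad = False  # keep pretext-task weights fixed
+            self.head = nn.Linear(feat_dim, 1)
+
+        def forward(self, images: torch.Tensor) -> torch.Tensor:
+            with torch.no_grad():
+                feats = self.encoder(images)
+            return self.head(feats).squeeze(-1)  # predicted quality per image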
+
+
+
+
+ + ☆ Non-rigid Structure-from-Motion: Temporally-smooth Procrustean Alignment and Spatially-variant Deformation Modeling CVPR 2024 +
+ +
+ Even though Non-rigid Structure-from-Motion (NRSfM) has been extensively
+studied and great progress has been made, there are still key challenges that
+hinder their broad real-world applications: 1) the inherent motion/rotation
+ambiguity requires either explicit camera motion recovery with extra constraint
+or complex Procrustean Alignment; 2) existing low-rank modeling of the global
+shape can over-penalize drastic deformations in the 3D shape sequence. This
+paper proposes to resolve the above issues from a spatial-temporal modeling
+perspective. First, we propose a novel Temporally-smooth Procrustean Alignment
+module that estimates 3D deforming shapes and adjusts the camera motion by
+aligning the 3D shape sequence consecutively. Our new alignment module remedies
+the requirement of a complex reference 3D shape during alignment, which is more
+conducive to non-isotropic deformation modeling. Second, we propose a
+spatial-weighted approach to enforce the low-rank constraint adaptively at
+different locations to accommodate drastic spatially-variant deformation
+reconstruction better. Our model outperforms existing low-rank-based methods,
+and extensive experiments across different datasets validate the effectiveness
+of our method.
+
+
+
+ comment: Accepted by CVPR 2024
+
+
+
+ + ☆ A New Dataset and Comparative Study for Aphid Cluster Detection and Segmentation in Sorghum Fields +
+
+
+
+
+
+
+
+ Raiyan Rahman, Christopher Indris, Goetz Bramesfeld, Tianxiao Zhang, Kaidong Li, Xiangyu Chen, Ivan Grijalva, Brian McCornack, Daniel Flippo, Ajay Sharda, Guanghui Wang
+
+
+ Aphid infestations are one of the primary causes of extensive damage to wheat
+and sorghum fields and are one of the most common vectors for plant viruses,
+resulting in significant agricultural yield losses. To address this problem,
+farmers often resort to the inefficient use of harmful chemical pesticides that
+have negative health and environmental impacts. As a result, a large amount of
+pesticide is wasted on areas without significant pest infestation. This brings
+to attention the urgent need for an intelligent autonomous system that can
+locate and spray sufficiently large infestations selectively within the complex
+crop canopies. We have developed a large multi-scale dataset for aphid cluster
+detection and segmentation, collected from actual sorghum fields and
+meticulously annotated to include clusters of aphids. Our dataset comprises a
+total of 54,742 image patches, showcasing a variety of viewpoints, diverse
+lighting conditions, and multiple scales, highlighting its effectiveness for
+real-world applications. In this study, we trained and evaluated four real-time
+semantic segmentation models and three object detection models specifically for
+aphid cluster segmentation and detection. Considering the balance between
+accuracy and efficiency, Fast-SCNN delivered the most effective segmentation
+results, achieving 80.46% mean precision, 81.21% mean recall, and 91.66 frames
+per second (FPS). For object detection, RT-DETR exhibited the best overall
+performance with a 61.63% mean average precision (mAP), 92.6% mean recall, and
+72.55 FPS on an NVIDIA V100 GPU. Our experiments further indicate that aphid
+cluster segmentation is more suitable for assessing aphid infestations than
+using detection models.
+
+
+
+
+ + ☆ ViewFormer: Exploring Spatiotemporal Modeling for Multi-View 3D Occupancy Perception via View-Guided Transformers +
+ +
+ 3D occupancy, an advanced perception technology for driving scenarios,
+represents the entire scene without distinguishing between foreground and
+background by quantifying the physical space into a grid map. The widely
+adopted projection-first deformable attention, efficient in transforming image
+features into 3D representations, encounters challenges in aggregating
+multi-view features due to sensor deployment constraints. To address this
+issue, we propose our learning-first view attention mechanism for effective
+multi-view feature aggregation. Moreover, we showcase the scalability of our
+view attention across diverse multi-view 3D tasks, such as map construction and
+3D object detection. Leveraging the proposed view attention as well as an
+additional multi-frame streaming temporal attention, we introduce ViewFormer, a
+vision-centric transformer-based framework for spatiotemporal feature
+aggregation. To further explore occupancy-level flow representation, we present
+FlowOcc3D, a benchmark built on top of existing high-quality datasets.
+Qualitative and quantitative analyses on this benchmark reveal the potential to
+represent fine-grained dynamic scenes. Extensive experiments show that our
+approach significantly outperforms prior state-of-the-art methods. The codes
+and benchmark will be released soon.
+
+
+
+
+ + ☆ Semi-Supervised Disease Classification based on Limited Medical Image Data +
+ +
+ In recent years, significant progress has been made in the field of learning
+from positive and unlabeled examples (PU learning), particularly in the context
+of advancing image and text classification tasks. However, applying PU learning
+to semi-supervised disease classification remains a formidable challenge,
+primarily due to the limited availability of labeled medical images. In the
+realm of medical image-aided diagnosis algorithms, numerous theoretical and
+practical obstacles persist. The research on PU learning for medical
+image-assisted diagnosis holds substantial importance, as it aims to reduce the
+time spent by professional experts in classifying images. Unlike natural
+images, medical images are typically accompanied by a scarcity of annotated
+data, while an abundance of unlabeled cases exists. Addressing these
+challenges, this paper introduces a novel generative model inspired by Hölder
+divergence, specifically designed for semi-supervised disease classification
+using positive and unlabeled medical image data. In this paper, we present a
+comprehensive formulation of the problem and establish its theoretical
+feasibility through rigorous mathematical analysis. To evaluate the
+effectiveness of our proposed approach, we conduct extensive experiments on
+five benchmark datasets commonly used in PU medical learning: BreastMNIST,
+PneumoniaMNIST, BloodMNIST, OCTMNIST, and AMD. The experimental results clearly
+demonstrate the superiority of our method over existing approaches based on KL
+divergence. Notably, our approach achieves state-of-the-art performance on all
+five disease classification benchmarks.
+ By addressing the limitations imposed by limited labeled data and harnessing
+the untapped potential of unlabeled medical images, our novel generative model
+presents a promising direction for enhancing semi-supervised disease
+classification in the field of medical image analysis.
+
+
+
+
+ + ☆ Group-aware Parameter-efficient Updating for Content-Adaptive Neural Video Compression +
+ +
+ Content-adaptive compression is crucial for enhancing the adaptability of the
+pre-trained neural codec for various contents. Although these methods have been
+very practical in neural image compression (NIC), their application in neural
+video compression (NVC) is still limited due to two main aspects: 1), video
+compression relies heavily on temporal redundancy, therefore updating just one
+or a few frames can lead to significant errors accumulating over time; 2), NVC
+frameworks are generally more complex, with many large components that are not
+easy to update quickly during encoding. To address the previously mentioned
+challenges, we have developed a content-adaptive NVC technique called
+Group-aware Parameter-Efficient Updating (GPU). Initially, to minimize error
+accumulation, we adopt a group-aware approach for updating encoder parameters.
+This involves adopting a patch-based Group of Pictures (GoP) training strategy
+to segment a video into patch-based GoPs, which will be updated to facilitate a
+globally optimized domain-transferable solution. Subsequently, we introduce a
+parameter-efficient delta-tuning strategy, which is achieved by integrating
+several light-weight adapters into each coding component of the encoding
+process by both serial and parallel configuration. Such architecture-agnostic
+modules stimulate the components with large parameters, thereby reducing both
+the update cost and the encoding time. We incorporate our GPU into the latest
+NVC framework and conduct comprehensive experiments, whose results showcase
+outstanding video compression efficiency across four video benchmarks and
+adaptability on one medical image benchmark.
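+
+ As a hedged sketch of the delta-tuning idea described above (light-weight
+adapters attached to a frozen coding component in both serial and parallel
+configurations, so that only the adapters are updated per video), one possible
+form is shown below; the module shapes, placement and bottleneck size are
+assumptions, not the paper's implementation.
+
+    import torch
+    import torch.nn as nn
+
+    class Adapter(nn.Module):
+        # Bottleneck adapter on feature maps; the zero-initialized
+        # up-projection makes it start as an identity-like residual.
+        def __init__(self, channels: int, hidden: int = 8):
+            super().__init__()
+            self.down = nn.Conv2d(channels, hidden, kernel_size=1)
+            self.up = nn.Conv2d(hidden, channels, kernel_size=1)
+            nn.init.zeros_(self.up.weight)
+            nn.init.zeros_(self.up.bias)
+
+        def forward(self, x: torch.Tensor) -> torch.Tensor:
+            return self.up(torch.relu(self.down(x)))
+
+    class AdaptedBlock(nn.Module):
+        # Frozen base component with one parallel and one serial adapter;
+        # assumes the base maps (B, C, H, W) features to the same shape.
+        def __init__(self, base: nn.Module, channels: int):
+            super().__init__()
+            self.base = base
+            for p in self.base.parameters():
+                p.requires_grad = False  # only adapters are content-adapted
+            self.parallel = Adapter(channels)
+            self.serial = Adapter(channels)
+
+        def forward(self, x: torch.Tensor) -> torch.Tensor:
+            y = self.base(x) + self.parallel(x)  # parallel branch
+            return y + self.serial(y)            # serial refinement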
+
+
+
+
+ + ☆ A General Model for Detecting Learner Engagement: Implementation and Evaluation +
+ +
+ Considering learner engagement has a mutual benefit for both learners and
+instructors. Instructors can help learners increase their attention,
+involvement, motivation, and interest. On the other hand, instructors can
+improve their instructional performance by evaluating the cumulative results of
+all learners and upgrading their training programs. This paper proposes a
+general, lightweight model for selecting and processing features to detect
+learners' engagement levels while preserving the sequential temporal
+relationship over time. During training and testing, we analyzed the videos
+from the publicly available DAiSEE dataset to capture the dynamic essence of
+learner engagement. We have also proposed an adaptation policy to find new
+labels that utilize the affective states of this dataset related to education,
+thereby improving the models' judgment. The suggested model achieves an
+accuracy of 68.57% in a specific implementation and outperforms the studied
+state-of-the-art models detecting learners' engagement levels.
+
+
+
+ comment: 13 pages, 2 Postscript figures
+
+
+
+ + ☆ Vidu: a Highly Consistent, Dynamic and Skilled Text-to-Video Generator with Diffusion Models +
+
+
+
+
+
+
+
+ Fan Bao, Chendong Xiang, Gang Yue, Guande He, Hongzhou Zhu, Kaiwen Zheng, Min Zhao, Shilong Liu, Yaole Wang, Jun Zhu
+
+
+ We introduce Vidu, a high-performance text-to-video generator that is capable
+of producing 1080p videos up to 16 seconds in a single generation. Vidu is a
+diffusion model with U-ViT as its backbone, which unlocks the scalability and
+the capability for handling long videos. Vidu exhibits strong coherence and
+dynamism, and is capable of generating both realistic and imaginative videos,
+as well as understanding some professional photography techniques, on par with
+Sora -- the most powerful reported text-to-video generator. Finally, we perform
+initial experiments on other controllable video generation, including
+canny-to-video generation, video prediction and subject-driven generation,
+which demonstrate promising results.
+
+
+
+ comment: Project page at https://www.shengshu-ai.com/vidu
+
+
+
+ + ☆ Breast Histopathology Image Retrieval by Attention-based Adversarially Regularized Variational Graph Autoencoder with Contrastive Learning-Based Feature Extraction +
+
+
+
+
+
+
+
+ Nematollah Saeidi, Hossein Karshenas, Bijan Shoushtarian, Sepideh Hatamikia, Ramona Woitek, Amirreza Mahbod
+
+
+ Breast cancer is a significant global health concern, particularly for women.
+Early detection and appropriate treatment are crucial in mitigating its impact,
+with histopathology examinations playing a vital role in swift diagnosis.
+However, these examinations often require a substantial workforce and
+experienced medical experts for proper recognition and cancer grading.
+Automated image retrieval systems have the potential to assist pathologists in
+identifying cancerous tissues, thereby accelerating the diagnostic process.
+Nevertheless, due to considerable variability among the tissue and cell
+patterns in histological images, proposing an accurate image retrieval model is
+very challenging.
+ This work introduces a novel attention-based adversarially regularized
+variational graph autoencoder model for breast histological image retrieval.
+Additionally, we incorporated cluster-guided contrastive learning as the graph
+feature extractor to boost the retrieval performance. We evaluated the proposed
+model's performance on two publicly available datasets of breast cancer
+histological images and achieved superior or very competitive retrieval
+performance, with average mAP scores of 96.5% for the BreakHis dataset and
+94.7% for the BACH dataset, and mVP scores of 91.9% and 91.3%, respectively.
+ Our proposed retrieval model has the potential to be used in clinical
+settings to enhance diagnostic performance and ultimately benefit patients.
+
+
+
+ comment: 31 pages
+
+
+
+ + ☆ Effective and Robust Adversarial Training against Data and Label Corruptions +
+ +
+ Corruptions due to data perturbations and label noise are prevalent in the
+datasets from unreliable sources, which poses significant threats to model
+training. Despite existing efforts in developing robust models, current
+learning methods commonly overlook the possible co-existence of both
+corruptions, limiting the effectiveness and practicability of the model. In
+this paper, we develop an Effective and Robust Adversarial Training (ERAT)
+framework to simultaneously handle two types of corruption (i.e., data and
+label) without prior knowledge of their specifics. We propose a hybrid
+adversarial training surrounding multiple potential adversarial perturbations,
+alongside semi-supervised learning based on class-rebalancing sample
+selection to enhance the resilience of the model for dual corruption. On the
+one hand, in the proposed adversarial training, the perturbation generation
+module learns multiple surrogate malicious data perturbations by taking a DNN
+model as the victim, while the model is trained to maintain semantic
+consistency between the original data and the hybrid perturbed data. It is
+expected to enable the model to cope with unpredictable perturbations in
+real-world data corruption. On the other hand, a class-rebalancing data
+selection strategy is designed to fairly differentiate clean labels from noisy
+labels. Semi-supervised learning is performed accordingly by discarding noisy
+labels. Extensive experiments demonstrate the superiority of the proposed ERAT
+framework.
+
+
+
+ comment: 12 pages, 8 figures
+
+
+
+ + ☆ Artificial Intelligence-powered fossil shark tooth identification: Unleashing the potential of Convolutional Neural Networks +
+
+
+
+
+
+
+
+ Andrea Barucci, Giulia Ciacci, Pietro Liò, Tiago Azevedo, Andrea Di Cencio, Marco Merella, Giovanni Bianucci, Giulia Bosio, Simone Casati, Alberto Collareta
+
+
+ All fields of knowledge are being impacted by Artificial Intelligence. In
+particular, the Deep Learning paradigm enables the development of data analysis
+tools that support subject matter experts in a variety of sectors, from physics
+up to the recognition of ancient languages. Palaeontology is now observing this
+trend as well. This study explores the capability of Convolutional Neural
+Networks (CNNs), a particular class of Deep Learning algorithms specifically
+crafted for computer vision tasks, to classify images of isolated fossil shark
+teeth gathered from online datasets as well as from the authors' experience
+on Peruvian Miocene and Italian Pliocene fossil assemblages. The shark taxa
+that are included in the final, composite dataset (which consists of more than
+one thousand images) are representative of both extinct and extant genera,
+namely, Carcharhinus, Carcharias, Carcharocles, Chlamydoselachus,
+Cosmopolitodus, Galeocerdo, Hemipristis, Notorynchus, Prionace and Squatina. We
+developed a CNN, named SharkNet-X, specifically tailored on our recognition
+task, reaching a 5-fold cross validated mean accuracy of 0.85 to identify
+images containing a single shark tooth. Furthermore, we produced a
+visualization of the features extracted from images using the last dense layer
+of the CNN, obtained by applying the dimensionality-reduction technique t-SNE.
+In addition, in order to understand and explain the behaviour of the CNN while
+giving a paleontological point of view on the results, we introduced the
+explainability method SHAP. To the best of our knowledge, this is the first
+instance in which this method is applied to the field of palaeontology. The
+main goal of this work is to showcase how Deep Learning techniques can aid in
+identifying isolated fossil shark teeth, paving the way for developing new
+information tools for automating the recognition and classification of fossils.
+
+
+
+ comment: 40 pages, 8 figures
+
+
+
+ + ☆ Topicwise Separable Sentence Retrieval for Medical Report Generation +
+ +
+ Automated radiology reporting holds immense clinical potential in alleviating
+the burdensome workload of radiologists and mitigating diagnostic bias.
+Recently, retrieval-based report generation methods have garnered increasing
+attention due to their inherent advantages in terms of the quality and
+consistency of generated reports. However, due to the long-tail distribution of
+the training data, these models tend to learn frequently occurring sentences
+and topics, overlooking the rare topics. Regrettably, in many cases, the
+descriptions of rare topics often indicate critical findings that should be
+mentioned in the report. To address this problem, we introduce a Topicwise
+Separable Sentence Retrieval (Teaser) for medical report generation. To ensure
+comprehensive learning of both common and rare topics, we categorize queries
+into common and rare types to learn differentiated topics, and then propose
+Topic Contrastive Loss to effectively align topics and queries in the latent
+space. Moreover, we integrate an Abstractor module following the extraction of
+visual features, which aids the topic decoder in gaining a deeper understanding
+of the visual observational intent. Experiments on the MIMIC-CXR and IU X-ray
+datasets demonstrate that Teaser surpasses state-of-the-art models, while also
+validating its capability to effectively represent rare topics and establish
+more dependable correspondences between queries and topics.
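+
+ As a generic stand-in for the topic-query alignment objective mentioned above,
+a symmetric InfoNCE-style contrastive loss over matched (query, topic) pairs
+could be written as follows; the actual Topic Contrastive Loss, including how
+common and rare topics are treated, may differ from this sketch.
+
+    import torch
+    import torch.nn.functional as F
+
+    def topic_contrastive_loss(query_emb: torch.Tensor,
+                               topic_emb: torch.Tensor,
+                               temperature: float = 0.07) -> torch.Tensor:
+        # query_emb, topic_emb: (B, D); row i of each forms a positive pair.
+        q = F.normalize(query_emb, dim=-1)
+        t = F.normalize(topic_emb, dim=-1)
+        logits = q @ t.t() / temperature
+        targets = torch.arange(q.size(0), device=q.device)
+        return 0.5 * (F.cross_entropy(logits, targets) +
+                      F.cross_entropy(logits.t(), targets))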
+
+
+
+
+ + ☆ D-TrAttUnet: Toward Hybrid CNN-Transformer Architecture for Generic and Subtle Segmentation in Medical Images +
+ +
+ Over the past two decades, machine analysis of medical imaging has advanced
+rapidly, opening up significant potential for several important medical
+applications. As complicated diseases increase and the number of cases rises,
+the role of machine-based imaging analysis has become indispensable. It serves
+as both a tool and an assistant to medical experts, providing valuable insights
+and guidance. A particularly challenging task in this area is lesion
+segmentation, a task that is challenging even for experienced radiologists. The
+complexity of this task highlights the urgent need for robust machine learning
+approaches to support medical staff. In response, we present our novel
+solution: the D-TrAttUnet architecture. This framework is based on the
+observation that different diseases often target specific organs. Our
+architecture includes an encoder-decoder structure with a composite
+Transformer-CNN encoder and dual decoders. The encoder includes two paths: the
+Transformer path and the Encoders Fusion Module path. The Dual-Decoder
+configuration uses two identical decoders, each with attention gates. This
+allows the model to simultaneously segment lesions and organs and integrate
+their segmentation losses.
+ To validate our approach, we performed evaluations on the Covid-19 and Bone
+Metastasis segmentation tasks. We also investigated the adaptability of the
+model by testing it without the second decoder in the segmentation of glands
+and nuclei. The results confirmed the superiority of our approach, especially
+in Covid-19 infections and the segmentation of bone metastases. In addition,
+the hybrid encoder showed exceptional performance in the segmentation of glands
+and nuclei, solidifying its role in modern medical image analysis.
+
+
+
+ comment: arXiv admin note: text overlap with arXiv:2303.15576
+
+
+
+ + ☆ Bridging the Synthetic-to-Authentic Gap: Distortion-Guided Unsupervised Domain Adaptation for Blind Image Quality Assessment CVPR2024 +
+ +
+ The annotation of blind image quality assessment (BIQA) is labor-intensive
+and time-consuming, especially for authentic images. Training on synthetic data
+is expected to be beneficial, but synthetically trained models often suffer
+from poor generalization in real domains due to domain gaps. In this work, we
+make a key observation that introducing more distortion types in the synthetic
+dataset may not improve, and may even harm, generalization to authentic image
+quality assessment. To solve this challenge, we propose distortion-guided
+unsupervised domain adaptation for BIQA (DGQA), a novel framework that
+leverages adaptive multi-domain selection via prior knowledge from distortion
+to match the data distribution between the source domains and the target
+domain, thereby reducing negative transfer from the outlier source domains.
+Extensive experiments on two cross-domain settings (synthetic distortion to
+authentic distortion and synthetic distortion to algorithmic distortion) have
+demonstrated the effectiveness of our proposed DGQA. Besides, DGQA is
+orthogonal to existing model-based BIQA methods, and can be used in combination
+with such models to improve performance with less training data.
+
+
+
+ comment: Accepted by CVPR2024
+
+
+
+ + ☆ Sign2GPT: Leveraging Large Language Models for Gloss-Free Sign Language Translation ICLR2024 +
+ +
+ Automatic Sign Language Translation requires the integration of both computer
+vision and natural language processing to effectively bridge the communication
+gap between sign and spoken languages. However, the deficiency in large-scale
+training data to support sign language translation means we need to leverage
+resources from spoken language. We introduce Sign2GPT, a novel framework for
+sign language translation that utilizes large-scale pretrained vision and
+language models via lightweight adapters for gloss-free sign language
+translation. The lightweight adapters are crucial for sign language
+translation, due to the constraints imposed by limited dataset sizes and the
+computational requirements when training with long sign videos. We also propose
+a novel pretraining strategy that directs our encoder to learn sign
+representations from automatically extracted pseudo-glosses without requiring
+gloss order information or annotations. We evaluate our approach on two public
+benchmark sign language translation datasets, namely RWTH-PHOENIX-Weather 2014T
+and CSL-Daily, and improve on state-of-the-art gloss-free translation
+performance by a significant margin.
+
+
+
+ comment: Accepted at ICLR2024
+
+
+
+ + ☆ Exposing AI-generated Videos: A Benchmark Dataset and a Local-and-Global + Temporal Defect Based Detection Method +
+ +
+ Generative models have made significant advances in the creation of realistic
+videos, which raises security concerns. However, this emerging risk has not
+been adequately addressed due to the absence of a benchmark dataset for
+AI-generated videos. In this paper, we first construct a video dataset using
+advanced diffusion-based video generation algorithms with various semantic
+contents. In addition, typical lossy video operations that occur during network
+transmission are applied to generate degraded samples. Then, by analyzing local and global
+temporal defects of current AI-generated videos, a novel detection framework by
+adaptively learning local motion information and global appearance variation is
+constructed to expose fake videos. Finally, experiments are conducted to
+evaluate the generalization and robustness of different spatial and temporal
+domain detection methods, where the results can serve as the baseline and
+demonstrate the research challenge for future studies.
+
+
+
+
+ + ☆ ELiTe: Efficient Image-to-LiDAR Knowledge Transfer for Semantic + Segmentation ICME 2024 +
+ +
+ Cross-modal knowledge transfer enhances point cloud representation learning
+in LiDAR semantic segmentation. Despite its potential, the \textit{weak teacher
+challenge} arises due to repetitive and non-diverse car camera images and
+sparse, inaccurate ground truth labels. To address this, we propose the
+Efficient Image-to-LiDAR Knowledge Transfer (ELiTe) paradigm. ELiTe introduces
+Patch-to-Point Multi-Stage Knowledge Distillation, transferring comprehensive
+knowledge from the Vision Foundation Model (VFM), extensively trained on
+diverse open-world images. This enables effective knowledge transfer to a
+lightweight student model across modalities. ELiTe employs Parameter-Efficient
+Fine-Tuning to strengthen the VFM teacher and expedite large-scale model
+training with minimal costs. Additionally, we introduce the Segment Anything
+Model based Pseudo-Label Generation approach to enhance low-quality image
+labels, facilitating robust semantic representations. Efficient knowledge
+transfer in ELiTe yields state-of-the-art results on the SemanticKITTI
+benchmark, outperforming real-time inference models. Our approach achieves this
+with significantly fewer parameters, confirming its effectiveness and
+efficiency.
+
+
+
+ comment: 9 pages, 6 figures, ICME 2024 oral
+
+
+
+ + ☆ COM3D: Leveraging Cross-View Correspondence and Cross-Modal Mining for + 3D Retrieval ICME 2024 +
+ +
+ In this paper, we investigate an open research task of cross-modal retrieval
+between 3D shapes and textual descriptions. Previous approaches mainly rely on
+point cloud encoders for feature extraction, which may ignore key inherent
+features of 3D shapes, including depth, spatial hierarchy, geometric
+continuity, etc. To address this issue, we propose COM3D, making the first
+attempt to exploit the cross-view correspondence and cross-modal mining to
+enhance the retrieval performance. Notably, we augment the 3D features through
+a scene representation transformer, to generate cross-view correspondence
+features of 3D shapes, which enrich the inherent features and enhance their
+compatibility with text matching. Furthermore, we propose to optimize the
+cross-modal matching process based on the semi-hard negative example mining
+method, in an attempt to improve the learning efficiency. Extensive
+quantitative and qualitative experiments demonstrate the superiority of our
+proposed COM3D, achieving state-of-the-art results on the Text2Shape dataset.
+
+
+
+ comment: Accepted by ICME 2024 oral
+
+
+
+ + ☆ ESP: Extro-Spective Prediction for Long-term Behavior Reasoning in + Emergency Scenarios ICRA 2024 +
+
+
+
+
+
+
+
+ Dingrui Wang, Zheyuan Lai, Yuda Li, Yi Wu, Yuexin Ma, Johannes Betz, Ruigang Yang, Wei Li
+
+
+ Emergency-scene safety is the key milestone for fully autonomous driving, and
+reliable on-time prediction is essential to maintain safety in emergency
+scenarios. However, these emergency scenarios are long-tailed and hard to
+collect, which restricts the system from obtaining reliable predictions. In
+this paper, we build a new dataset that targets long-term prediction for
+emergency events whose historical state variations are inconspicuous, which we
+name the Extro-Spective Prediction (ESP) problem. Based on the proposed dataset, a
+flexible feature encoder for ESP is introduced to various prediction methods as
+a seamless plug-in, and its consistent performance improvement underscores its
+efficacy. Furthermore, a new metric named clamped temporal error (CTE) is
+proposed to give a more comprehensive evaluation of prediction performance,
+especially in time-sensitive emergency events lasting only sub-seconds.
+Interestingly, since our ESP features can naturally be described in
+human-readable language, integrating them into ChatGPT also shows great
+potential. The
+ESP-dataset and all benchmarks are released at
+https://dingrui-wang.github.io/ESP-Dataset/.
+
+
+
+ comment: Accepted by ICRA 2024 as Oral Presentation
+
+
+
+ + ☆ Unmasking Illusions: Understanding Human Perception of Audiovisual + Deepfakes +
+ +
+ The emergence of contemporary deepfakes has attracted significant attention
+in machine learning research, as artificial intelligence (AI) generated
+synthetic media increases the incidence of misinterpretation and is difficult
+to distinguish from genuine content. Currently, machine learning techniques
+have been extensively studied for automatically detecting deepfakes. However,
+human perception has been less explored. Malicious deepfakes could ultimately
+cause public and social problems. Can we humans correctly perceive the
+authenticity of the content of the videos we watch? The answer is obviously
+uncertain; therefore, this paper aims to evaluate the human ability to discern
+deepfake videos through a subjective study. We present our findings by
+comparing human observers to five state-of-the-art audiovisual deepfake
+detection models. To this end, we used gamification concepts to provide 110
+participants (55 native English speakers and 55 non-native English speakers)
+with a web-based platform where they could access a series of 40 videos (20 real
+and 20 fake) to determine their authenticity. Each participant performed the
+experiment twice with the same 40 videos in different random orders. The videos
+are manually selected from the FakeAVCeleb dataset. We found that all AI models
+performed better than humans when evaluated on the same 40 videos. The study
+also reveals that while deception is not impossible, humans tend to
+overestimate their detection capabilities. Our experimental results may help
+benchmark human versus machine performance, advance forensics analysis, and
+enable adaptive countermeasures.
+
+
+
+
+ + ☆ DCNN: Dual Cross-current Neural Networks Realized Using An Interactive + Deep Learning Discriminator for Fine-grained Objects +
+ +
+ Accurate classification of fine-grained images remains a challenge in
+backbones based on convolutional operations or self-attention mechanisms. This
+study proposes novel dual-current neural networks (DCNN), which combine the
+advantages of convolutional operations and self-attention mechanisms to improve
+the accuracy of fine-grained image classification. The main novel design
+features for constructing a weakly supervised learning backbone model DCNN
+include (a) extracting heterogeneous data, (b) keeping the feature map
+resolution unchanged, (c) expanding the receptive field, and (d) fusing global
+representations and local features. Experimental results demonstrate that
+using DCNN as the backbone network for classifying certain fine-grained
+benchmark datasets yields accuracy improvements of 13.5--19.5% and 2.2--12.9%
+over other advanced convolution-based and attention-based fine-grained
+backbones, respectively.
+
+
+
+
+ + ☆ IMU-Aided Event-based Stereo Visual Odometry ICRA +
+ +
+ Direct methods for event-based visual odometry solve the mapping and camera
+pose tracking sub-problems by establishing implicit data association in a way
+that exploits the generative model of events. The main bottlenecks faced by
+state-of-the-art work in this field include the high computational complexity
+of mapping and the limited accuracy of tracking. In this paper, we improve our
+previous direct pipeline \textit{Event-based Stereo Visual Odometry} in terms
+of accuracy and efficiency. To speed up the mapping operation, we propose an
+efficient strategy of edge-pixel sampling according to the local dynamics of
+events. The mapping performance in terms of completeness and local smoothness
+is also improved by combining the temporal stereo results and the static stereo
+results. To circumvent the degeneracy issue of camera pose tracking in
+recovering the yaw component of general 6-DoF motion, we introduce as a prior
+the gyroscope measurements via pre-integration. Experiments on publicly
+available datasets justify our improvement. We release our pipeline as
+open-source software for future research in this field.
+
+
+
+ comment: 10 pages, 7 figures, ICRA
+
+
+
+ + ☆ DMOFC: Discrimination Metric-Optimized Feature Compression +
+ +
+ Feature compression, as an important branch of video coding for machines
+(VCM), has attracted significant attention and exploration. However, the
+existing methods mainly focus on intra-feature similarity, such as the Mean
+Squared Error (MSE) between the reconstructed and original features, while
+neglecting the importance of inter-feature relationships. In this paper, we
+analyze the inter-feature relationships, focusing on feature discriminability
+in machine vision and underscoring its significance in feature compression. To
+maintain the feature discriminability of reconstructed features, we introduce a
+discrimination metric for feature compression. The discrimination metric is
+designed to ensure that the distance between features of the same category is
+smaller than the distance between features of different categories.
+Furthermore, we explore the relationship between the discrimination metric and
+the discriminability of the original features. Experimental results confirm
+the effectiveness of the proposed discrimination metric and reveal that there
+exists a trade-off between the discrimination metric and the discriminability
+of the original features.
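+ As an illustration of the idea, the sketch below implements one possible
+discrimination term: a hinge that pushes every same-category feature distance
+below every different-category distance by a margin. It is a minimal example of
+the constraint described above, not the exact metric used in the paper.
+
+```python
+import torch
+
+def discrimination_loss(feats, labels, margin=1.0):
+    """Encourage same-class feature distances to be smaller than
+    different-class distances by at least `margin` (hinge over all pairs)."""
+    d = torch.cdist(feats, feats)                       # (N, N) pairwise distances
+    same = labels.unsqueeze(0) == labels.unsqueeze(1)   # (N, N) same-class mask
+    diag = torch.eye(len(labels), dtype=torch.bool)
+    pos = d[same & ~diag]                               # intra-class distances
+    neg = d[~same]                                      # inter-class distances
+    return torch.relu(pos.unsqueeze(1) - neg.unsqueeze(0) + margin).mean()
+
+feats = torch.randn(8, 128, requires_grad=True)         # reconstructed features
+labels = torch.tensor([0, 0, 1, 1, 2, 2, 0, 1])
+discrimination_loss(feats, labels).backward()
+```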
+
+
+
+
+ + ☆ Space-time Reinforcement Network for Video Object Segmentation ICME 2024 +
+ +
+ Recent video object segmentation (VOS) networks typically use memory-based
+methods: for each query frame, the mask is predicted by space-time matching to
+memory frames. Although these methods achieve superior performance, they suffer
+from two issues: 1) Challenging data can destroy the space-time coherence
+between adjacent video frames. 2) Pixel-level matching will lead to undesired
+mismatching caused by the noises or distractors. To address the aforementioned
+issues, we first propose to generate an auxiliary frame between adjacent
+frames, serving as an implicit short-temporal reference for the query one.
+Next, we learn a prototype for each video object and prototype-level matching
+can be implemented between the query and memory. Experiments demonstrate that
+our network outperforms state-of-the-art methods on DAVIS 2017, achieving a
+J&F score of 86.4%, and attains a competitive 85.0% on YouTube-VOS 2018. In
+addition, our network runs at a high inference speed of over 32 FPS.
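+ The prototype-level matching described above can be illustrated with masked
+average pooling and cosine similarity, as in the hedged sketch below; the
+auxiliary-frame generation and the full network are omitted, and all shapes and
+names are illustrative.
+
+```python
+import torch
+import torch.nn.functional as F
+
+def object_prototype(feat, mask):
+    """Masked average pooling: feat (C, H, W), mask (H, W) in {0,1} -> (C,)."""
+    mask = mask.unsqueeze(0)                            # (1, H, W)
+    return (feat * mask).sum(dim=(1, 2)) / mask.sum().clamp(min=1e-6)
+
+def prototype_matching(query_feat, prototypes):
+    """Cosine similarity of every query pixel to each object prototype.
+    query_feat: (C, H, W), prototypes: (K, C) -> (K, H, W) similarity maps."""
+    q = F.normalize(query_feat.flatten(1), dim=0)       # (C, H*W)
+    p = F.normalize(prototypes, dim=1)                  # (K, C)
+    return (p @ q).view(len(prototypes), *query_feat.shape[1:])
+
+C, H, W = 64, 30, 30
+memory_feat = torch.randn(C, H, W)
+memory_mask = (torch.rand(H, W) > 0.5).float()
+query_feat = torch.randn(C, H, W)
+
+protos = torch.stack([object_prototype(memory_feat, memory_mask)])  # one object
+sim_maps = prototype_matching(query_feat, protos)                   # (1, H, W)
+pred = sim_maps.argmax(dim=0)                                       # per-pixel id
+```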
+
+
+
+ comment: Accepted by ICME 2024. 6 pages, 10 figures
+
+
+
+ + ☆ Feature Map Convergence Evaluation for Functional Module +
+ +
+ Autonomous driving perception models are typically composed of multiple
+functional modules that interact through complex relationships to accomplish
+environment understanding. However, perception models are predominantly
+optimized as a black box through end-to-end training, lacking independent
+evaluation of functional modules, which poses difficulties for interpretability
+and optimization. As pioneering work on this issue, we propose an evaluation
+method based on feature map analysis to gauge the convergence of a model,
+thereby assessing the training maturity of its functional modules. We construct
+a quantitative metric named the Feature Map Convergence Score (FMCS) and
+develop the Feature Map Convergence Evaluation Network (FMCE-Net) to measure
+and predict the convergence degree of models, respectively. FMCE-Net achieves remarkable
+predictive accuracy for FMCS across multiple image classification experiments,
+validating the efficacy and robustness of the introduced approach. To the best
+of our knowledge, this is the first independent evaluation method for
+functional modules, offering a new paradigm for the training assessment of
+perception models.
+
+
+
+
+ + ☆ Lumbar Spine Tumor Segmentation and Localization in T2 MRI Images Using + AI +
+
+
+
+
+
+
+
+ Rikathi Pal, Sudeshna Mondal, Aditi Gupta, Priya Saha, Somoballi Ghoshal, Amlan Chakrabarti, Susmita Sur-Kolay
+
+
+ In medical imaging, segmentation and localization of spinal tumors in
+three-dimensional (3D) space pose significant computational challenges,
+primarily stemming from limited data availability. In response, this study
+introduces a novel data augmentation technique, aimed at automating spine tumor
+segmentation and localization through AI approaches. Leveraging a fusion of
+fuzzy c-means clustering and Random Forest algorithms, the proposed method
+achieves successful spine tumor segmentation based on predefined masks
+initially delineated by domain experts in medical imaging. Subsequently, a
+Convolutional Neural Network (CNN) architecture is employed for tumor
+classification. Moreover, 3D vertebral segmentation and labeling techniques are
+used to help pinpoint the exact location of the tumors in the lumbar spine.
+Results indicate remarkable performance, with the proposed approach achieving
+99% accuracy for tumor segmentation, 98% accuracy for tumor classification, and
+99% accuracy for tumor localization. These metrics surpass those of existing
+state-of-the-art techniques, as evidenced by superior Dice Score, Class
+Accuracy, and Intersection over Union (IoU). This innovative methodology holds
+promise for enhancing the diagnostic
+capabilities in detecting and characterizing spinal tumors, thereby
+facilitating more effective clinical decision-making.
+
+
+
+ comment: 9 pages, 12 figures
+
+
+
+ + ☆ Structured Click Control in Transformer-based Interactive Segmentation NeurIPS 2024 +
+ +
+ Click-point-based interactive segmentation has received widespread attention
+due to its efficiency. However, it's hard for existing algorithms to obtain
+precise and robust responses after multiple clicks. In this case, the
+segmentation results tend to have little change or are even worse than before.
+To improve the robustness of the response, we propose a structured click intent
+model based on graph neural networks, which adaptively obtains graph nodes via
+the global similarity of user-clicked Transformer tokens. Then the graph nodes
+will be aggregated to obtain structured interaction features. Finally, the dual
+cross-attention will be used to inject structured interaction features into
+vision Transformer features, thereby enhancing the control of clicks over
+segmentation results. Extensive experiments demonstrate that the proposed
+algorithm can serve as a general structure for improving Transformer-based
+interactive segmentation performance. The code and data will be released at
+https://github.com/hahamyt/scc.
+
+
+
+ comment: 10 pages, 6 figures, submitted to NeurIPS 2024
+
+
+
+ + ☆ SEED-Data-Edit Technical Report: A Hybrid Dataset for Instructional + Image Editing +
+ +
+ In this technical report, we introduce SEED-Data-Edit: a unique hybrid
+dataset for instruction-guided image editing, which aims to facilitate image
+manipulation using open-form language. SEED-Data-Edit is composed of three
+distinct types of data: (1) High-quality editing data produced by an automated
+pipeline, ensuring a substantial volume of diverse image editing pairs. (2)
+Real-world scenario data collected from the internet, which captures the
+intricacies of user intentions for promoting the practical application of image
+editing in the real world. (3) High-precision multi-turn editing data annotated
+by humans, which involves multiple rounds of edits for simulating iterative
+editing processes. The combination of these diverse data sources makes
+SEED-Data-Edit a comprehensive and versatile dataset for training
+language-guided image editing models. We fine-tune a pretrained Multimodal
+Large Language Model (MLLM) that unifies comprehension and generation with
+SEED-Data-Edit. The instruction-tuned model demonstrates promising results,
+indicating the potential and effectiveness of SEED-Data-Edit in advancing the
+field of instructional image editing. The datasets are released at
+https://huggingface.co/datasets/AILab-CVC/SEED-Data-Edit.
+
+
+
+ comment: Technical Report; Dataset released in
+ https://huggingface.co/datasets/AILab-CVC/SEED-Data-Edit
+
+
+
+ + ☆ Deep Event-based Object Detection in Autonomous Driving: A Survey +
+ +
+ Object detection plays a critical role in autonomous driving, where
+accurately and efficiently detecting objects in fast-moving scenes is crucial.
+Traditional frame-based cameras face challenges in balancing latency and
+bandwidth, necessitating innovative solutions. Event cameras have
+emerged as promising sensors for autonomous driving due to their low latency,
+high dynamic range, and low power consumption. However, effectively utilizing
+the asynchronous and sparse event data presents challenges, particularly in
+maintaining low latency and lightweight architectures for object detection.
+This paper provides an overview of object detection using event data in
+autonomous driving, showcasing the competitive benefits of event cameras.
+
+
+
+
+ + ☆ Predicting Lung Disease Severity via Image-Based AQI Analysis using Deep + Learning Techniques +
+ +
+ Air pollution is a significant health concern worldwide, contributing to
+various respiratory diseases. Advances in air quality mapping, driven by the
+emergence of smart cities and the proliferation of Internet-of-Things sensor
+devices, have led to an increase in available data, fueling momentum in air
+pollution forecasting. The objective of this study is to devise an integrated
+approach for predicting air quality from image data and subsequently assessing
+lung disease severity based on the Air Quality Index (AQI), refining existing
+techniques to improve accuracy for both tasks. In addition to PM2.5 levels, the
+study forecasts AQI and further atmospheric pollutants, namely PM10, O3, CO,
+SO2 and NO2, and compares the proposed approach with existing methods to show
+its effectiveness. The approach uses the VGG16 model for feature extraction
+from images and a neural network for predicting AQI; Support Vector Classifier
+(SVC) and K-Nearest Neighbors (KNN) algorithms are used to predict lung disease
+severity. The neural network model for predicting AQI achieved a training
+accuracy of 88.54% and a testing accuracy of 87.44%, while the KNN model for
+predicting lung disease severity achieved a training accuracy of 98.4% and a
+testing accuracy of 97.5%. In conclusion, the integrated approach presented in
+this study forecasts air quality and evaluates lung disease severity, achieving
+high testing accuracies of 87.44% for AQI and 97.5% for lung disease severity
+using neural network, KNN, and SVC models. Future work involves applying
+transfer learning and advanced deep learning modules to enhance prediction
+capabilities. While the current study focuses on India, the objective is to
+expand its scope to encompass global coverage.
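+ A minimal sketch of the image-based AQI regression pipeline described above is
+given below: VGG16 acts as a frozen feature extractor and a small neural
+network regresses AQI and pollutant levels. The head size, output layout, and
+training details are assumptions for illustration only.
+
+```python
+import torch
+import torch.nn as nn
+from torchvision import models
+
+# Pretrained VGG16 as a frozen feature extractor (weights download on first use).
+vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
+feature_extractor = nn.Sequential(vgg.features, vgg.avgpool, nn.Flatten())
+for p in feature_extractor.parameters():
+    p.requires_grad = False
+
+# Small regression head predicting AQI plus six pollutants (assumed layout).
+regressor = nn.Sequential(
+    nn.Linear(512 * 7 * 7, 256), nn.ReLU(),
+    nn.Linear(256, 7),            # [AQI, PM2.5, PM10, O3, CO, SO2, NO2]
+)
+
+images = torch.randn(4, 3, 224, 224)            # scene/sky photographs
+targets = torch.rand(4, 7) * 300                # dummy pollutant readings
+optimizer = torch.optim.Adam(regressor.parameters(), lr=1e-3)
+
+preds = regressor(feature_extractor(images))
+loss = nn.functional.mse_loss(preds, targets)
+loss.backward()
+optimizer.step()
+```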
+
+
+
+ comment: 11 pages
+
+
+
+ + ☆ VMambaCC: A Visual State Space Model for Crowd Counting +
+ +
+ As a deep learning model, Visual Mamba (VMamba) has low computational
+complexity and a global receptive field, and has been successfully applied to
+image classification and detection. To extend its applications, we apply VMamba
+to crowd counting and propose a novel VMambaCC (VMamba Crowd Counting) model.
+Naturally, VMambaCC inherits the merits of VMamba, namely global modeling of
+images and low computational cost. Additionally, we design a Multi-head
+High-level Feature (MHF) attention mechanism for VMambaCC. MHF is a new
+attention mechanism that leverages high-level semantic features to augment
+low-level semantic features, thereby enhancing spatial feature representation
+with greater precision. Building upon MHF, we further present a High-level
+Semantic Supervised Feature Pyramid Network (HS2PFN) that progressively
+integrates and enhances high-level semantic information with low-level semantic
+information. Extensive experimental results on five public datasets validate
+the efficacy of our approach. For example, our method achieves a mean absolute
+error of 51.87 and a mean squared error of 81.3 on the ShangHaiTech\_PartA
+dataset. Our code is coming soon.
+
+
+
+
+ + ☆ Unified End-to-End V2X Cooperative Autonomous Driving +
+
+
+
+
+
+
+
+ Zhiwei Li, Bozhen Zhang, Lei Yang, Tianyu Shen, Nuo Xu, Ruosen Hao, Weiting Li, Tao Yan, Huaping Liu
+
+
+ V2X cooperation, through the integration of sensor data from both vehicles
+and infrastructure, is considered a pivotal approach to advancing autonomous
+driving technology. Current research primarily focuses on enhancing perception
+accuracy, often overlooking the systematic improvement of accident prediction
+accuracy through end-to-end learning, leading to insufficient attention to the
+safety issues of autonomous driving. To address this challenge, this paper
+introduces the UniE2EV2X framework, a V2X-integrated end-to-end autonomous
+driving system that consolidates key driving modules within a unified network.
+The framework employs a deformable attention-based data fusion strategy,
+effectively facilitating cooperation between vehicles and infrastructure. The
+main advantages include: 1) significantly enhancing agents' perception and
+motion prediction capabilities, thereby improving the accuracy of accident
+predictions; 2) ensuring high reliability in the data fusion process; 3)
+superior end-to-end perception compared to modular approaches. Furthermore, we
+implement the UniE2EV2X framework on the challenging DeepAccident, a simulation
+dataset designed for V2X cooperative driving.
+
+
+
+
+ + ☆ Joint Estimation of Identity Verification and Relative Pose for Partial + Fingerprints +
+ +
+ Currently, portable electronic devices are becoming more and more popular.
+For lightweight considerations, their fingerprint recognition modules usually
+use limited-size sensors. However, partial fingerprints have few matchable
+features, especially when there are differences in finger pressing posture or
+image quality, which makes partial fingerprint verification challenging. Most
+existing methods regard fingerprint position rectification and identity
+verification as independent tasks, ignoring the coupling relationship between
+them -- relative pose estimation typically relies on paired features as
+anchors, and authentication accuracy tends to improve with more precise pose
+alignment. Consequently, in this paper we propose a method that jointly
+estimates identity verification and relative pose for partial fingerprints,
+aiming to leverage their inherent correlation to improve each other. To achieve
+this, we propose a multi-task CNN (Convolutional Neural Network)-Transformer
+hybrid network, and design a pre-training task to enhance the feature
+extraction capability. Experiments on multiple public datasets (NIST SD14,
+FVC2002 DB1A & DB3A, FVC2004 DB1A & DB2A, FVC2006 DB1A) and an in-house dataset
+show that our method achieves state-of-the-art performance in both partial
+fingerprint verification and relative pose estimation, while being more
+efficient than previous methods.
+
+
+
+
+ + ☆ Simple Drop-in LoRA Conditioning on Attention Layers Will Improve Your + Diffusion Model +
+ +
+ Current state-of-the-art diffusion models employ U-Net architectures
+containing convolutional and (qkv) self-attention layers. The U-Net processes
+images while being conditioned on the time embedding input for each sampling
+step and the class or caption embedding input corresponding to the desired
+conditional generation. Such conditioning involves scale-and-shift operations
+to the convolutional layers but does not directly affect the attention layers.
+While these standard architectural choices are certainly effective, not
+conditioning the attention layers feels arbitrary and potentially suboptimal.
+In this work, we show that simply adding LoRA conditioning to the attention
+layers without changing or tuning the other parts of the U-Net architecture
+improves image generation quality. For example, a drop-in addition of LoRA
+conditioning to the EDM diffusion model yields FID scores of 1.91/1.75 for
+unconditional and class-conditional CIFAR-10 generation, improving upon the
+baseline of 1.97/1.79.
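+ A minimal sketch of the idea is shown below: a qkv-style projection whose
+low-rank (LoRA) update is scaled per sample by a signal derived from the
+time/class embedding. The rank, layer sizes, and the way the embedding gates
+the update are illustrative assumptions rather than the paper's exact design.
+
+```python
+import torch
+import torch.nn as nn
+
+class LoRAConditionedLinear(nn.Module):
+    """Attention projection whose LoRA update is modulated by a conditioning
+    embedding (e.g., diffusion timestep and/or class embedding)."""
+    def __init__(self, dim, cond_dim, rank=4):
+        super().__init__()
+        self.base = nn.Linear(dim, 3 * dim)          # ordinary qkv projection
+        self.lora_down = nn.Linear(dim, rank, bias=False)
+        self.lora_up = nn.Linear(rank, 3 * dim, bias=False)
+        nn.init.zeros_(self.lora_up.weight)          # start as the base layer
+        self.gate = nn.Linear(cond_dim, 1)           # per-sample scale from cond
+
+    def forward(self, x, cond):
+        # x: (B, N, dim) tokens, cond: (B, cond_dim) embedding
+        scale = self.gate(cond).unsqueeze(1)         # (B, 1, 1)
+        return self.base(x) + scale * self.lora_up(self.lora_down(x))
+
+layer = LoRAConditionedLinear(dim=64, cond_dim=128)
+tokens = torch.randn(2, 16, 64)
+cond = torch.randn(2, 128)
+qkv = layer(tokens, cond)                            # (2, 16, 192)
+```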
+
+
+
+
+ + ☆ IPFed: Identity protected federated learning for user authentication +
+ +
+ With the development of laws and regulations related to privacy preservation,
+it has become difficult to collect personal data to perform machine learning.
+In this context, federated learning, which is distributed learning without
+sharing personal data, has been proposed. In this paper, we focus on federated
+learning for user authentication. We show that it is difficult to achieve both
+privacy preservation and high accuracy with existing methods. To address these
+challenges, we propose IPFed, a privacy-preserving federated learning method
+that applies a random projection to the class embeddings. Furthermore, we prove
+that the learning of IPFed is equivalent to that of the state-of-the-art
+method. Experiments on face image datasets show that IPFed can protect the
+privacy of personal data
+while maintaining the accuracy of the state-of-the-art method.
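+ The core idea, projecting class embeddings with a random matrix before they
+leave the client, can be sketched as below. What exactly is shared with the
+server and the projection dimensions are assumptions for illustration.
+
+```python
+import torch
+
+torch.manual_seed(0)
+embed_dim, proj_dim, num_users = 128, 128, 5
+
+# Each client holds a class embedding (one vector per enrolled identity).
+class_embeddings = torch.randn(num_users, embed_dim)
+
+# A random projection, unknown to the server, applied before sharing.
+P = torch.randn(embed_dim, proj_dim) / proj_dim ** 0.5
+protected = class_embeddings @ P            # what leaves the client
+
+# Random projections approximately preserve pairwise geometry, so training
+# signals based on similarities between embeddings remain usable.
+orig_sim = torch.cosine_similarity(class_embeddings[0], class_embeddings[1], dim=0)
+proj_sim = torch.cosine_similarity(protected[0], protected[1], dim=0)
+print(float(orig_sim), float(proj_sim))
+```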
+
+
+
+
+ + ☆ Role of Sensing and Computer Vision in 6G Wireless Communications +
+
+
+
+
+
+
+
+ Seungnyun Kim, Jihoon Moon, Jinhong Kim, Yongjun Ahn, Donghoon Kim, Sunwoo Kim, Kyuhong Shim, Byonghyo Shim
+
+
+ Recently, we have been witnessing remarkable progress in, and widespread
+adoption of, sensing technologies in autonomous driving, robotics, and the
+metaverse.
+Considering the rapid advancement of computer vision (CV) technology to analyze
+the sensing information, we anticipate a proliferation of wireless applications
+exploiting the sensing and CV technologies in 6G. In this article, we provide a
+holistic overview of the sensing and CV-aided wireless communications (SVWC)
+framework for 6G. By analyzing the high-resolution sensing information through
+the powerful CV techniques, SVWC can quickly and accurately understand the
+wireless environments and then perform the wireless tasks. To demonstrate the
+efficacy of SVWC, we design the whole process of SVWC including the sensing
+dataset collection, DL model training, and execution of realistic wireless
+tasks. From the numerical evaluations on 6G communication scenarios, we show
+that SVWC achieves considerable performance gains over the conventional 5G
+systems in terms of positioning accuracy, data rate, and access latency.
+
+
+
+
+ + ♻ ☆ Amodal Optical Flow +
+
+
+
+
+
+
+
+ Maximilian Luz, Rohit Mohan, Ahmed Rida Sekkat, Oliver Sawade, Elmar Matthes, Thomas Brox, Abhinav Valada
+
+
+ Optical flow estimation is very challenging in situations with transparent or
+occluded objects. In this work, we address these challenges at the task level
+by introducing Amodal Optical Flow, which integrates optical flow with amodal
+perception. Instead of only representing the visible regions, we define amodal
+optical flow as a multi-layered pixel-level motion field that encompasses both
+visible and occluded regions of the scene. To facilitate research on this new
+task, we extend the AmodalSynthDrive dataset to include pixel-level labels for
+amodal optical flow estimation. We present several strong baselines, along with
+the Amodal Flow Quality metric to quantify the performance in an interpretable
+manner. Furthermore, we propose the novel AmodalFlowNet as an initial step
+toward addressing this task. AmodalFlowNet consists of a transformer-based
+cost-volume encoder paired with a recurrent transformer decoder which
+facilitates recurrent hierarchical feature propagation and amodal semantic
+grounding. We demonstrate the tractability of amodal optical flow in extensive
+experiments and show its utility for downstream tasks such as panoptic
+tracking. We make the dataset, code, and trained models publicly available at
+http://amodal-flow.cs.uni-freiburg.de.
+
+
+
+
+ + ♻ ☆ A dataset of over one thousand computed tomography scans of battery + cells +
+
+
+
+
+
+
+
+ Amariah Condon, Bailey Buscarino, Eric Moch, William J. Sehnert, Owen Miles, Patrick K. Herring, Peter M. Attia
+
+
+ Battery technology is increasingly important for global electrification
+efforts. However, batteries are highly sensitive to small manufacturing
+variations that can induce reliability or safety issues. An important
+technology for battery quality control is computed tomography (CT) scanning,
+which is widely used for non-destructive 3D inspection across a variety of
+clinical and industrial applications. Historically, however, the utility of CT
+scanning for high-volume manufacturing has been limited by its low throughput
+as well as the difficulty of handling its large file sizes. In this work, we
+present a dataset of over one thousand CT scans of as-produced commercially
+available batteries. The dataset spans various chemistries (lithium-ion and
+sodium-ion) as well as various battery form factors (cylindrical, pouch, and
+prismatic). We evaluate seven different battery types in total. The
+manufacturing variability and the presence of battery defects can be observed
+via this dataset. This dataset may be of interest to scientists and engineers
+working on battery technology, computer vision, or both.
+
+
+
+
+ + ♻ ☆ MonoPCC: Photometric-invariant Cycle Constraint for Monocular Depth + Estimation of Endoscopic Images +
+
+
+
+
+
+
+
+ Zhiwei Wang, Ying Zhou, Shiquan He, Ting Li, Fan Huang, Qiang Ding, Xinxia Feng, Mei Liu, Qiang Li
+
+
+ Photometric constraint is indispensable for self-supervised monocular depth
+estimation. It involves warping a source image onto a target view using the
+estimated depth and pose, and then minimizing the difference between the warped
+and
+target images. However, the endoscopic built-in light causes significant
+brightness fluctuations, and thus makes the photometric constraint unreliable.
+Previous efforts only mitigate this relying on extra models to calibrate image
+brightness. In this paper, we propose MonoPCC to address the brightness
+inconsistency radically by reshaping the photometric constraint into a cycle
+form. Instead of only warping the source image, MonoPCC constructs a closed
+loop consisting of two opposite forward-backward warping paths: from target to
+source and then back to target. Thus, the target image finally receives an
+image cycle-warped from itself, which naturally makes the constraint invariant
+to brightness changes. Moreover, MonoPCC transplants the source image's
+phase-frequency into the intermediate warped image to avoid structure loss, and
+also stabilizes the training via an exponential moving average (EMA) strategy
+to avoid frequent changes in the forward warping. Comprehensive and extensive
+experimental results on four endoscopic datasets demonstrate that our proposed
+MonoPCC shows great robustness to brightness inconsistency, and exceeds other
+state-of-the-art methods by reducing the absolute relative error by at least
+7.27%, 9.38%, 9.90% and 3.17%, respectively.
+
+
+
+ comment: 11 pages, 10 figures
+
+
+
+ + ♻ ☆ CLIP-KD: An Empirical Study of CLIP Model Distillation CVPR-2024 +
+
+
+
+
+
+
+
+ Chuanguang Yang, Zhulin An, Libo Huang, Junyu Bi, Xinqiang Yu, Han Yang, Boyu Diao, Yongjun Xu
+
+
+ Contrastive Language-Image Pre-training (CLIP) has become a promising
+language-supervised visual pre-training framework. This paper aims to distill
+small CLIP models supervised by a large teacher CLIP model. We propose several
+distillation strategies, including relation, feature, gradient and contrastive
+paradigms, to examine the effectiveness of CLIP-Knowledge Distillation (KD). We
+show that a simple feature mimicry with Mean Squared Error loss works
+surprisingly well. Moreover, interactive contrastive learning across teacher
+and student encoders is also effective in performance improvement. We explain
+that the success of CLIP-KD can be attributed to maximizing the feature
+similarity between teacher and student. The unified method is applied to
+distill several student models trained on CC3M+12M. CLIP-KD improves student
+CLIP models consistently over zero-shot ImageNet classification and cross-modal
+retrieval benchmarks. When using ViT-L/14 pretrained on Laion-400M as the
+teacher, CLIP-KD achieves 57.5\% and 55.4\% zero-shot top-1 ImageNet accuracy
+with ViT-B/16 and ResNet-50 students, surpassing the original CLIP without KD
+by 20.5\%
+and 20.1\% margins, respectively. Our code is released on
+https://github.com/winycg/CLIP-KD.
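+ The feature-mimicry finding is easy to reproduce in miniature: regress the
+student's image features onto the teacher's with an MSE loss, using a linear
+projector to bridge the dimension gap. The encoders below are stand-ins and all
+sizes are assumptions, not the paper's training setup.
+
+```python
+import torch
+import torch.nn as nn
+
+teacher_dim, student_dim = 768, 512
+
+# Stand-ins for a frozen teacher and a trainable student image encoder.
+teacher_encoder = nn.Linear(3 * 224 * 224, teacher_dim).eval()
+student_encoder = nn.Linear(3 * 224 * 224, student_dim)
+projector = nn.Linear(student_dim, teacher_dim)    # aligns feature dimensions
+
+images = torch.randn(8, 3, 224, 224).flatten(1)
+with torch.no_grad():
+    t_feat = teacher_encoder(images)
+s_feat = projector(student_encoder(images))
+
+# Feature mimicry: simply regress student features onto teacher features.
+kd_loss = nn.functional.mse_loss(s_feat, t_feat)
+kd_loss.backward()
+```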
+
+
+
+ comment: CVPR-2024
+
+
+
+ + ♻ ☆ CascadedGaze: Efficiency in Global Context Extraction for Image + Restoration +
+
+
+
+
+
+
+
+ Amirhosein Ghasemabadi, Muhammad Kamran Janjua, Mohammad Salameh, Chunhua Zhou, Fengyu Sun, Di Niu
+
+
+ Image restoration tasks traditionally rely on convolutional neural networks.
+However, given the local nature of the convolutional operator, they struggle to
+capture global information. The promise of attention mechanisms in Transformers
+is to circumvent this problem, but it comes at the cost of intensive
+computational overhead. Many recent studies in image restoration have focused
+on solving the challenge of balancing performance and computational cost via
+Transformer variants. In this paper, we present CascadedGaze Network (CGNet),
+an encoder-decoder architecture that employs Global Context Extractor (GCE), a
+novel and efficient way to capture global information for image restoration.
+The GCE module leverages small kernels across convolutional layers to learn
+global dependencies, without requiring self-attention. Extensive experimental
+results show that our computationally efficient approach performs competitively
+to a range of state-of-the-art methods on synthetic image denoising and single
+image deblurring tasks, and pushes the performance boundary further on the real
+image denoising task.
+
+
+
+ comment: Published in Transactions on Machine Learning Research (TMLR), 2024.
+ 20 pages
+
+
+
+ + ♻ ☆ Learning Noise-Robust Joint Representation for Multimodal Emotion + Recognition under Incomplete Data Scenarios +
+ +
+ Multimodal emotion recognition (MER) in practical scenarios is significantly
+challenged by the presence of missing or incomplete data across different
+modalities. To overcome these challenges, researchers have aimed to simulate
+incomplete conditions during the training phase to enhance the system's overall
+robustness. Traditional methods have often involved discarding data or
+substituting data segments with zero vectors to approximate these
+incomplete conditions. However, such approaches neither accurately represent
+real-world conditions nor adequately address the issue of noisy data
+availability. For instance, a blurry image still retains information and cannot
+simply be replaced with zero vectors. To tackle this issue and develop a more
+precise MER system, we introduce a novel noise-robust MER model that
+effectively learns robust multimodal joint representations from noisy data.
+This approach includes two pivotal components: firstly, a noise scheduler that
+adjusts the type and level of noise in the data to emulate various realistic
+incomplete situations. Secondly, a Variational AutoEncoder (VAE)-based module
+is employed to reconstruct these robust multimodal joint representations from
+the noisy inputs. Notably, the introduction of the noise scheduler enables the
+exploration of an entirely new type of incomplete data condition, which is
+impossible with existing methods. Extensive experimental evaluations on the
+benchmark datasets IEMOCAP and CMU-MOSEI demonstrate the effectiveness of the
+noise scheduler and the excellent performance of our proposed model.
+
+
+
+
+ + ♻ ☆ NTIRE 2024 Quality Assessment of AI-Generated Content Challenge +
+
+
+
+
+
+
+
+ Xiaohong Liu, Xiongkuo Min, Guangtao Zhai, Chunyi Li, Tengchuan Kou, Wei Sun, Haoning Wu, Yixuan Gao, Yuqin Cao, Zicheng Zhang, Xiele Wu, Radu Timofte, Fei Peng, Huiyuan Fu, Anlong Ming, Chuanming Wang, Huadong Ma, Shuai He, Zifei Dou, Shu Chen, Huacong Zhang, Haiyi Xie, Chengwei Wang, Baoying Chen, Jishen Zeng, Jianquan Yang, Weigang Wang, Xi Fang, Xiaoxin Lv, Jun Yan, Tianwu Zhi, Yabin Zhang, Yaohui Li, Yang Li, Jingwen Xu, Jianzhao Liu, Yiting Liao, Junlin Li, Zihao Yu, Yiting Lu, Xin Li, Hossein Motamednia, S. Farhad Hosseini-Benvidi, Fengbin Guan, Ahmad Mahmoudi-Aznaveh, Azadeh Mansouri, Ganzorig Gankhuyag, Kihwan Yoon, Yifang Xu, Haotian Fan, Fangyuan Kong, Shiling Zhao, Weifeng Dong, Haibing Yin, Li Zhu, Zhiling Wang, Bingchen Huang, Avinab Saha, Sandeep Mishra, Shashank Gupta, Rajesh Sureddi, Oindrila Saha, Luigi Celona, Simone Bianco, Paolo Napoletano, Raimondo Schettini, Junfeng Yang, Jing Fu, Wei Zhang, Wenzhi Cao, Limei Liu, Han Peng, Weijun Yuan, Zhan Li, Yihang Cheng, Yifan Deng, Haohui Li, Bowen Qu, Yao Li, Shuqing Luo, Shunzhou Wang, Wei Gao, Zihao Lu, Marcos V. Conde, Xinrui Wang, Zhibo Chen, Ruling Liao, Yan Ye, Qiulin Wang, Bing Li, Zhaokun Zhou, Miao Geng, Rui Chen, Xin Tao, Xiaoyu Liang, Shangkun Sun, Xingyuan Ma, Jiaze Li, Mengduo Yang, Haoran Xu, Jie Zhou, Shiding Zhu, Bohan Yu, Pengfei Chen, Xinrui Xu, Jiabin Shen, Zhichao Duan, Erfan Asadi, Jiahe Liu, Qi Yan, Youran Qu, Xiaohui Zeng, Lele Wang, Renjie Liao
+
+
+ This paper reports on the NTIRE 2024 Quality Assessment of AI-Generated
+Content Challenge, which will be held in conjunction with the New Trends in
+Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2024. This challenge
+is to address a major challenge in the field of image and video processing,
+namely, Image Quality Assessment (IQA) and Video Quality Assessment (VQA) for
+AI-Generated Content (AIGC). The challenge is divided into the image track and
+the video track. The image track uses the AIGIQA-20K, which contains 20,000
+AI-Generated Images (AIGIs) generated by 15 popular generative models. The
+image track has a total of 318 registered participants. A total of 1,646
+submissions are received in the development phase, and 221 submissions are
+received in the test phase. Finally, 16 participating teams submitted their
+models and fact sheets. The video track uses the T2VQA-DB, which contains
+10,000 AI-Generated Videos (AIGVs) generated by 9 popular Text-to-Video (T2V)
+models. A total of 196 participants have registered in the video track. A total
+of 991 submissions are received in the development phase, and 185 submissions
+are received in the test phase. Finally, 12 participating teams submitted their
+models and fact sheets. Some methods have achieved better results than baseline
+methods, and the winning methods in both tracks have demonstrated superior
+prediction performance on AIGC.
+
+
+
+
+ + ♻ ☆ On Good Practices for Task-Specific Distillation of Large Pretrained + Visual Models +
+ +
+ Large pretrained visual models exhibit remarkable generalization across
+diverse recognition tasks. Yet, real-world applications often demand compact
+models tailored to specific problems. Variants of knowledge distillation have
+been devised for such a purpose, enabling task-specific compact models (the
+students) to learn from a generic large pretrained one (the teacher). In this
+paper, we show that the excellent robustness and versatility of recent
+pretrained models challenge common practices established in the literature,
+calling for a new set of optimal guidelines for task-specific distillation. To
+address the lack of samples in downstream tasks, we also show that a variant of
+Mixup based on stable diffusion complements standard data augmentation. This
+strategy eliminates the need for engineered text prompts and improves
+distillation of generic models into streamlined specialized networks.
+
+
+
+
+ + ♻ ☆ Deep Unlearning: Fast and Efficient Training-free Approach to Class + Forgetting +
+ +
+ Machine unlearning is a prominent and challenging field, driven by regulatory
+demands for user data deletion and heightened privacy awareness. Existing
+approaches involve retraining the model or multiple finetuning steps for each
+deletion request, often constrained by computational limits and restricted data
+access. In this work, we introduce a novel class unlearning algorithm designed
+to strategically eliminate specific classes from the learned model. Our
+algorithm first estimates the Retain and the Forget Spaces using Singular Value
+Decomposition on the layerwise activations for a small subset of samples from
+the retain and unlearn classes, respectively. We then compute the shared
+information between these spaces and remove it from the forget space to isolate
+class-discriminatory feature space. Finally, we obtain the unlearned model by
+updating the weights to suppress the class discriminatory features from the
+activation spaces. We demonstrate our algorithm's efficacy on ImageNet using a
+Vision Transformer with only $\sim 1.5\%$ drop in retain accuracy compared to
+the original model while maintaining under $1\%$ accuracy on the unlearned
+class samples. Further, our algorithm consistently performs well when subject
+to Membership Inference Attacks showing $7.8\%$ improvement on average across a
+variety of image classification datasets and network architectures, as compared
+to other baselines while being $\sim 6 \times$ more computationally efficient.
+Our code is available at https://github.com/sangamesh-kodge/class_forgetting.
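+ One way to read the procedure above is sketched below: SVD gives bases for the
+retain and forget activations, the shared part is removed from the forget basis,
+and the layer weight is projected so inputs in the remaining (discriminatory)
+subspace are suppressed. This is a loose, hedged reconstruction; ranks, scaling,
+and the exact subspace construction in the paper may differ.
+
+```python
+import torch
+
+torch.manual_seed(0)
+d, k = 64, 10                                    # feature dim, subspace rank
+
+# Layer-input activations from a few retain / forget-class samples (d x n).
+A_retain = torch.randn(d, 200)
+A_forget = torch.randn(d, 200)
+
+# Orthonormal bases via SVD (top-k left singular vectors).
+R = torch.linalg.svd(A_retain, full_matrices=False).U[:, :k]
+F = torch.linalg.svd(A_forget, full_matrices=False).U[:, :k]
+
+# Remove the information shared with the retain space from the forget space,
+# leaving a roughly class-discriminatory subspace for the unlearned class.
+F_disc = F - R @ (R.T @ F)
+F_disc = torch.linalg.qr(F_disc).Q               # re-orthonormalize
+
+# Project the weight so inputs lying in that subspace are suppressed:
+# W_new x = W (I - F_disc F_disc^T) x.
+W = torch.randn(128, d)                           # layer weight acting on inputs
+W_unlearned = W @ (torch.eye(d) - F_disc @ F_disc.T)
+```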
+
+
+
+
+ + ♻ ☆ PoseINN: Realtime Visual-based Pose Regression and Localization with + Invertible Neural Networks +
+ +
+ Estimating ego-pose from cameras is an important problem in robotics with
+applications ranging from mobile robotics to augmented reality. While SOTA
+models are becoming increasingly accurate, they can still be unwieldy due to
+high computational costs. In this paper, we propose to solve the problem by
+using invertible neural networks (INN) to find the mapping between the latent
+space of images and poses for a given scene. Our model achieves similar
+performance to the SOTA while being faster to train and only requiring offline
+rendering of low-resolution synthetic data. By using normalizing flows, the
+proposed method also provides uncertainty estimation for the output. We also
+demonstrated the efficiency of this method by deploying the model on a mobile
+robot.
+
+
+
+
+ + ♻ ☆ Zero Grads: Learning Local Surrogate Losses for Non-Differentiable + Graphics SIGGRAPH 2024 +
+ +
+ Gradient-based optimization is now ubiquitous across graphics, but
+unfortunately can not be applied to problems with undefined or zero gradients.
+To circumvent this issue, the loss function can be manually replaced by a
+``surrogate'' that has similar minima but is differentiable. Our proposed
+framework, ZeroGrads, automates this process by learning a neural approximation
+of the objective function, which in turn can be used to differentiate through
+arbitrary black-box graphics pipelines. We train the surrogate on an actively
+smoothed version of the objective and encourage locality, focusing the
+surrogate's capacity on what matters at the current training episode. The
+fitting is performed online, alongside the parameter optimization, and
+self-supervised, without pre-computed data or pre-trained models. As sampling
+the objective is expensive (it requires a full rendering or simulator run), we
+devise an efficient sampling scheme that allows for tractable run-times and
+competitive performance at little overhead. We demonstrate optimizing diverse
+non-convex, non-differentiable black-box problems in graphics, such as
+visibility in rendering, discrete parameter spaces in procedural modelling or
+optimal control in physics-driven animation. In contrast to other
+derivative-free algorithms, our approach scales well to higher dimensions,
+which we demonstrate on problems with up to 35k interlinked variables.
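+ The mechanics can be illustrated on a toy problem: fit a small MLP surrogate to
+locally smoothed samples of a non-differentiable objective and step the
+parameters with the surrogate's gradient. Sampling schedules, smoothing, and
+architecture below are simplified assumptions, not the paper's implementation.
+
+```python
+import torch
+import torch.nn as nn
+
+def black_box_loss(theta):
+    """Non-differentiable objective (stand-in for a renderer/simulator run)."""
+    with torch.no_grad():
+        return torch.round(theta * 10).pow(2).sum() / 100.0   # zero gradient a.e.
+
+theta = torch.tensor([1.5, -0.7], requires_grad=True)
+surrogate = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
+opt_theta = torch.optim.Adam([theta], lr=0.05)
+opt_surr = torch.optim.Adam(surrogate.parameters(), lr=1e-2)
+sigma = 0.2                                        # local smoothing radius
+
+for step in range(200):
+    # Fit the surrogate on samples drawn around the current parameters.
+    samples = theta.detach() + sigma * torch.randn(16, 2)
+    targets = torch.stack([black_box_loss(s) for s in samples]).unsqueeze(1)
+    opt_surr.zero_grad()
+    nn.functional.mse_loss(surrogate(samples), targets).backward()
+    opt_surr.step()
+
+    # Differentiate through the surrogate instead of the black box.
+    opt_theta.zero_grad()
+    surrogate(theta.unsqueeze(0)).sum().backward()
+    opt_theta.step()
+
+print(theta.detach())    # drifts toward the minimum near zero
+```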
+
+
+
+ comment: Accepted at SIGGRAPH 2024. Project page:
+ https://mfischer-ucl.github.io/zerograds
+
+
+
+ + ♻ ☆ Solving the bongard-logo problem by modeling a probabilistic model +
+ +
+ Abstract reasoning problems challenge the perceptual and cognitive abilities
+of AI algorithms, demanding deeper pattern discernment and inductive reasoning
+beyond explicit image features. This study introduces PMoC, a tailored
+probability model for the Bongard-Logo problem, achieving high reasoning
+accuracy by constructing independent probability models. Additionally, we
+present Pose-Transformer, an enhanced Transformer-Encoder designed for complex
+abstract reasoning tasks, including Bongard-Logo, RAVEN, I-RAVEN, and PGM.
+Pose-Transformer incorporates positional information learning, inspired by
+capsule networks' pose matrices, enhancing its focus on local positional
+relationships in image data processing. When integrated with PMoC, it further
+improves reasoning accuracy. Our approach effectively addresses reasoning
+difficulties associated with abstract entities' positional changes,
+outperforming previous models on the OIG, D3$\times$3 subsets of RAVEN, and PGM
+databases. This research contributes to advancing AI's capabilities in abstract
+reasoning and cognitive pattern recognition.
+
+
+
+ comment: 14 pages, 11 figures, 3 tables
+
+
+
+ + ♻ ☆ A Unified Approach for Text- and Image-guided 4D Scene Generation +
+
+
+
+
+
+
+
+ Yufeng Zheng, Xueting Li, Koki Nagano, Sifei Liu, Karsten Kreis, Otmar Hilliges, Shalini De Mello
+
+
+ Large-scale diffusion generative models are greatly simplifying image, video
+and 3D asset creation from user-provided text prompts and images. However, the
+challenging problem of text-to-4D dynamic 3D scene generation with diffusion
+guidance remains largely unexplored. We propose Dream-in-4D, which features a
+novel two-stage approach for text-to-4D synthesis, leveraging (1) 3D and 2D
+diffusion guidance to effectively learn a high-quality static 3D asset in the
+first stage; (2) a deformable neural radiance field that explicitly
+disentangles the learned static asset from its deformation, preserving quality
+during motion learning; and (3) a multi-resolution feature grid for the
+deformation field with a displacement total variation loss to effectively learn
+motion with video diffusion guidance in the second stage. Through a user
+preference study, we demonstrate that our approach significantly advances image
+and motion quality, 3D consistency and text fidelity for text-to-4D generation
+compared to baseline approaches. Thanks to its motion-disentangled
+representation, Dream-in-4D can also be easily adapted for controllable
+generation where appearance is defined by one or multiple images, without the
+need to modify the motion learning stage. Thus, our method offers, for the
+first time, a unified approach for text-to-4D, image-to-4D and personalized 4D
+generation tasks.
+
+
+
+ comment: Project page: https://research.nvidia.com/labs/nxp/dream-in-4d/
+
+
+
+ + ♻ ☆ SDDGR: Stable Diffusion-based Deep Generative Replay for Class + Incremental Object Detection CVPR 2024 +
+ +
+ In the field of class incremental learning (CIL), generative replay has
+become increasingly prominent as a method to mitigate catastrophic forgetting,
+alongside the continuous improvements in generative models.
+However, its application in class incremental object detection (CIOD) has been
+significantly limited, primarily due to the complexities of scenes involving
+multiple labels. In this paper, we propose a novel approach called stable
+diffusion deep generative replay (SDDGR) for CIOD. Our method utilizes a
+diffusion-based generative model with pre-trained text-to-diffusion networks to
+generate realistic and diverse synthetic images. SDDGR incorporates an
+iterative refinement strategy to produce high-quality images encompassing old
+classes. Additionally, we adopt an L2 knowledge distillation technique to
+improve the retention of prior knowledge in synthetic images. Furthermore, our
+approach includes pseudo-labeling for old objects within new task images,
+preventing misclassification as background elements. Extensive experiments on
+the COCO 2017 dataset demonstrate that SDDGR significantly outperforms existing
+algorithms, achieving a new state-of-the-art in various CIOD scenarios. The
+source code will be made available to the public.
+
+
+
+ comment: Accept to CVPR 2024. The camera-ready version
+
+
+
+ + ♻ ☆ Enhancing Boundary Segmentation for Topological Accuracy with + Skeleton-based Methods +
+
+
+
+
+
+
+
+ Chuni Liu, Boyuan Ma, Xiaojuan Ban, Yujie Xie, Hao Wang, Weihua Xue, Jingchao Ma, Ke Xu
+
+
+ Topological consistency plays a crucial role in the task of boundary
+segmentation for reticular images, such as cell membrane segmentation in neuron
+electron microscopic images, grain boundary segmentation in material
+microscopic images and road segmentation in aerial images. In these fields,
+topological changes in segmentation results have a serious impact on the
+downstream tasks, which can even exceed the misalignment of the boundary
+itself. To enhance the topology accuracy in segmentation results, we propose
+the Skea-Topo Aware loss, which is a novel loss function that takes into
+account the shape of each object and topological significance of the pixels. It
+consists of two components. First, a skeleton-aware weighted loss improves the
+segmentation accuracy by better modeling the object geometry with skeletons.
+Second, a boundary rectified term effectively identifies and emphasizes
+topological critical pixels in the prediction errors using both foreground and
+background skeletons in the ground truth and predictions. Experiments prove
+that our method improves topological consistency by up to 7 points in VI
+compared to 13 state-of-the-art methods, based on objective and subjective
+assessments across three different boundary segmentation datasets. The code is
+available at https://github.com/clovermini/Skea_topo.
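+ The skeleton-aware weighting can be pictured with the short sketch below: pixels
+close to the foreground skeleton receive larger weights in a weighted
+cross-entropy. The weighting function is an assumption; the paper's loss and its
+boundary rectification term are more elaborate.
+
+```python
+import numpy as np
+import torch
+from skimage.morphology import skeletonize
+from scipy.ndimage import distance_transform_edt
+
+def skeleton_weight_map(mask, w_max=5.0):
+    """Weight pixels by closeness to the object skeleton so thin structures
+    and medial regions dominate the loss (mask: binary HxW numpy array)."""
+    skel = skeletonize(mask.astype(bool))
+    dist = distance_transform_edt(~skel)              # distance to skeleton
+    return 1.0 + (w_max - 1.0) * np.exp(-dist / 5.0)  # larger near the skeleton
+
+mask = np.zeros((64, 64), dtype=np.uint8)
+mask[20:44, 10:54] = 1                                # toy ground-truth object
+weights = torch.from_numpy(skeleton_weight_map(mask)).float()
+
+logits = torch.randn(64, 64, requires_grad=True)
+target = torch.from_numpy(mask).float()
+loss = torch.nn.functional.binary_cross_entropy_with_logits(
+    logits, target, weight=weights)
+loss.backward()
+```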
+
+
+
+
+ + ♻ ☆ Motion State: A New Benchmark Multiple Object Tracking +
+ +
+In the realm of video analysis, multiple object tracking (MOT) is of paramount
+importance, and the motion state of objects (whether static or dynamic relative
+to the ground) holds practical significance across diverse scenarios. However,
+this aspect has received little exploration in the existing literature. Deep
+learning methodologies encounter challenges in
+accurately discerning object motion states, while conventional approaches
+reliant on comprehensive mathematical modeling may yield suboptimal tracking
+accuracy. To address these challenges, we introduce a Model-Data-Driven Motion
+State Judgment Object Tracking Method (MoD2T). This innovative architecture
+adeptly amalgamates traditional mathematical modeling with deep learning-based
+multi-object tracking frameworks. The integration of mathematical modeling and
+deep learning within MoD2T enhances the precision of object motion state
+determination, thereby elevating tracking accuracy. Our empirical
+investigations comprehensively validate the efficacy of MoD2T across varied
+scenarios, encompassing unmanned aerial vehicle surveillance and street-level
+tracking. Furthermore, to gauge the method's adeptness in discerning object
+motion states, we introduce the Motion State Validation F1 (MVF1) metric. This
+novel performance metric aims to quantitatively assess the accuracy of motion
+state classification, furnishing a comprehensive evaluation of MoD2T's
+performance. Elaborate experimental validations corroborate the rationality of
+MVF1. In order to holistically appraise MoD2T's performance, we meticulously
+annotate several renowned datasets and subject MoD2T to stringent testing.
+Remarkably, under conditions characterized by minimal or moderate camera
+motion, the achieved MVF1 values are particularly noteworthy, with exemplars
+including 0.774 for the KITTI dataset, 0.521 for MOT17, and 0.827 for UAVDT.
+
+
+
+
+ + ♻ ☆ A Novel Approach to Chest X-ray Lung Segmentation Using U-net and + Modified Convolutional Block Attention Module +
+ +
+ Lung segmentation in chest X-ray images is of paramount importance as it
+plays a crucial role in the diagnosis and treatment of various lung diseases.
+This paper presents a novel approach for lung segmentation in chest X-ray
+images by integrating U-net with attention mechanisms. The proposed method
+enhances the U-net architecture by incorporating a Convolutional Block
+Attention Module (CBAM), which unifies three distinct attention mechanisms:
+channel attention, spatial attention, and pixel attention. The channel
+attention mechanism enables the model to concentrate on the most informative
+features across various channels. The spatial attention mechanism enhances the
+model's precision in localization by focusing on significant spatial locations.
+Lastly, the pixel attention mechanism empowers the model to focus on individual
+pixels, further refining the model's focus and thereby improving the accuracy
+of segmentation. The adoption of the proposed CBAM in conjunction with the
+U-net architecture marks a significant advancement in the field of medical
+imaging, with potential implications for improving diagnostic precision and
+patient outcomes. The efficacy of this method is validated against contemporary
+state-of-the-art techniques, showcasing its superiority in segmentation
+performance.
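+ A compact sketch of the modified attention module is given below, with channel,
+spatial, and pixel attention applied in sequence to a feature map. The exact
+composition and placement inside the U-net are assumptions for illustration.
+
+```python
+import torch
+import torch.nn as nn
+
+class ModifiedCBAM(nn.Module):
+    """Channel, spatial, and pixel attention applied in sequence (a sketch)."""
+    def __init__(self, channels, reduction=8):
+        super().__init__()
+        self.channel_mlp = nn.Sequential(
+            nn.Linear(channels, channels // reduction), nn.ReLU(),
+            nn.Linear(channels // reduction, channels))
+        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)
+        self.pixel_conv = nn.Conv2d(channels, channels, kernel_size=1)
+
+    def forward(self, x):
+        b, c, _, _ = x.shape
+        # Channel attention: squeeze spatially (avg + max), excite per channel.
+        avg = self.channel_mlp(x.mean(dim=(2, 3)))
+        mx = self.channel_mlp(x.amax(dim=(2, 3)))
+        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
+        # Spatial attention: where to look, from channel-pooled maps.
+        pooled = torch.cat([x.mean(dim=1, keepdim=True),
+                            x.amax(dim=1, keepdim=True)], dim=1)
+        x = x * torch.sigmoid(self.spatial_conv(pooled))
+        # Pixel attention: a per-pixel, per-channel gate.
+        return x * torch.sigmoid(self.pixel_conv(x))
+
+feat = torch.randn(2, 32, 56, 56)       # a U-net encoder/skip feature map
+out = ModifiedCBAM(32)(feat)            # same shape, attention-refined
+```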
+
+
+
+
+ + ♻ ☆ Paint-it: Text-to-Texture Synthesis via Deep Convolutional Texture Map + Optimization and Physically-Based Rendering CVPR 2024 +
+ +
+ We present Paint-it, a text-driven high-fidelity texture map synthesis method
+for 3D meshes via neural re-parameterized texture optimization. Paint-it
+synthesizes texture maps from a text description by
+synthesis-through-optimization, exploiting the Score-Distillation Sampling
+(SDS). We observe that directly applying SDS yields undesirable texture quality
+due to its noisy gradients. We reveal the importance of texture
+parameterization when using SDS. Specifically, we propose Deep Convolutional
+Physically-Based Rendering (DC-PBR) parameterization, which re-parameterizes
+the physically-based rendering (PBR) texture maps with randomly initialized
+convolution-based neural kernels, instead of a standard pixel-based
+parameterization. We show that DC-PBR inherently schedules the optimization
+curriculum according to texture frequency and naturally filters out the noisy
+signals from SDS. In experiments, Paint-it obtains remarkable quality PBR
+texture maps within 15 min., given only a text description. We demonstrate the
+generalizability and practicality of Paint-it by synthesizing high-quality
+texture maps for large-scale mesh datasets and showing test-time applications
+such as relighting and material control using a popular graphics engine.
+Project page: https://kim-youwang.github.io/paint-it
+
+
+
+ comment: CVPR 2024. Project page: https://kim-youwang.github.io/paint-it
+
+
+
+ + ♻ ☆ Zero-Shot Stitching in Reinforcement Learning using Relative + Representations +
+
+
+
+
+
+
+
+ Antonio Pio Ricciardi, Valentino Maiorca, Luca Moschella, Riccardo Marin, Emanuele Rodolà
+
+
+ Visual Reinforcement Learning is a popular and powerful framework that takes
+full advantage of the Deep Learning breakthrough. However, it is also known
+that variations in the input (e.g., different colors of the scenery depending
+on the season) or in the task (e.g., a change in the speed limit a car must
+respect) can require completely retraining the agents. In this work, we
+leverage recent developments in unifying latent representations to demonstrate
+that it is possible to combine the components of an agent, rather than retrain
+it from scratch. We build upon the recent relative representations framework
+and adapt it for Visual RL. This allows us to create completely new agents
+capable of handling environment-task combinations never seen during training.
+Our work paves the road toward a more accessible and flexible use of
+reinforcement learning.
+
+
+
+ comment: 13 pages, 10 figures, 4 tables
+
+
+
+ + ♻ ☆ Dynamic Event-based Optical Identification and Communication +
+
+
+
+
+
+
+
+ Axel von Arnim, Jules Lecomte, Naima Elosegui Borras, Stanislaw Wozniak, Angeliki Pantazi
+
+
+ Optical identification is often done with spatial or temporal visual pattern
+recognition and localization. Temporal pattern recognition, depending on the
+technology, involves a trade-off between communication frequency, range and
+accurate tracking. We propose a solution with light-emitting beacons that
+improves this trade-off by exploiting fast event-based cameras and, for
+tracking, sparse neuromorphic optical flow computed with spiking neurons. The
+system is embedded in a simulated drone and evaluated in an asset monitoring
+use case. It is robust to relative movements and enables simultaneous
+communication with, and tracking of, multiple moving beacons. Finally, in a
+hardware lab prototype, we demonstrate for the first time beacon tracking
+performed simultaneously with state-of-the-art frequency communication in the
+kHz range.
+
+
+
+
+ + ♻ ☆ Monkeypox disease recognition model based on improved SE-InceptionV3 +
+ +
+ In the wake of the global spread of monkeypox, accurate disease recognition
+has become crucial. This study introduces an improved SE-InceptionV3 model,
+embedding the SENet module and incorporating L2 regularization into the
+InceptionV3 framework to enhance monkeypox disease detection. Utilizing the
+Kaggle monkeypox dataset, which includes images of monkeypox and similar skin
+conditions, our model demonstrates a noteworthy accuracy of 96.71% on the test
+set, outperforming conventional methods and deep learning models. The SENet
+module's channel attention mechanism significantly enhances feature
+representation, while L2 regularization ensures robust generalization.
+Extensive experiments validate the model's superiority in precision, recall, and
+F1 score, highlighting its effectiveness in differentiating monkeypox lesions
+in diverse and complex cases. The study not only provides insights into the
+application of advanced CNN architectures in medical diagnostics but also opens
+avenues for further research in model optimization and hyperparameter tuning
+for enhanced disease recognition. https://github.com/jzc777/SE-inceptionV3-L2
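+
+A hedged sketch of the two ingredients named above: a squeeze-and-excitation
+(SE) channel-attention block and L2 regularization applied as the optimizer's
+weight decay. Where the SE modules sit inside InceptionV3 is not specified in
+the abstract, so the backbone here is only a stand-in.
+
+import torch
+import torch.nn as nn
+from torchvision.models import inception_v3
+
+class SEBlock(nn.Module):
+    def __init__(self, channels, reduction=16):
+        super().__init__()
+        self.fc = nn.Sequential(
+            nn.Linear(channels, channels // reduction), nn.ReLU(),
+            nn.Linear(channels // reduction, channels), nn.Sigmoid())
+
+    def forward(self, x):
+        w = self.fc(x.mean(dim=(2, 3)))        # squeeze then excite
+        return x * w.view(*w.shape, 1, 1)      # reweight channels
+
+backbone = inception_v3(weights=None, num_classes=2, aux_logits=False)
+# L2 regularization corresponds to weight decay in the optimizer.
+optimizer = torch.optim.Adam(backbone.parameters(), lr=1e-4, weight_decay=1e-4)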
+
+
+
+
+ + ♻ ☆ Human Image Generation: A Comprehensive Survey +
+ +
+ Image and video synthesis has become a blooming topic in computer vision and
+machine learning communities along with the developments of deep generative
+models, due to its great academic and application value. Many researchers have
+been devoted to synthesizing high-fidelity human images as one of the most
+commonly seen object categories in daily lives, where a large number of studies
+are performed based on various models, task settings and applications. Thus, it
+is necessary to give a comprehensive overview on these variant methods on human
+image generation. In this paper, we divide human image generation techniques
+into three paradigms, i.e., data-driven methods, knowledge-guided methods and
+hybrid methods. For each paradigm, the most representative models and the
+corresponding variants are presented, where the advantages and characteristics
+of different methods are summarized in terms of model architectures. Besides,
+the main public human image datasets and evaluation metrics in the literature
+are summarized. Furthermore, due to the wide application potentials, the
+typical downstream usages of synthesized human images are covered. Finally, the
+challenges and potential opportunities of human image generation are discussed
+to shed light on future research.
+
+
+
+ comment: Under Review
+
+
+
+ + ♻ ☆ CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor CVPR 2024 +
+ +
+ Existing open-vocabulary image segmentation methods require a fine-tuning
+step on mask labels and/or image-text datasets. Mask labels are
+labor-intensive, which limits the number of categories in segmentation
+datasets. Consequently, the vocabulary capacity of pre-trained VLMs is severely
+reduced after fine-tuning. However, without fine-tuning, VLMs trained under
+weak image-text supervision tend to make suboptimal mask predictions. To
+alleviate these issues, we introduce a novel recurrent framework that
+progressively filters out irrelevant texts and enhances mask quality without
+training efforts. The recurrent unit is a two-stage segmenter built upon a
+frozen VLM. Thus, our model retains the VLM's broad vocabulary space and equips
+it with segmentation ability. Experiments show that our method outperforms not
+only the training-free counterparts, but also those fine-tuned with millions of
+data samples, and sets new state-of-the-art records for both zero-shot
+semantic and referring segmentation. Concretely, we improve the current record
+by 28.8, 16.0, and 6.9 mIoU on Pascal VOC, COCO Object, and Pascal Context.
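+
+A purely structural sketch of the recurrent filtering loop described above;
+segment_with_frozen_vlm and text_relevance_score are hypothetical placeholders
+for the frozen-VLM two-stage segmenter, not real APIs.
+
+def clip_as_rnn_sketch(image, texts, segment_with_frozen_vlm,
+                       text_relevance_score, threshold=0.5, max_iters=5):
+    masks = {}
+    for _ in range(max_iters):
+        # Propose and refine one mask per remaining text query.
+        masks = {t: segment_with_frozen_vlm(image, t) for t in texts}
+        # Keep only texts whose mask is judged relevant to the image.
+        kept = [t for t in texts
+                if text_relevance_score(image, masks[t], t) > threshold]
+        if len(kept) == len(texts):   # converged: nothing filtered out
+            break
+        texts = kept
+    return {t: masks[t] for t in texts}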
+
+
+
+ comment: To appear in CVPR 2024. Project page:
+ https://torrvision.com/clip_as_rnn/
+
+
+
+ + ♻ ☆ An Attention Based Pipeline for Identifying Pre-Cancer Lesions in Head + and Neck Clinical Images +
+ +
+ Early detection of cancer can help improve patient prognosis by early
+intervention. Head and neck cancer is diagnosed in specialist centres after a
+surgical biopsy, however, there is a potential for these to be missed leading
+to delayed diagnosis. To overcome these challenges, we present an attention
+based pipeline that identifies suspected lesions, segments, and classifies them
+as non-dysplastic, dysplastic and cancerous lesions. We propose (a) a vision
+transformer based Mask R-CNN network for lesion detection and segmentation of
+clinical images, and (b) Multiple Instance Learning (MIL) based scheme for
+classification. Current results show that the segmentation model produces
+segmentation masks and bounding boxes with up to 82% overlap accuracy on
+unseen external test data, surpassing reviewed segmentation benchmarks. The
+classification stage achieves an F1-score of 85% on the internal cohort test
+set. An app has been developed to perform lesion segmentation on images taken
+via a smart device.
+Future work involves employing endoscopic video data for precise early
+detection and prognosis.
+
+
+
+ comment: 5 pages, 3 figures, accepted in ISBI 2024, update: corrected typos
+
+
+
+ + ♻ ☆ Unified Dynamic Scanpath Predictors Outperform Individually Trained + Neural Models +
+ +
+ Previous research on scanpath prediction has mainly focused on group models,
+disregarding the fact that the scanpaths and attentional behaviors of
+individuals are diverse. The disregard of these differences is especially
+detrimental to social human-robot interaction, whereby robots commonly emulate
+human gaze based on heuristics or predefined patterns. However, human gaze
+patterns are heterogeneous and varying behaviors can significantly affect the
+outcomes of such human-robot interactions. To fill this gap, we developed a
+deep learning-based social cue integration model for saliency prediction to
+instead predict scanpaths in videos. Our model learned scanpaths by recursively
+integrating fixation history and social cues through a gating mechanism and
+sequential attention. We evaluated our approach on gaze datasets of dynamic
+social scenes, observed under the free-viewing condition. The introduction of
+fixation history into our models makes it possible to train a single unified
+model rather than the resource-intensive approach of training individual models
+for each set of scanpaths. We observed that the late neural integration
+approach surpasses early fusion when training models on a large dataset, in
+comparison to a smaller dataset with a similar distribution. Results also
+indicate that a single unified model, trained on all the observers' scanpaths,
+performs on par or better than individually trained models. We hypothesize that
+this outcome is a result of the group saliency representations instilling
+universal attention in the model, while the supervisory signal and fixation
+history guide it to learn personalized attentional behaviors, providing the
+unified model a benefit over individual models due to its implicit
+representation of universal attention.
+
+
+
+
+ + ♻ ☆ TwinDiffusion: Enhancing Coherence and Efficiency in Panoramic Image + Generation with Diffusion Models +
+ +
+ Diffusion models have emerged as effective tools for generating diverse and
+high-quality content. However, their capability in high-resolution image
+generation, particularly for panoramic images, still faces challenges such as
+visible seams and incoherent transitions. In this paper, we propose
+TwinDiffusion, an optimized framework designed to address these challenges
+through two key innovations: Crop Fusion for quality enhancement and Cross
+Sampling for efficiency optimization. We introduce a training-free optimizing
+stage to refine the similarity of the adjacent image areas, as well as an
+interleaving sampling strategy to yield dynamic patches during the cropping
+process. A comprehensive evaluation is conducted to compare TwinDiffusion with
+the existing methods, considering factors including coherence, fidelity,
+compatibility, and efficiency. The results demonstrate the superior performance
+of our approach in generating seamless and coherent panoramas, setting a new
+standard in quality and efficiency for panoramic image generation.
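+
+A minimal PyTorch sketch of one reading of the Crop Fusion step: a short,
+training-free optimization that pulls the overlapping regions of two adjacent
+latent crops toward agreement; the window size, loss, and step count are
+assumptions rather than the paper's settings.
+
+import torch
+import torch.nn.functional as F
+
+def fuse_adjacent_crops(left, right, overlap, steps=10, lr=0.1):
+    """left, right: latent crops of shape (C, H, W); overlap: width in latent pixels."""
+    left = left.clone().requires_grad_(True)
+    right = right.clone().requires_grad_(True)
+    opt = torch.optim.Adam([left, right], lr=lr)
+    for _ in range(steps):
+        opt.zero_grad()
+        # Penalize disagreement on the shared strip of the two crops.
+        loss = F.mse_loss(left[..., -overlap:], right[..., :overlap])
+        loss.backward()
+        opt.step()
+    return left.detach(), right.detach()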
+
+
+
+
+ + ♻ ☆ Explainable Classification Techniques for Quantum Dot Device + Measurements +
+
+
+
+
+
+
+
+ Daniel Schug, Tyler J. Kovach, M. A. Wolfe, Jared Benson, Sanghyeok Park, J. P. Dodson, J. Corrigan, M. A. Eriksson, Justyna P. Zwolak
+
+
+ In the physical sciences, there is an increased need for robust feature
+representations of image data: image acquisition, in the generalized sense of
+two-dimensional data, is now widespread across a large number of fields,
+including quantum information science, which we consider here. While
+traditional image features are widely utilized in such cases, their use is
+rapidly being supplanted by Neural Network-based techniques that often
+sacrifice explainability in exchange for high accuracy. To ameliorate this
+trade-off, we propose a synthetic data-based technique that results in
+explainable features. We show, using Explainable Boosting Machines (EBMs), that
+this method offers superior explainability without sacrificing accuracy.
+Specifically, we show that there is a meaningful benefit to this technique in
+the context of quantum dot tuning, where human intervention is necessary at the
+current stage of development.
+
+
+
+ comment: 5 pages, 3 figures
+
+
+
+ + ♻ ☆ Towards Generalizing to Unseen Domains with Few Labels CVPR 2024 +
+
+
+
+
+
+
+
+ Chamuditha Jayanga Galappaththige, Sanoojan Baliah, Malitha Gunawardhana, Muhammad Haris Khan
+
+
+ We approach the challenge of addressing semi-supervised domain generalization
+(SSDG). Specifically, our aim is to obtain a model that learns
+domain-generalizable features by leveraging a limited subset of labelled data
+alongside a substantially larger pool of unlabeled data. Existing domain
+generalization (DG) methods which are unable to exploit unlabeled data perform
+poorly compared to semi-supervised learning (SSL) methods under the SSDG setting.
+Nevertheless, SSL methods have considerable room for performance improvement
+when compared to fully-supervised DG training. To tackle this underexplored,
+yet highly practical problem of SSDG, we make the following core contributions.
+First, we propose a feature-based conformity technique that matches the
+posterior distributions from the feature space with the pseudo-label from the
+model's output space. Second, we develop a semantics alignment loss to learn
+semantically-compatible representations by regularizing the semantic structure
+in the feature space. Our method is plug-and-play and can be readily integrated
+with different SSL-based SSDG baselines without introducing any additional
+parameters. Extensive experimental results across five challenging DG
+benchmarks with four strong SSL baselines suggest that our method provides
+consistent and notable gains in two different SSDG settings.
+
+
+
+ comment: Accepted at CVPR 2024
+
+
+
+ + ♻ ☆ ReFACT: Updating Text-to-Image Models by Editing the Text Encoder NAACL 2024 +
+ +
+ Our world is marked by unprecedented technological, global, and
+socio-political transformations, posing a significant challenge to
+text-to-image generative models. These models encode factual associations
+within their parameters that can quickly become outdated, diminishing their
+utility for end-users. To that end, we introduce ReFACT, a novel approach for
+editing factual associations in text-to-image models without relying on
+explicit input from end-users or costly re-training. ReFACT updates the weights
+of a specific layer in the text encoder, modifying only a tiny portion of the
+model's parameters and leaving the rest of the model unaffected. We empirically
+evaluate ReFACT on an existing benchmark, alongside a newly curated dataset.
+Compared to other methods, ReFACT achieves superior performance in both
+generalization to related concepts and preservation of unrelated concepts.
+Furthermore, ReFACT maintains image generation quality, making it a practical
+tool for updating and correcting factual information in text-to-image models.
+
+
+
+ comment: Accepted to NAACL 2024 (Main Conference)
+
+
+
+ + ♻ ☆ Yuille-Poggio's Flow and Global Minimizer of Polynomials through + Convexification by Heat Evolution +
+ +
+ This study examines the convexification version of the backward differential
+flow algorithm for the global minimization of polynomials, introduced by O.
+Arikan \textit{et al} in \cite{ABK}. It investigates why this approach might
+fail with high-degree polynomials yet succeeds with quartic polynomials. We
+employ the heat evolution method for convexification combined with Gaussian
+filtering, which acts as a cumulative form of Steklov's regularization. In this
+context, we apply the fingerprint theory from computer vision. Originally
+developed by A.L. Yuille and T. Poggio in the 1980s for computer vision, the
+fingerprint theory, particularly the fingerprint trajectory equation, is used
+to illustrate the scaling (temporal) evolution of minimizers. In the case of
+general polynomials, our research has led to the creation of the Yuille-Poggio
+flow and a broader interpretation of the fingerprint concepts; in particular,
+we establish a necessary and sufficient condition for the convexified backward
+differential flow algorithm to achieve global minimization. For quartic
+polynomials, our analysis not only reflects the
+results of O. Arikan et al. \cite{ABK} but also presents a significantly
+simpler version of Newton's method that can always globally minimize quartic
+polynomials without convexification.
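+
+A small worked example of the convexified backward-flow idea on a quartic.
+The heat evolution (Gaussian smoothing) of a polynomial has a closed form:
+with Z a standard Gaussian and f(x) = x^4 - 3x^2 + x, E[f(x + sigma*Z)] =
+x^4 + (6*sigma^2 - 3)*x^2 + x up to a constant. The grid of scales and the
+descent parameters are illustrative choices, not the paper's setup.
+
+import numpy as np
+
+def grad_smoothed(x, sigma):
+    # Derivative of the Gaussian-smoothed quartic above (constant term drops).
+    return 4 * x**3 + 2 * (6 * sigma**2 - 3) * x + 1
+
+x = 0.0                                    # any start works at a convex scale
+for sigma in np.linspace(1.0, 0.0, 200):   # backward flow: large scale -> zero
+    for _ in range(100):                   # local gradient descent at this scale
+        x -= 0.01 * grad_smoothed(x, sigma)
+
+print("tracked minimizer:", x)             # close to the global minimizer ~ -1.30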
+
+
+
+
+ + ♻ ☆ The Impact of Background Removal on Performance of Neural Networks for + Fashion Image Classification and Segmentation +
+ +
+ Fashion understanding is a hot topic in computer vision, with many
+applications having great business value in the market. Fashion understanding
+remains a difficult challenge for computer vision due to the immense diversity
+of garments and various scenes and backgrounds. In this work, we try removing
+the background from fashion images to boost data quality and increase model
+performance. Having fashion images of evident persons in fully visible
+garments, we can utilize Salient Object Detection to achieve the background
+removal of fashion data to our expectations. A fashion image with the
+background removed is referred to as the "rembg" image, contrasting with the
+original one in the fashion dataset. We conducted extensive comparative
+experiments with these two types of images on multiple aspects of model
+training, including model architectures, model initialization, compatibility
+with other training tricks and data augmentations, and target task types. Our
+experiments show that background removal can effectively work for fashion data
+in simple and shallow networks that are not susceptible to overfitting. It can
+improve model accuracy by up to 5% in the classification on the FashionStyle14
+dataset when training models from scratch. However, background removal does not
+perform well in deep neural networks due to incompatibility with other
+regularization techniques like batch normalization, pre-trained initialization,
+and data augmentations introducing randomness. The loss of background pixels
+invalidates many existing training tricks in the model training, adding the
+risk of overfitting for deep models.
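+
+A minimal sketch of the preprocessing step described above; the "rembg" naming
+suggests the open-source rembg package for salient-object background removal,
+but the paths and the assumption that this exact tool was used are ours.
+
+from pathlib import Path
+from PIL import Image
+from rembg import remove          # pip install rembg
+
+src_dir, dst_dir = Path("fashion/images"), Path("fashion/images_rembg")
+dst_dir.mkdir(parents=True, exist_ok=True)
+for img_path in src_dir.glob("*.jpg"):
+    with Image.open(img_path) as img:
+        remove(img).convert("RGB").save(dst_dir / img_path.name)
+# Train the same classifier on both folders and compare accuracy.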
+
+
+
+ comment: 9 pages, 9 figures
+
+
+
+ + ♻ ☆ Not All Similarities Are Created Equal: Leveraging Data-Driven Biases to + Inform GenAI Copyright Disputes +
+
+
+
+
+
+
+
+ Uri Hacohen, Adi Haviv, Shahar Sarfaty, Bruria Friedman, Niva Elkin-Koren, Roi Livni, Amit H Bermano
+
+
+ The advent of Generative Artificial Intelligence (GenAI) models, including
+GitHub Copilot, OpenAI GPT, and Stable Diffusion, has revolutionized content
+creation, enabling non-professionals to produce high-quality content across
+various domains. This transformative technology has led to a surge of synthetic
+content and sparked legal disputes over copyright infringement. To address
+these challenges, this paper introduces a novel approach that leverages the
+learning capacity of GenAI models for copyright legal analysis, demonstrated
+with GPT2 and Stable Diffusion models. Copyright law distinguishes between
+original expressions and generic ones (Scènes à faire), protecting the
+former and permitting reproduction of the latter. However, this distinction has
+historically been challenging to make consistently, leading to over-protection
+of copyrighted works. GenAI offers an unprecedented opportunity to enhance this
+legal analysis by revealing shared patterns in preexisting works. We propose a
+data-driven approach to identify the genericity of works created by GenAI,
+employing "data-driven bias" to assess the genericity of expressive
+compositions. This approach aids in copyright scope determination by utilizing
+the capabilities of GenAI to identify and prioritize expressive elements and
+rank them according to their frequency in the model's dataset. The potential
+implications of measuring expressive genericity for copyright law are profound.
+Such scoring could assist courts in determining copyright scope during
+litigation, inform the registration practices of Copyright Offices, allowing
+registration of only highly original synthetic works, and help copyright owners
+signal the value of their works and facilitate fairer licensing deals. More
+generally, this approach offers valuable insights to policymakers grappling
+with adapting copyright law to the challenges posed by the era of GenAI.
+
+
+
+ comment: Presented at ACM CSLAW 2024
+
+
+
+ + ♻ ☆ PINQI: An End-to-End Physics-Informed Approach to Learned Quantitative + MRI Reconstruction +
+ +
+ Quantitative Magnetic Resonance Imaging (qMRI) enables the reproducible
+measurement of biophysical parameters in tissue. The challenge lies in solving
+a nonlinear, ill-posed inverse problem to obtain the desired tissue parameter
+maps from acquired raw data. While various learned and non-learned approaches
+have been proposed, the existing learned methods fail to fully exploit the
+prior knowledge about the underlying MR physics, i.e. the signal model and the
+acquisition model. In this paper, we propose PINQI, a novel qMRI reconstruction
+method that integrates the knowledge about the signal, acquisition model, and
+learned regularization into a single end-to-end trainable neural network. Our
+approach is based on unrolled alternating optimization, utilizing
+differentiable optimization blocks to solve inner linear and non-linear
+optimization tasks, as well as convolutional layers for regularization of the
+intermediate qualitative images and parameter maps. This design enables PINQI
+to leverage the advantages of both the signal model and learned regularization.
+We evaluate the performance of our proposed network by comparing it with
+recently published approaches in the context of highly undersampled
+$T_1$-mapping, using both a simulated brain dataset, as well as real scanner
+data acquired from a physical phantom and in-vivo data from healthy volunteers.
+The results demonstrate the superiority of our proposed solution over existing
+methods and highlight the effectiveness of our method in real-world scenarios.
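+
+A generic PyTorch sketch of the unrolled data-consistency plus learned
+regularization pattern the abstract describes; the acquisition operator A and
+its adjoint are placeholders, and the inner nonlinear signal-model fit that
+PINQI additionally performs is omitted here.
+
+import torch
+import torch.nn as nn
+
+class UnrolledReconSketch(nn.Module):
+    def __init__(self, A, A_adj, n_iters=5):
+        super().__init__()
+        self.A, self.A_adj, self.n_iters = A, A_adj, n_iters
+        self.reg = nn.ModuleList(nn.Conv2d(1, 1, 3, padding=1)
+                                 for _ in range(n_iters))   # toy regularizers
+        self.step = nn.Parameter(torch.full((n_iters,), 0.1))
+
+    def forward(self, y, x0):
+        x = x0
+        for k in range(self.n_iters):
+            x = x - self.step[k] * self.A_adj(self.A(x) - y)   # data consistency
+            x = x + self.reg[k](x)                             # learned regularization
+        return x
+
+# toy usage with identity "acquisition" operators
+model = UnrolledReconSketch(A=lambda x: x, A_adj=lambda r: r)
+x_hat = model(torch.randn(1, 1, 32, 32), torch.zeros(1, 1, 32, 32))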
+
+
+
+ comment: This work has been accepted for publication in IEEE Transactions on
+ Computational Imaging. Changes were made to this version by the publisher
+ before publication. IEEE Transactions on Computational Imaging (2024)
+
+
+
+ + ♻ ☆ Domain-Specific Block Selection and Paired-View Pseudo-Labeling for + Online Test-Time Adaptation CVPR 2024 +
+ +
+ Test-time adaptation (TTA) aims to adapt a pre-trained model to a new test
+domain without access to source data after deployment. Existing approaches
+typically rely on self-training with pseudo-labels since ground-truth cannot be
+obtained from test data. Although the quality of pseudo labels is important for
+stable and accurate long-term adaptation, it has not been previously addressed.
+In this work, we propose DPLOT, a simple yet effective TTA framework that
+consists of two components: (1) domain-specific block selection and (2)
+pseudo-label generation using paired-view images. Specifically, we select
+blocks that involve domain-specific feature extraction and train these blocks
+by entropy minimization. After blocks are adjusted for current test domain, we
+generate pseudo-labels by averaging given test images and corresponding flipped
+counterparts. By simply using flip augmentation, we prevent a decrease in the
+quality of the pseudo-labels, which can be caused by the domain gap resulting
+from strong augmentation. Our experimental results demonstrate that DPLOT
+outperforms previous TTA methods in CIFAR10-C, CIFAR100-C, and ImageNet-C
+benchmarks, reducing error by up to 5.4%, 9.1%, and 2.9%, respectively. Also,
+we provide an extensive analysis to demonstrate effectiveness of our framework.
+Code is available at
+https://github.com/gist-ailab/domain-specific-block-selection-and-paired-view-pseudo-labeling-for-online-TTA.
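+
+A minimal PyTorch sketch of the paired-view pseudo-labeling step as we read
+it: average the model's predictions for a test image and its horizontal flip
+and take the argmax; averaging softmax probabilities (rather than anything
+else) is our assumption.
+
+import torch
+import torch.nn.functional as F
+
+@torch.no_grad()
+def paired_view_pseudo_labels(model, x):
+    """x: batch of test images with shape (B, C, H, W)."""
+    p = F.softmax(model(x), dim=1)
+    p_flip = F.softmax(model(torch.flip(x, dims=[3])), dim=1)  # horizontal flip
+    return (0.5 * (p + p_flip)).argmax(dim=1)   # pseudo-labels for self-training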
+
+
+
+ comment: Accepted at CVPR 2024
+
+
+
+ + ♻ ☆ Language Models as Black-Box Optimizers for Vision-Language Models CVPR 2024 +
+
+
+
+
+
+
+
+ Shihong Liu, Zhiqiu Lin, Samuel Yu, Ryan Lee, Tiffany Ling, Deepak Pathak, Deva Ramanan
+
+
+ Vision-language models (VLMs) pre-trained on web-scale datasets have
+demonstrated remarkable capabilities on downstream tasks when fine-tuned with
+minimal data. However, many VLMs rely on proprietary data and are not
+open-source, which restricts the use of white-box approaches for fine-tuning.
+As such, we aim to develop a black-box approach to optimize VLMs through
+natural language prompts, thereby avoiding the need to access model parameters,
+feature embeddings, or even output logits. We propose employing chat-based LLMs
+to search for the best text prompt for VLMs. Specifically, we adopt an
+automatic hill-climbing procedure that converges to an effective prompt by
+evaluating the performance of current prompts and asking LLMs to refine them
+based on textual feedback, all within a conversational process without
+human-in-the-loop. In a challenging 1-shot image classification setup, our
+simple approach surpasses the white-box continuous prompting method (CoOp) by
+an average of 1.5% across 11 datasets including ImageNet. Our approach also
+outperforms both human-engineered and LLM-generated prompts. We highlight the
+advantage of conversational feedback that incorporates both positive and
+negative prompts, suggesting that LLMs can utilize the implicit gradient
+direction in textual feedback for a more efficient search. In addition, we find
+that the text prompts generated through our strategy are not only more
+interpretable but also transfer well across different VLM architectures in a
+black-box manner. Lastly, we apply our framework to optimize the
+state-of-the-art black-box VLM (DALL-E 3) for text-to-image generation, prompt
+inversion, and personalization.
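+
+A structural sketch of the conversational hill-climbing loop described above;
+llm_propose_prompts and evaluate_prompt are hypothetical placeholders for the
+chat-LLM call and the few-shot VLM evaluation, not the paper's code.
+
+def llm_prompt_hill_climb(llm_propose_prompts, evaluate_prompt,
+                          seed_prompt, n_rounds=10, n_candidates=4):
+    history = [(seed_prompt, evaluate_prompt(seed_prompt))]
+    for _ in range(n_rounds):
+        best = max(history, key=lambda h: h[1])
+        worst = min(history, key=lambda h: h[1])
+        # Ask the LLM for refinements, giving positive and negative feedback.
+        candidates = llm_propose_prompts(best=best, worst=worst, n=n_candidates)
+        history += [(p, evaluate_prompt(p)) for p in candidates]
+    return max(history, key=lambda h: h[1])[0]   # best prompt found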
+
+
+
+ comment: Published at CVPR 2024. Project site:
+ https://llm-can-optimize-vlm.github.io/
+
+
+
+ + ♻ ☆ Removal and Selection: Improving RGB-Infrared Object Detection via + Coarse-to-Fine Fusion +
+ +
+ Object detection in visible (RGB) and infrared (IR) images has been widely
+applied in recent years. Leveraging the complementary characteristics of RGB
+and IR images, the object detector provides reliable and robust object
+localization from day to night. Most existing fusion strategies directly input
+RGB and IR images into deep neural networks, leading to inferior detection
+performance, because the RGB and IR features carry modality-specific noise
+that these strategies allow to propagate into and degrade the fused features.
+Inspired by the mechanism of the human brain processing multimodal information,
+in this paper, we introduce a new coarse-to-fine perspective to purify and fuse
+two modality features. Specifically, following this perspective, we design a
+Redundant Spectrum Removal module to coarsely remove interfering information
+within each modality and a Dynamic Feature Selection module to finely select
+the desired features for feature fusion. To verify the effectiveness of the
+coarse-to-fine fusion strategy, we construct a new object detector called the
+Removal and Selection Detector (RSDet). Extensive experiments on three RGB-IR
+object detection datasets verify the superior performance of our method.
+
+
+
+ comment: 11pages, 11figures
+
+
+
+ + ♻ ☆ Interpretable Geoscience Artificial Intelligence (XGeoS-AI): Application + to Demystify Image Recognition +
+ +
+ As Earth science enters the era of big data, artificial intelligence (AI) not
+only offers great potential for solving geoscience problems, but also plays a
+critical role in accelerating the understanding of the complex, interactive,
+and multiscale processes of Earth's behavior. As geoscience AI models are
+progressively utilized for significant predictions in crucial situations,
+geoscience researchers are increasingly demanding their interpretability and
+versatility. This study proposes an interpretable geoscience artificial
+intelligence (XGeoS-AI) framework to unravel the mystery of image recognition
+in the Earth sciences, and its effectiveness and versatility are demonstrated by
+taking computed tomography (CT) image recognition as an example. Inspired by
+the mechanism of human vision, the proposed XGeoS-AI framework generates a
+threshold value from a local region within the whole image to complete the
+recognition. Different kinds of artificial intelligence (AI) methods, such as
+Support Vector Regression (SVR), Multilayer Perceptron (MLP), Convolutional
+Neural Network (CNN), can be adopted as the AI engines of the proposed XGeoS-AI
+framework to efficiently complete geoscience image recognition tasks.
+Experimental results demonstrate that the effectiveness, versatility, and
+heuristics of the proposed framework have great potential in solving geoscience
+image recognition problems. Interpretable AI should receive more and more
+attention in the field of the Earth sciences, which is the key to promoting
+more rational and wider applications of AI in the field of Earth sciences. In
+addition, the proposed interpretable framework may be the forerunner of
+technological innovation in the Earth sciences.
+
+
+
+ comment: there are some errors in the results, and a newer revision is still
+ in preparation
+
+
+
+ + ♻ ☆ Deep Regression Representation Learning with Topology ICML 2024 +
+ +
+ Most works studying representation learning focus only on classification and
+neglect regression. Yet, the learning objectives and therefore the
+representation topologies of the two tasks are fundamentally different:
+classification targets class separation, leading to disconnected
+representations, whereas regression requires ordinality with respect to the
+target, leading to continuous representations. We thus wonder how the
+effectiveness of a regression representation is influenced by its topology,
+with evaluation based on the Information Bottleneck (IB) principle.
+ The IB principle is an important framework that provides guidelines for
+learning effective representations. We establish two connections between it
+and the topology of regression representations. The first connection reveals
+that a lower intrinsic dimension of the feature space implies a reduced
+complexity of the representation Z. This complexity can be quantified as the
+conditional entropy of Z on the target space Y and serves as an upper bound on
+the generalization error. The second connection suggests learning a feature
+space that is topologically similar to the target space will better align with
+the IB principle. Based on these two connections, we introduce PH-Reg, a
+regularizer specific to regression that matches the intrinsic dimension and
+topology of the feature space with the target space. Experiments on synthetic
+and real-world regression tasks demonstrate the benefits of PH-Reg.
+
+
+
+ comment: ICML 2024
+
+
+
+ + ♻ ☆ Comparison of Methods in Skin Pigment Decomposition +
+ +
+ Decomposition of skin pigment plays an important role in medical fields.
+Human skin can be decomposed into two primitive components, hemoglobin and
+melanin. Our goal is to apply these results to the diagnosis of skin cancer. In
+this paper, various methods for skin pigment decomposition are reviewed
+comparatively and the performance of each method is evaluated both
+theoretically and experimentally. In addition, isometric feature mapping
+(Isomap) is introduced in order to improve the dimensionality reduction
+performance in context of skin pigment decomposition.
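+
+A minimal sketch of the classical linear route to hemoglobin/melanin
+separation: independent component analysis on log-RGB (optical-density) skin
+pixels. The reviewed methods differ (ICA, PCA, Isomap), so this ICA-only
+pipeline and its preprocessing are illustrative assumptions.
+
+import numpy as np
+from sklearn.decomposition import FastICA
+
+def decompose_skin(image_rgb):
+    """image_rgb: float array in (0, 1], shape (H, W, 3)."""
+    h, w, _ = image_rgb.shape
+    logc = -np.log(image_rgb.reshape(-1, 3) + 1e-6)    # optical-density space
+    sources = FastICA(n_components=2, random_state=0).fit_transform(logc)
+    return sources.reshape(h, w, 2)                    # ~hemoglobin / ~melanin maps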
+
+
+
+ comment: 5 pages, 7 figures
+
+
+
+ + ♻ ☆ Towards Inclusive Face Recognition Through Synthetic Ethnicity + Alteration +
+
+
+
+
+
+
+
+ Praveen Kumar Chandaliya, Kiran Raja, Raghavendra Ramachandra, Zahid Akhtar, Christoph Busch
+
+
+ Numerous studies have shown that existing Face Recognition Systems (FRS),
+including commercial ones, often exhibit biases toward certain ethnicities due
+to under-represented data. In this work, we explore ethnicity alteration and
+skin tone modification using synthetic face image generation methods to
+increase the diversity of datasets. We conduct a detailed analysis by first
+constructing a balanced face image dataset representing three ethnicities:
+Asian, Black, and Indian. We then make use of existing Generative Adversarial
+Network-based (GAN) image-to-image translation and manifold learning models to
+alter the ethnicity from one to another. A systematic analysis is further
+conducted to assess the suitability of such datasets for FRS by studying the
+realistic skin-tone representation using Individual Typology Angle (ITA).
+Further, we also analyze the quality characteristics using existing Face image
+quality assessment (FIQA) approaches. We then provide a holistic FRS
+performance analysis using four different systems. Our findings pave the way
+for future research works in (i) developing both specific ethnicity and general
+(any to any) ethnicity alteration models, (ii) expanding such approaches to
+create databases with diverse skin tones, (iii) creating datasets representing
+various ethnicities, which can further help mitigate bias while addressing
+privacy concerns.
+
+
+
+ comment: 8 Pages
+
+
+
+ + ♻ ☆ 3DTopia: Large Text-to-3D Generation Model with Hybrid Diffusion Priors +
+
+
+
+
+
+
+
+ Fangzhou Hong, Jiaxiang Tang, Ziang Cao, Min Shi, Tong Wu, Zhaoxi Chen, Shuai Yang, Tengfei Wang, Liang Pan, Dahua Lin, Ziwei Liu
+
+
+ We present a two-stage text-to-3D generation system, namely 3DTopia, which
+generates high-quality general 3D assets within 5 minutes using hybrid
+diffusion priors. The first stage samples from a 3D diffusion prior directly
+learned from 3D data. Specifically, it is powered by a text-conditioned
+tri-plane latent diffusion model, which quickly generates coarse 3D samples for
+fast prototyping. The second stage utilizes 2D diffusion priors to further
+refine the texture of coarse 3D models from the first stage. The refinement
+consists of both latent and pixel space optimization for high-quality texture
+generation. To facilitate the training of the proposed system, we clean and
+caption the largest open-source 3D dataset, Objaverse, by combining the power
+of vision language models and large language models. Experiment results are
+reported qualitatively and quantitatively to show the performance of the
+proposed system. Our codes and models are available at
+https://github.com/3DTopia/3DTopia
+
+
+
+ comment: Code available at https://github.com/3DTopia/3DTopia
+
+
+
+ + ♻ ☆ Automatic Ultrasound Curve Angle Measurement via Affinity Clustering for + Adolescent Idiopathic Scoliosis Evaluation +
+
+
+
+
+
+
+
+ Yihao Zhou, Timothy Tin-Yan Lee, Kelly Ka-Lee Lai, Chonglin Wu, Hin Ting Lau, De Yang, Chui-Yi Chan, Winnie Chiu-Wing Chu, Jack Chun-Yiu Cheng, Tsz-Ping Lam, Yong-Ping Zheng
+
+
+ The current clinical gold standard for evaluating adolescent idiopathic
+scoliosis (AIS) is X-ray radiography, using Cobb angle measurement. However,
+the frequent monitoring of the AIS progression using X-rays poses a challenge
+due to the cumulative radiation exposure. Although 3D ultrasound has been
+validated as a reliable and radiation-free alternative for scoliosis
+assessment, the process of measuring spinal curvature is still carried out
+manually. Consequently, there is a considerable demand for a fully automatic
+system that can locate bony landmarks and perform angle measurements. To this
+end, we introduce an estimation model for automatic ultrasound curve angle
+(UCA) measurement. The model employs a dual-branch network to detect candidate
+landmarks and perform vertebra segmentation on ultrasound coronal images. An
+affinity clustering strategy is utilized within the vertebral segmentation area
+to illustrate the affinity relationship between candidate landmarks.
+Subsequently, we can efficiently perform line delineation from a clustered
+affinity map for UCA measurement. As our method is specifically designed for
+UCA calculation, this method outperforms other state-of-the-art methods for
+landmark and line detection tasks. The high correlation between the automatic
+UCA and Cobb angle (R$^2$=0.858) suggests that our proposed method can
+potentially replace manual UCA measurement in ultrasound scoliosis assessment.
+
+
+
+
+ + ♻ ☆ Synapse: Learning Preferential Concepts from Visual Demonstrations +
+ +
+ This paper addresses the problem of preference learning, which aims to learn
+user-specific preferences (e.g., "good parking spot", "convenient drop-off
+location") from visual input. Despite its similarity to learning factual
+concepts (e.g., "red cube"), preference learning is a fundamentally harder
+problem due to its subjective nature and the paucity of person-specific
+training data. We address this problem using a new framework called Synapse,
+which is a neuro-symbolic approach designed to efficiently learn preferential
+concepts from limited demonstrations. Synapse represents preferences as
+neuro-symbolic programs in a domain-specific language (DSL) that operates over
+images, and leverages a novel combination of visual parsing, large language
+models, and program synthesis to learn programs representing individual
+preferences. We evaluate Synapse through extensive experimentation including a
+user case study focusing on mobility-related concepts in mobile robotics and
+autonomous driving. Our evaluation demonstrates that Synapse significantly
+outperforms existing baselines as well as its own ablations. The code and other
+details can be found on the project website https://amrl.cs.utexas.edu/synapse .
+
+
+
+ comment: 25 pages, 7 tables, 9 figures; Preprint; Updated figures and
+ appendix, added VLM ablations
+
+
+
+ + ♻ ☆ Decodable and Sample Invariant Continuous Object Encoder ICLR2024 +
+ +
+ We propose Hyper-Dimensional Function Encoding (HDFE). Given samples of a
+continuous object (e.g. a function), HDFE produces an explicit vector
+representation of the given object, invariant to the sample distribution and
+density. Sample distribution and density invariance enables HDFE to
+consistently encode continuous objects regardless of their sampling, and
+therefore allows neural networks to receive continuous objects as inputs for
+machine learning tasks, such as classification and regression. Besides, HDFE
+does not require any training and is proved to map the object into an organized
+embedding space, which facilitates the training of the downstream tasks. In
+addition, the encoding is decodable, which enables neural networks to regress
+continuous objects by regressing their encodings. Therefore, HDFE serves as an
+interface for processing continuous objects.
+ We apply HDFE to function-to-function mapping, where vanilla HDFE achieves
+competitive performance as the state-of-the-art algorithm. We apply HDFE to
+point cloud surface normal estimation, where a simple replacement from PointNet
+to HDFE leads to immediate 12% and 15% error reductions in two benchmarks. In
+addition, by integrating HDFE into the PointNet-based SOTA network, we improve
+the SOTA baseline by 2.5% and 1.7% in the same benchmarks.
+
+
+
+ comment: ICLR2024 Conference Paper
+
+
+
+ + ♻ ☆ A Linear Time and Space Local Point Cloud Geometry Encoder via + Vectorized Kernel Mixture (VecKM) ICML2024 +
+ +
+ We propose VecKM, a local point cloud geometry encoder that is descriptive
+and efficient to compute. VecKM leverages a unique approach by vectorizing a
+kernel mixture to represent the local point cloud. Such representation's
+descriptiveness is supported by two theorems that validate its ability to
+reconstruct and preserve the similarity of the local shape. Unlike existing
+encoders downsampling the local point cloud, VecKM constructs the local
+geometry encoding using all neighboring points, producing a more descriptive
+encoding. Moreover, VecKM is efficient to compute and scalable to large point
+cloud inputs: VecKM reduces the memory cost from $(n^2+nKd)$ to $(nd+np)$; and
+reduces the major runtime cost from computing $nK$ MLPs to $n$ MLPs, where $n$
+is the size of the point cloud, $K$ is the neighborhood size, $d$ is the
+encoding dimension, and $p$ is a marginal factor. The efficiency is due to
+VecKM's unique factorizable property that eliminates the need of explicitly
+grouping points into neighbors. In the normal estimation task, VecKM
+demonstrates not only 100x faster inference speed but also highest accuracy and
+strongest robustness. In classification and segmentation tasks, integrating
+VecKM as a preprocessing module achieves consistently better performance than
+the PointNet, PointNet++, and point transformer baselines, and runs
+consistently faster by up to 10 times.
+
+
+
+ comment: ICML2024 Conference Paper
+
+
+
+ + ♻ ☆ Adaptive Guidance Learning for Camouflaged Object Detection +
+ +
+ Camouflaged object detection (COD) aims to segment objects visually embedded
+in their surroundings, which is a very challenging task due to the high
+similarity between the objects and the background. To address it, most methods
+often incorporate additional information (e.g., boundary, texture, and
+frequency clues) to guide feature learning for better detecting camouflaged
+objects from the background. Although progress has been made, these methods are
+basically individually tailored to specific auxiliary cues, thus lacking
+adaptability and not consistently achieving high segmentation performance. To
+this end, this paper proposes an adaptive guidance learning network, dubbed
+AGLNet, which is a unified end-to-end learnable model for exploring
+and adapting different additional cues in CNN models to guide accurate
+camouflaged feature learning. Specifically, we first design a straightforward
+additional information generation (AIG) module to learn additional camouflaged
+object cues, which can be adapted for the exploration of effective camouflaged
+features. Then we present a hierarchical feature combination (HFC) module to
+deeply integrate additional cues and image features to guide camouflaged
+feature learning in a multi-level fusion manner. A recalibration decoder (RD)
+then further aggregates and refines the features for accurate object
+prediction. Extensive experiments on three widely used COD
+benchmark datasets demonstrate that the proposed method achieves significant
+performance improvements under different additional cues, and outperforms 20
+recent state-of-the-art methods by a large margin. Our code will be made
+publicly available at: https://github.com/ZNan-Chen/AGLNet.
+
+
+
+
+ + ♻ ☆ Skip \n: A Simple Method to Reduce Hallucination in Large + Vision-Language Models +
+ +
+ Recent advancements in large vision-language models (LVLMs) have demonstrated
+impressive capability in visual information understanding with human language.
+Despite these advances, LVLMs still face challenges with multimodal
+hallucination, such as generating text descriptions of objects that are not
+present in the visual information. However, the underlying fundamental reasons
+of multimodal hallucinations remain poorly explored. In this paper, we propose
+a new perspective, suggesting that the inherent biases in LVLMs might be a key
+factor in hallucinations. Specifically, we systematically identify a semantic
+shift bias related to paragraph breaks (\n\n), where the content before and
+after '\n\n' in the training data frequently exhibit significant semantic
+changes. This pattern leads the model to infer that the content following
+'\n\n' should differ markedly from the preceding, less hallucinatory content,
+thereby increasing the probability of hallucinatory descriptions after the
+'\n\n'. We have validated this hypothesis on
+multiple publicly available LVLMs. Besides, we find that deliberately inserting
+'\n\n' at the generated description can induce more hallucinations. A simple
+method is proposed to effectively mitigate the hallucination of LVLMs by
+skipping the output of '\n'.
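+
+A hedged sketch of the mitigation idea with Hugging Face transformers: a
+logits processor that suppresses newline tokens during generation so the model
+cannot emit the '\n\n' break; the paper's exact skipping mechanism may differ
+from this.
+
+from transformers import LogitsProcessor, LogitsProcessorList
+
+class SuppressNewlineLogits(LogitsProcessor):
+    def __init__(self, tokenizer):
+        # Collect every vocabulary id whose decoded text contains a newline
+        # (slow but simple; one could also ban only the '\n\n' token).
+        self.banned = [i for i in range(len(tokenizer))
+                       if "\n" in tokenizer.decode([i])]
+
+    def __call__(self, input_ids, scores):
+        scores[:, self.banned] = -float("inf")
+        return scores
+
+# usage:
+# out = model.generate(**inputs,
+#     logits_processor=LogitsProcessorList([SuppressNewlineLogits(tokenizer)]))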
+
+
+
+
+ + ♻ ☆ Selective Prediction for Semantic Segmentation using Post-Hoc Confidence + Estimation and Its Performance under Distribution Shift +
+ +
+ Semantic segmentation plays a crucial role in various computer vision
+applications, yet its efficacy is often hindered by the lack of high-quality
+labeled data. To address this challenge, a common strategy is to leverage
+models trained on data from different populations, such as publicly available
+datasets. This approach, however, leads to the distribution shift problem,
+presenting a reduced performance on the population of interest. In scenarios
+where model errors can have significant consequences, selective prediction
+methods offer a means to mitigate risks and reduce reliance on expert
+supervision. This paper investigates selective prediction for semantic
+segmentation in low-resource settings, thus focusing on post-hoc confidence
+estimators applied to pre-trained models operating under distribution shift. We
+propose a novel image-level confidence measure tailored for semantic
+segmentation and demonstrate its effectiveness through experiments on three
+medical imaging tasks. Our findings show that post-hoc confidence estimators
+offer a cost-effective approach to reducing the impacts of distribution shift.
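+
+A minimal sketch of post-hoc selective prediction for segmentation: score each
+image by the mean of pixel-wise maximum softmax probabilities and abstain
+below a threshold. This mean-max-softmax score is a common baseline used here
+as a stand-in; the paper proposes its own image-level confidence measure.
+
+import torch
+import torch.nn.functional as F
+
+@torch.no_grad()
+def segment_or_abstain(model, x, threshold=0.8):
+    """x: (B, C, H, W). Returns predictions, an accept mask, and confidences."""
+    probs = F.softmax(model(x), dim=1)                       # (B, K, H, W)
+    confidence = probs.max(dim=1).values.mean(dim=(1, 2))    # image-level score
+    accept = confidence >= threshold                         # False -> defer to expert
+    return probs.argmax(dim=1), accept, confidence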
+
+
+